Parsley Tutorial Part II: Parsing Structured DataΒΆ

Now that you are familiar with the basics of Parsley syntax, let’s look at a more realistic example: a JSON parser.

The JSON spec on http://json.org/ describes the format, and we can adapt its description to a parser. We’ll write the Parsley rules in the same order as the grammar rules in the right sidebar on the JSON site, starting with the top-level rule, ‘object’.

object = ws '{' members:m ws '}' -> dict(m)

Parsley defines a builtin rule ws which consumes any spaces, tabs, or newlines it can.

Since JSON objects are represented in Python as dicts, and dict takes a list of pairs, we need a rule to collect name/value pairs inside an object expression.

members = (pair:first (ws ',' pair)*:rest -> [first] + rest)
          | -> []

This handles the three cases for object contents: one, multiple, or zero pairs. A name/value pair is separated by a colon. We use the builtin rule spaces to consume any whitespace after the colon:

pair = ws string:k ws ':' value:v -> (k, v)

Arrays, similarly, are sequences of array elements, and are represented as Python lists.

array = '[' elements:xs ws ']' -> xs
elements = (value:first (ws ',' value)*:rest -> [first] + rest) | -> []

Values can be any JSON expression.

value = ws (string | number | object | array
           | 'true'  -> True
           | 'false' -> False
           | 'null'  -> None)

Strings are sequences of zero or more characters between double quotes. Of course, we need to deal with escaped characters as well. This rule introduces the operator ~, which does negative lookahead; if the expression following it succeeds, its parse will fail. If the expression fails, the rest of the parse continues. Either way, no input will be consumed.

string = '"' (escapedChar | ~'"' anything)*:c '"' -> ''.join(c)

This is a common pattern, so let’s examine it step by step. This will match leading whitespace and then a double quote character. It then matches zero or more characters. If it’s not an escapedChar (which will start with a backslash), we check to see if it’s a double quote, in which case we want to end the loop. If it’s not a double quote, we match it using the rule anything, which accepts a single character of any kind, and continue. Finally, we match the ending double quote and return the characters in the string. We cannot use the <> syntax in this case because we don’t want a literal slice of the input – we want escape sequences to be replaced with the character they represent.

It’s very common to use ~ for “match until” situations where you want to keep parsing only until an end marker is found. Similarly, ~~ is positive lookahead: it succeed if its expression succeeds but not consume any input.

The escapedChar rule should not be too surprising: we match a backslash then whatever escape code is given.

escapedChar = '\\' (('"' -> '"')    |('\\' -> '\\')
                   |('/' -> '/')    |('b' -> '\b')
                   |('f' -> '\f')   |('n' -> '\n')
                   |('r' -> '\r')   |('t' -> '\t')
                   |('\'' -> '\'')  | escapedUnicode)

Unicode escapes (of the form \u2603) require matching four hex digits, so we use the repetition operator {}, which works like + or * except taking either a {min, max} pair or simply a {number} indicating the exact number of repetitions.

hexdigit = :x ?(x in '0123456789abcdefABCDEF') -> x
escapedUnicode = 'u' <hexdigit{4}>:hs -> unichr(int(hs, 16))

With strings out of the way, we advance to numbers, both integer and floating-point.

number = spaces ('-' | -> ''):sign (intPart:ds (floatPart(sign ds)
                                               | -> int(sign + ds)))

Here we vary from the json.org description a little and move sign handling up into the number rule. We match either an intPart followed by a floatPart or just an intPart by itself.

digit = :x ?(x in '0123456789') -> x
digits = <digit*>
digit1_9 = :x ?(x in '123456789') -> x

intPart = (digit1_9:first digits:rest -> first + rest) | digit
floatPart :sign :ds = <('.' digits exponent?) | exponent>:tail
                     -> float(sign + ds + tail)
exponent = ('e' | 'E') ('+' | '-')? digits

In JSON, multi-digit numbers cannot start with 0 (since that is Javascript’s syntax for octal numbers), so intPart uses digit1_9 to exclude it in the first position.

The floatPart rule takes two parameters, sign and ds. Our number rule passes values for these when it invokes floatPart, letting us avoid duplication of work within the rule. Note that pattern matching on arguments to rules works the same as on the string input to the parser. In this case, we provide no pattern, just a name: :ds is the same as anything:ds.

(Also note that our float rule cheats a little: it does not really parse floating-point numbers, it merely recognizes them and passes them to Python’s float builtin to actually produce the value.)

The full version of this parser and its test cases can be found in the examples directory in the Parsley distribution.