A Tiny Introduction to Parsers


A Question

The other day, someone asked how parsers worked on The Programmer's Hangout Discord server. Here's an an abridged version of my explanation.

My Response

A parser simply takes a stream of data and from it produces a datastructure. This stream is usually a stream of tokens, which are groups of tagged delimiters and data used to construct the datastructure.

Let's assume you're building a parser for a programming language; the parser converts a file to an abstract syntax tree. We start with a file; in this case, it's AverageScript:

for i in [1, 2, 3] {
    print(i)
}

And lex it into a series of tokens by walking over the file linearly. So, for instance:

for: keyword
i:   identifier
in:  keyword
[:   delimiter
1:   number
,:   separator
2:   number
     ... snip ...
}:   delimiter

This is the stream of tokens that will be fed into the parser to form the ast. So, how exactly does a parser work?

There are quite a few different kinds of parsers, but they all essentially reduce to the same set of steps:

Most parsers are recursive for this reason; for example, in most languages, a function call can contain another function call:

foo(bar(1, 3))

Instead of manually writing out cases for nesting, after we determine we're looking at a function, we simply parse the contents of foo(...) according to some pre-defined rule.

So returning to our original case: how might a parser parse the for i in ...?

Assuming some form of a recursive descent parser:

  1. The parser sees for and determines a for loop is being parsed; In this made-up language, a for loop has three components, an identifier, an iterator, and a body.
  2. It takes the next token, i and checks that it's an identifier.
  3. It takes the next token, in and makes sure it's the in keyword.
  4. It recursively evaluates the next series of tokens as an expression
    1. It sees [ so it starts parsing as a list
    2. It sees 2 and adds it to said list
    3. And so on until the entire list is parsed
  5. It checks for { delimiting a for body
  6. It recursively parses the next few tokens as a block of code
    1. and so on until the entire code block is parsed
  7. It checks for }, ending the for loop
  8. It constructs a for node and returns it.

The end of this process might result in an ast that looks like so:

For {
    identifier: i,
    expression: List {
        items: [1, 2, 3]
    },
    body: Code {
        statements: [
            Function {
                identifier: print,
                arguments: [i],
            },
        ],
    },
}

This ast can then be walked, interpreted, compiled, etc. There exist common formats to describe parsers; one such format is Extended Backus-Naur Form. Many parsing techniques exist; what I described models a ll(1) parser. Many parsing models exist; to learn more about them, I recommend you check the Parsing Wikipedia page.

So, to summarize:

I hope this helped clear some things up 🙂.