# Grammars and Parsers

Let’s build a parser to math expressions. It’s better start the job in very a simple way: process addition expressions like ‘1 + 1’.

Well… As the old Chinese saying goes, “a journey of one thousand miles begins with one tiny step”.

### A Little Bit About Parsers

According to Grune and Jacobs:

“Parsing is the process of structuring a linear representation in accordance with a given grammar. [..] This ‘linear representation’ may be a sentence, a computer program, [..]  in short any linear sequence in which the preceding elements in some way restrict the next element”[1, p. 3].

“To a computer scientist ‘1 + 2 * 3’ is a sentence in the language of ‘arithmetics on single digits’ [..], its structure can be shown, for instance, by inserting parentheses: (1 + (2 * 3)) and its semantics [i.e. its meaning] is probably 7”[1, p. 7].

We use a parser to build an expression tree. This tree contains the elements of a sentence according with the structure of a given grammar and in the correct order (afterwards we will need an interpreter that uses the tree to extract the semantic of the sentence).

And in order to implement a parser, we have to:

1. Define a grammar to the sentences the parser will process;
2. Decide which kind of parser to implement;
3. Implement the parser to build the expression tree;
4. And finally interpret the expression tree and get the result (there will be another post showing one way of doing that).

### 1) Defining The Grammar

What we’re trying to do here is to describe a possible infinite set of sequence of symbols (the language) through the grammar (a finite recipe).”A formal grammar is a set of rules of a specific kind for forming strings in a [..] language” . The grammar to generate single digits sentences (addition only) like 9 + 2 + 3 may be:

```Exp -> Add | Int
Int -> 0 | 1 | ... | 9
```

This small grammar represents an infinite set of sentences like 2 + 2 and 7 + 4 + 6 + 4. But sentences like +1, 1 ++ 3 or even 44 and 3 * 3 are not allowed. (This is a Type 2 grammar from the Chomsky hierarchy, also called a context-free grammar, which requires only one nonterminal on the left-hand side.)

The `->` in the notation means that you can substitute the symbols on the left-hand side with the ones on the right-hand side; the | can be read as “or else”. The grammars are composed by terminals and nonterminals symbols.

Terminals “are the elementary symbols of the language defined by a formal grammar”; in our case, the digits `0`, `1`, `2`, `3`, …, `9`, and `+`.

Nonterminals “are the symbols which can be replaced”. The grammar definition above states that:

• `Exp` can be replaced by `Add` or `Int`, both nonterminals as well;
• `Add` may be replaced by another `Add` followed by `+` and `Int`, or just by an `Int;`
• `Int` must be replaced by single digits from `0` to `9`.

Follows the derivation of the sentence 9 + 2 + 3 from the grammar we just defined:

 Intermediate Form Application of Rule… Exp The start symbol Add Exp -> Add Add + Int Add -> Add + Int Add + Int + Int Add -> Add + Int Int + Int + Int Add -> Int 9 + Int + Int Int -> 9 9 + 2 + Int Int -> 2 9 + 2 + 3 Int -> 3

### 2) Deciding Which Kind of Parser To Implement

The idea is to have a table-driven parser since this kind of parser can return an abstract syntax tree and is very well know by the computer science community. Two of the most common table-driven parsers are the LL and LR parsers.

Although a LR parser is more powerful and flexible, capable of handling much more languages than a LL parser, with better error handling and detection, it’s more difficult to build a table for it, which is often done by a parser generator. The LL is good enough for our purposes since it is easier to implement and will do the job.

#### LL(1) Grammar

The parser we’ll implement is more specifically called a LL(1) parser because it will use only one token of lookahead when parsing a sentence. A LL(1) parser needs a LL(1) grammar.

Basically, a context-free grammar without left recursion is a LL(1) grammar*. This restriction aim to avoid conflicts and therefore backtracking, which means “select the correct rule so that it constructs the tree without having to go back and try a different rule if the one it selects is not part of the derivation of the input sentence”.

#### Eliminating Left Recursion from A Context-Free Grammar

Our grammar has a left recursion on the second rule that must be eliminated.

`Add -> Add + Int | Int`

If `Add` is derived to `Add + Int`, there’s still the same non-terminal `Add` on the left and this situation certainly will lead to an infinite recursion. Eliminating left recursion means:

1) For every rule that exist a left recursion (`Add`), move the non-recursive part of this rule (`+ Int`) to a new rule called `Add'`.

`Add' -> + Int`

2) Put `Add'` at the end of this new rule.

`Add' -> + Int Add'`

3) Place a special symbol called epsilon, which means nothing, as an alternative substitution.

`Add' -> + Int Add' | epsilon`

4) Then fix first rule `Exp -> Add | Int` removing `Add` and placing `Add'` in front of `Int`.

`Exp -> Int Add'`

Here’s our grammar for addition operations converted into a LL(1) grammar:

```Exp -> Int Add'
Int -> 0 | 1 | ... | 9
```

Observe that there’s no need to change the `Int` production rule because it isn’t left recursive. Get rid of left recursion in this grammar is sufficient since it does not contains left common factors.

### 3) Implementing the Parser

Before starting coding the LL(1) parser, we have to define its table.

#### What’s this Table that Drives the Parser? The LL(1) Parser Table

A parser table is a lookup table that parsers use to know the next action to take, i.e. how to derive the next token of the input string. Every lookup table has a key associated with a value. In a LL(1) parser table, the key is formed by the terminals on the top and nonterminals on the left.

 `+` `0` `1` `2` `...` `9` `\$` `Exp` `Add'` `Int`

The terminals of our grammar are `+`, the digits from `0` to nine and `\$` (end-of-input). There’s also the special symbol `epsilon` (empty). The nonterminals are `Exp`, `Add'` and `Int`.

We’ll fill this table using two functions called First and Follow on every nonterminal of the grammar.

#### The First Function

The First function evaluates which terminals can appear first when deriving a nonterminal.The result of `First(Exp)` is `First(Int)` because the right side (the production rule) for the start symbol `Exp` is `Int Add'`. The first terminal that appears when deriving `Add` can only be the same as `First(Int)`.

`First(Exp) => First(Int)`

The derivation rule for `Add'` is `+ Int Add' | epsilon`, so the first function value in this case is ‘+’ or ‘epsilon’.

`First(Add') = { + , epsilon }`

The digits from zero to nine are the terminals that appear first when deriving the nonterminal `Int`:

`First(Int) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }`

Therefore we have:

`First(Exp) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }`
``` First(Add') = { + , epsilon } First(Int) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }```

The Follow function for a given nonterminal reveals which terminals can appear after the derivation of this same nonterminal. It is difficult to see the value for the `Follow(Exp)` function, thus we create an imaginary production rule to help us see which terminals can appear immediately after the derivation of `Add`:

`S -> Exp \$`

This imaginary newly created start rule `S` derive the original start production rule `Exp` followed by the end-of-input symbol `\$`, so:

`Follow(Exp) = { \$ }`

`Follow(Add')` is a little trickier. It may also seem difficult to see what follows `Add'` once there isn’t a nonterminal after the `Add'` in the rules of the grammar. However, consider that `Add'` can be derived from `Exp -> Int Add'`, so whatever comes after `Add'` it comes after `Exp` too. Then we have:

```Follow(Add') => Follow(Exp) ```

Finally, the value of `Follow(Int)` is `First(Add')` whereas `Int` is always followed for `Add'`:

`Follow(Int) = { + , epsilon }`

Ops! A value `epsilon` (empty) for a Follow function doesn’t exist; after all at least an end-of-input (`\$`) must follow in the sentence. This `epsilon` came from `First(Add')`. If `Add' -> epsilon`, then the terminal that follows `Int` can only be what follows `Add'`, that is the `\$` symbol. Then:

`Follow(Int) = { + , \$ }`

The Follow function application in all nonterminals results in:

```Follow(Exp) = { \$ } Follow(Add') = { \$ } Follow(Int) = { + , \$ } ```

#### Filling the Parser Table

Let’s fill the table with the values of the First and Follow functions .

Take the results of the First function to fill the table and, when the value for a given nonterminal is an `epsilon`, use the result of the Follow function for this same nonterminal.

 `+` `0` `1` `2` `...` `9` `\$` `Exp` – `Exp -> Int Add'` `Exp -> Int Add'` `Exp -> Int Add'` `...` `Exp -> Int Add'` – `Add'` `Add' -> + Int Add'` – – – `...` – `Add' -> epsilon` `Int` – `Int -> 0` `Int -> 1` `Int -> 2` `...` `Int -> 9` –

Kicking off with the first line of the table, `First(Exp)` results in terminals from `0` to `9`, thus we take the intersections of the line `Exp` with the columns from `0` to `9` (the values of `First(Exp)`) and fill them with the production rule of `Exp``Exp -> Int Add'`.

`First(Add')` results in an `epsilon` or `+`. Since there’s no column for `epsilon` in the parser table (`epsilon` isn’t a terminal), the proper value to use is `Follow(Add')`: `\$`.

Proceeding that way will lead to the table above. There’s a software called ParsingEmu that you can use to check your grammar definition, the correctness of a parser table, the values of First and Follow functions and more.

#### Derivation Revisited

The table below shows how the derivation of the sentence 9 + 2 + 3 takes place inside a LL(1) table-driven parser.

 `Input stream` `Stack` `Derivation ` 9 + 2 + 3 \$ `Exp` `Exp -> Int Add' ` 9 + 2 + 3 \$ `Add' Int` `Int -> 9` 9 + 2 + 3 \$ `Add' 9` + 2 + 3 \$ `Add'` `Add' -> + Int Add'` + 2 + 3 \$ `Add' Int +` 2 + 3 \$ `Add' Int` `Int -> 2` 2 + 3 \$ `Add' 2` + 3 \$ `Add'` `Add' -> + Int Add'` + 3 \$ `Add' Int +` 3 \$ `Add' Int` `Int -> 3` 3 \$ `Add' 3` \$ `Add'` Add’ -> epsilon \$ Yesss!! Bananas!

#### Implementation

You can check an example of a parser implementation and its spec. Ptolomeu web page shows the current grammar supported by this parser.

## Notes

* A more strict definition of LL(1) grammars found here is: “A grammar is said to be an LL(1) grammar if for every pair (A,a), of a nonterminal symbol A and a lookahead symbol a, there is at most one choice of a production rule A → α to be made.”

## References

1. Dick Grune, Ceriel Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood Ltd., 1990. ISBN: 0136514316.
2. Formal grammar. (2011, February 3). In Wikipedia, The Free Encyclopedia. Retrieved 01:45, February 8, 2011, from http://en.wikipedia.org/w/index.php?title=Formal_grammar&oldid=411737777
3. Terminal and nonterminal symbols. (2011, January 20). In Wikipedia, The Free Encyclopedia. Retrieved 03:59, February 8, 2011, from http://en.wikipedia.org/w/index.php?title=Terminal_and_nonterminal_symbols&oldid=408996680
4. LR parser. (2010, November 28). In Wikipedia, The Free Encyclopedia. Retrieved 20:25, February 11, 2011, from http://en.wikipedia.org/w/index.php?title=LR_parser&oldid=399248916
5. LL parser. (2010, November 27). In Wikipedia, The Free Encyclopedia. Retrieved 21:01, February 11, 2011, from http://en.wikipedia.org/w/index.php?title=LL_parser&oldid=399215667
6. Predictive Parser. (2005, September 9). In Worcester Polytechnic Institute. Retrieved 21:27, February 11, 2011, from http://web.cs.wpi.edu/~gpollice/cs544-f05/CourseNotes/maps/Class5/PredictiveParsers.html
7. LL (1) Parsing. In Microsoft Research. Retrieved o4:44, February 12, 2011, from http://research.microsoft.com/en-us/um/people/abegel/cs164/ll1.html