Grammars and Parsers

Let’s build a parser to math expressions. It’s better start the job in very a simple way: process addition expressions like ‘1 + 1’.

Well… As the old Chinese saying goes, “a journey of one thousand miles begins with one tiny step”.

A Little Bit About Parsers

According to Grune and Jacobs[1]:

“Parsing is the process of structuring a linear representation in accordance with a given grammar. [..] This ‘linear representation’ may be a sentence, a computer program, [..]  in short any linear sequence in which the preceding elements in some way restrict the next element”[1, p. 3].

Example of Parsing
Image via Wikipedia

“To a computer scientist ‘1 + 2 * 3’ is a sentence in the language of ‘arithmetics on single digits’ [..], its structure can be shown, for instance, by inserting parentheses: (1 + (2 * 3)) and its semantics [i.e. its meaning] is probably 7”[1, p. 7].

We use a parser to build an expression tree. This tree contains the elements of a sentence according with the structure of a given grammar and in the correct order (afterwards we will need an interpreter that uses the tree to extract the semantic of the sentence).

And in order to implement a parser, we have to:

  1. Define a grammar to the sentences the parser will process;
  2. Decide which kind of parser to implement;
  3. Implement the parser to build the expression tree;
  4. And finally interpret the expression tree and get the result (there will be another post showing one way of doing that).

1) Defining The Grammar

What we’re trying to do here is to describe a possible infinite set of sequence of symbols (the language) through the grammar (a finite recipe).”A formal grammar is a set of rules of a specific kind for forming strings in a [..] language” [2]. The grammar to generate single digits sentences (addition only) like 9 + 2 + 3 may be:

Exp -> Add | Int
Add -> Add + Int | Int
Int -> 0 | 1 | ... | 9

This small grammar represents an infinite set of sentences like 2 + 2 and 7 + 4 + 6 + 4. But sentences like +1, 1 ++ 3 or even 44 and 3 * 3 are not allowed. (This is a Type 2 grammar from the Chomsky hierarchy, also called a context-free grammar, which requires only one nonterminal on the left-hand side.)

The -> in the notation means that you can substitute the symbols on the left-hand side with the ones on the right-hand side; the | can be read as “or else”. The grammars are composed by terminals and nonterminals symbols[1].

Terminals “are the elementary symbols of the language defined by a formal grammar”[3]; in our case, the digits 0, 1, 2, 3, …, 9, and +.

Nonterminals “are the symbols which can be replaced”.[3] The grammar definition above states that:

  • Exp can be replaced by Add or Int, both nonterminals as well;
  • Add may be replaced by another Add followed by + and Int, or just by an Int;
  • Int must be replaced by single digits from 0 to 9.

Follows the derivation of the sentence 9 + 2 + 3 from the grammar we just defined:

Intermediate Form Application of Rule…
Exp The start symbol
Add Exp -> Add
Add + Int Add -> Add + Int
Add + Int + Int Add -> Add + Int
Int + Int + Int Add -> Int
9 + Int + Int Int -> 9
9 + 2 + Int Int -> 2
9 + 2 + 3 Int -> 3

2) Deciding Which Kind of Parser To Implement

The idea is to have a table-driven parser since this kind of parser can return an abstract syntax tree and is very well know by the computer science community. Two of the most common table-driven parsers are the LL and LR parsers.

Although a LR parser is more powerful and flexible, capable of handling much more languages than a LL parser, with better error handling and detection, it’s more difficult to build a table for it, which is often done by a parser generator[4]. The LL is good enough for our purposes since it is easier to implement and will do the job.

LL(1) Grammar

The parser we’ll implement is more specifically called a LL(1) parser because it will use only one token of lookahead when parsing a sentence[5]. A LL(1) parser needs a LL(1) grammar.

Basically, a context-free grammar without left recursion is a LL(1) grammar*. This restriction aim to avoid conflicts and therefore backtracking, which means “select the correct rule so that it constructs the tree without having to go back and try a different rule if the one it selects is not part of the derivation of the input sentence”[6].

Eliminating Left Recursion from A Context-Free Grammar[7]

Our grammar has a left recursion on the second rule that must be eliminated.

Add -> Add + Int | Int

If Add is derived to Add + Int, there’s still the same non-terminal Add on the left and this situation certainly will lead to an infinite recursion. Eliminating left recursion means[7]:

1) For every rule that exist a left recursion (Add), move the non-recursive part of this rule (+ Int) to a new rule called Add'.

Add' -> + Int

2) Put Add' at the end of this new rule.

Add' -> + Int Add'

3) Place a special symbol called epsilon, which means nothing, as an alternative substitution.

Add' -> + Int Add' | epsilon

4) Then fix first rule Exp -> Add | Int removing Add and placing Add' in front of Int.

Exp -> Int Add'

Here’s our grammar for addition operations converted into a LL(1) grammar:

Exp -> Int Add'
Add' -> + Int Add' | epsilon
Int -> 0 | 1 | ... | 9

Observe that there’s no need to change the Int production rule because it isn’t left recursive. Get rid of left recursion in this grammar is sufficient since it does not contains left common factors.

3) Implementing the Parser

Before starting coding the LL(1) parser, we have to define its table.

What’s this Table that Drives the Parser? The LL(1) Parser Table[7]

A parser table is a lookup table that parsers use to know the next action to take, i.e. how to derive the next token of the input string. Every lookup table has a key associated with a value. In a LL(1) parser table, the key is formed by the terminals on the top and nonterminals on the left.

+ 0 1 2 ... 9 $
Exp
Add'
Int

The terminals of our grammar are +, the digits from 0 to nine and $ (end-of-input). There’s also the special symbol epsilon (empty). The nonterminals are Exp, Add' and Int.

We’ll fill this table using two functions called First and Follow on every nonterminal of the grammar.

The First Function

The First function evaluates which terminals can appear first when deriving a nonterminal.The result of First(Exp) is First(Int) because the right side (the production rule) for the start symbol Exp is Int Add'. The first terminal that appears when deriving Add can only be the same as First(Int).

First(Exp) => First(Int)

The derivation rule for Add' is + Int Add' | epsilon, so the first function value in this case is ‘+’ or ‘epsilon’.

First(Add') = { + , epsilon }

The digits from zero to nine are the terminals that appear first when deriving the nonterminal Int:

First(Int) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }

Therefore we have:

First(Exp) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }
First(Add') = { + , epsilon }
First(Int) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }

The Follow Function

The Follow function for a given nonterminal reveals which terminals can appear after the derivation of this same nonterminal. It is difficult to see the value for the Follow(Exp) function, thus we create an imaginary production rule to help us see which terminals can appear immediately after the derivation of Add:

S -> Exp $

This imaginary newly created start rule S derive the original start production rule Exp followed by the end-of-input symbol $, so:

Follow(Exp) = { $ }

Follow(Add') is a little trickier. It may also seem difficult to see what follows Add' once there isn’t a nonterminal after the Add' in the rules of the grammar. However, consider that Add' can be derived from Exp -> Int Add', so whatever comes after Add' it comes after Exp too. Then we have:

Follow(Add') => Follow(Exp)

Finally, the value of Follow(Int) is First(Add') whereas Int is always followed for Add':

Follow(Int) = { + , epsilon }

Ops! A value epsilon (empty) for a Follow function doesn’t exist; after all at least an end-of-input ($) must follow in the sentence. This epsilon came from First(Add'). If Add' -> epsilon, then the terminal that follows Int can only be what follows Add', that is the $ symbol. Then:

Follow(Int) = { + , $ }

The Follow function application in all nonterminals results in:

Follow(Exp) = { $ }
Follow(Add') = { $ }
Follow(Int) = { + , $ }

Filling the Parser Table

Let’s fill the table with the values of the First and Follow functions .

Take the results of the First function to fill the table and, when the value for a given nonterminal is an epsilon, use the result of the Follow function for this same nonterminal.

+ 0 1 2 ... 9 $
Exp Exp -> Int Add' Exp -> Int Add' Exp -> Int Add' ... Exp -> Int Add'
Add' Add' -> + Int Add' ... Add' -> epsilon
Int Int -> 0 Int -> 1 Int -> 2 ... Int -> 9

Kicking off with the first line of the table, First(Exp) results in terminals from 0 to 9, thus we take the intersections of the line Exp with the columns from 0 to 9 (the values of First(Exp)) and fill them with the production rule of ExpExp -> Int Add'.

First(Add') results in an epsilon or +. Since there’s no column for epsilon in the parser table (epsilon isn’t a terminal), the proper value to use is Follow(Add'): $.

Proceeding that way will lead to the table above. There’s a software called ParsingEmu that you can use to check your grammar definition, the correctness of a parser table, the values of First and Follow functions and more.

Derivation Revisited

The table below shows how the derivation of the sentence 9 + 2 + 3 takes place inside a LL(1) table-driven parser.

Input stream Stack Derivation
9 + 2 + 3 $ Exp Exp -> Int Add'
9 + 2 + 3 $ Add' Int Int -> 9
9 + 2 + 3 $ Add' 9
+ 2 + 3 $ Add' Add' -> + Int Add'
+ 2 + 3 $ Add' Int +
2 + 3 $ Add' Int Int -> 2
2 + 3 $ Add' 2
+ 3 $ Add' Add' -> + Int Add'
+ 3 $ Add' Int +
3 $ Add' Int Int -> 3
3 $ Add' 3
$ Add' Add’ -> epsilon
$ Yesss!! Bananas!

Implementation

You can check an example of a parser implementation and its spec. Ptolomeu web page shows the current grammar supported by this parser.

Notes

* A more strict definition of LL(1) grammars found here is: “A grammar is said to be an LL(1) grammar if for every pair (A,a), of a nonterminal symbol A and a lookahead symbol a, there is at most one choice of a production rule A → α to be made.”

References

  1. Dick Grune, Ceriel Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood Ltd., 1990. ISBN: 0136514316.
  2. Formal grammar. (2011, February 3). In Wikipedia, The Free Encyclopedia. Retrieved 01:45, February 8, 2011, from http://en.wikipedia.org/w/index.php?title=Formal_grammar&oldid=411737777
  3. Terminal and nonterminal symbols. (2011, January 20). In Wikipedia, The Free Encyclopedia. Retrieved 03:59, February 8, 2011, from http://en.wikipedia.org/w/index.php?title=Terminal_and_nonterminal_symbols&oldid=408996680
  4. LR parser. (2010, November 28). In Wikipedia, The Free Encyclopedia. Retrieved 20:25, February 11, 2011, from http://en.wikipedia.org/w/index.php?title=LR_parser&oldid=399248916
  5. LL parser. (2010, November 27). In Wikipedia, The Free Encyclopedia. Retrieved 21:01, February 11, 2011, from http://en.wikipedia.org/w/index.php?title=LL_parser&oldid=399215667
  6. Predictive Parser. (2005, September 9). In Worcester Polytechnic Institute. Retrieved 21:27, February 11, 2011, from http://web.cs.wpi.edu/~gpollice/cs544-f05/CourseNotes/maps/Class5/PredictiveParsers.html
  7. LL (1) Parsing. In Microsoft Research. Retrieved o4:44, February 12, 2011, from http://research.microsoft.com/en-us/um/people/abegel/cs164/ll1.html
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s