Let’s build a parser to math expressions. It’s better start the job in very a simple way: process addition expressions like ‘1 + 1’.
Well… As the old Chinese saying goes, “a journey of one thousand miles begins with one tiny step”.
A Little Bit About Parsers
According to Grune and Jacobs^{[1]:}
“Parsing is the process of structuring a linear representation in accordance with a given grammar. [..] This ‘linear representation’ may be a sentence, a computer program, [..] in short any linear sequence in which the preceding elements in some way restrict the next element”^{[1, p. 3]}.
“To a computer scientist ‘1 + 2 * 3’ is a sentence in the language of ‘arithmetics on single digits’ [..], its structure can be shown, for instance, by inserting parentheses: (1 + (2 * 3)) and its semantics [i.e. its meaning] is probably 7”^{[1, p. 7]}.
We use a parser to build an expression tree. This tree contains the elements of a sentence according with the structure of a given grammar and in the correct order (afterwards we will need an interpreter that uses the tree to extract the semantic of the sentence).
And in order to implement a parser, we have to:
 Define a grammar to the sentences the parser will process;
 Decide which kind of parser to implement;
 Implement the parser to build the expression tree;
 And finally interpret the expression tree and get the result (there will be another post showing one way of doing that).
1) Defining The Grammar
What we’re trying to do here is to describe a possible infinite set of sequence of symbols (the language) through the grammar (a finite recipe).”A formal grammar is a set of rules of a specific kind for forming strings in a [..] language” ^{[2]}. The grammar to generate single digits sentences (addition only) like 9 + 2 + 3 may be:
Exp > Add  Int Add > Add + Int  Int Int > 0  1  ...  9
This small grammar represents an infinite set of sentences like 2 + 2 and 7 + 4 + 6 + 4. But sentences like +1, 1 ++ 3 or even 44 and 3 * 3 are not allowed. (This is a Type 2 grammar from the Chomsky hierarchy, also called a contextfree grammar, which requires only one nonterminal on the lefthand side.)
The >
in the notation means that you can substitute the symbols on the lefthand side with the ones on the righthand side; the  can be read as “or else”. The grammars are composed by terminals and nonterminals symbols^{[1]}.
Terminals “are the elementary symbols of the language defined by a formal grammar”^{[3]}; in our case, the digits 0
, 1
, 2
, 3
, …, 9
, and +
.
Nonterminals “are the symbols which can be replaced”.^{[3]} The grammar definition above states that:
Exp
can be replaced byAdd
orInt
, both nonterminals as well;Add
may be replaced by anotherAdd
followed by+
andInt
, or just by anInt;
Int
must be replaced by single digits from0
to9
.
Follows the derivation of the sentence 9 + 2 + 3 from the grammar we just defined:
Intermediate Form  Application of Rule… 
Exp  The start symbol 
Add  Exp > Add 
Add + Int  Add > Add + Int 
Add + Int + Int  Add > Add + Int 
Int + Int + Int  Add > Int 
9 + Int + Int  Int > 9 
9 + 2 + Int  Int > 2 
9 + 2 + 3  Int > 3 
2) Deciding Which Kind of Parser To Implement
The idea is to have a tabledriven parser since this kind of parser can return an abstract syntax tree and is very well know by the computer science community. Two of the most common tabledriven parsers are the LL and LR parsers.
Although a LR parser is more powerful and flexible, capable of handling much more languages than a LL parser, with better error handling and detection, it’s more difficult to build a table for it, which is often done by a parser generator^{[4]}. The LL is good enough for our purposes since it is easier to implement and will do the job.
LL(1) Grammar
The parser we’ll implement is more specifically called a LL(1) parser because it will use only one token of lookahead when parsing a sentence^{[5]}. A LL(1) parser needs a LL(1) grammar.
Basically, a contextfree grammar without left recursion is a LL(1) grammar*. This restriction aim to avoid conflicts and therefore backtracking, which means “select the correct rule so that it constructs the tree without having to go back and try a different rule if the one it selects is not part of the derivation of the input sentence”^{[6]}.
Eliminating Left Recursion from A ContextFree Grammar^{[7]}
Our grammar has a left recursion on the second rule that must be eliminated.
Add > Add + Int  Int
If Add
is derived to Add + Int
, there’s still the same nonterminal Add
on the left and this situation certainly will lead to an infinite recursion. Eliminating left recursion means[7]:
1) For every rule that exist a left recursion (Add
), move the nonrecursive part of this rule (+ Int
) to a new rule called Add'
.
Add' > + Int
2) Put Add'
at the end of this new rule.
Add' > + Int Add'
3) Place a special symbol called epsilon, which means nothing, as an alternative substitution.
Add' > + Int Add'  epsilon
4) Then fix first rule Exp > Add  Int
removing Add
and placing Add'
in front of Int
.
Exp > Int Add'
Here’s our grammar for addition operations converted into a LL(1) grammar:
Exp > Int Add' Add' > + Int Add'  epsilon Int > 0  1  ...  9
Observe that there’s no need to change the Int
production rule because it isn’t left recursive. Get rid of left recursion in this grammar is sufficient since it does not contains left common factors.
3) Implementing the Parser
Before starting coding the LL(1) parser, we have to define its table.
What’s this Table that Drives the Parser? The LL(1) Parser Table^{[7]}
A parser table is a lookup table that parsers use to know the next action to take, i.e. how to derive the next token of the input string. Every lookup table has a key associated with a value. In a LL(1) parser table, the key is formed by the terminals on the top and nonterminals on the left.
+ 
0 
1 
2 
... 
9 
$ 

Exp 

Add' 

Int 
The terminals of our grammar are +
, the digits from 0
to nine and $
(endofinput). There’s also the special symbol epsilon
(empty). The nonterminals are Exp
, Add'
and Int
.
We’ll fill this table using two functions called First and Follow on every nonterminal of the grammar.
The First Function
The First function evaluates which terminals can appear first when deriving a nonterminal.The result of First(Exp)
is First(Int)
because the right side (the production rule) for the start symbol Exp
is Int Add'
. The first terminal that appears when deriving Add
can only be the same as First(Int)
.
First(Exp) => First(Int)
The derivation rule for Add'
is + Int Add'  epsilon
, so the first function value in this case is ‘+’ or ‘epsilon’.
First(Add') = { + , epsilon }
The digits from zero to nine are the terminals that appear first when deriving the nonterminal Int
:
First(Int) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }
Therefore we have:
First(Exp) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }
First(Add') = { + , epsilon }
First(Int) = { 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 }
The Follow Function
The Follow function for a given nonterminal reveals which terminals can appear after the derivation of this same nonterminal. It is difficult to see the value for the Follow(Exp)
function, thus we create an imaginary production rule to help us see which terminals can appear immediately after the derivation of Add
:
S > Exp $
This imaginary newly created start rule S
derive the original start production rule Exp
followed by the endofinput symbol $
, so:
Follow(Exp) = { $ }
Follow(Add')
is a little trickier. It may also seem difficult to see what follows Add'
once there isn’t a nonterminal after the Add'
in the rules of the grammar. However, consider that Add'
can be derived from Exp > Int Add'
, so whatever comes after Add'
it comes after Exp
too. Then we have:
Follow(Add') => Follow(Exp)
Finally, the value of Follow(Int)
is First(Add')
whereas Int
is always followed for Add'
:
Follow(Int) = { + , epsilon }
Ops! A value epsilon
(empty) for a Follow function doesn’t exist; after all at least an endofinput ($
) must follow in the sentence. This epsilon
came from First(Add')
. If Add' > epsilon
, then the terminal that follows Int
can only be what follows Add'
, that is the $
symbol. Then:
Follow(Int) = { + , $ }
The Follow function application in all nonterminals results in:
Follow(Exp) = { $ }
Follow(Add') = { $ }
Follow(Int) = { + , $ }
Filling the Parser Table
Let’s fill the table with the values of the First and Follow functions .
Take the results of the First function to fill the table and, when the value for a given nonterminal is an epsilon
, use the result of the Follow function for this same nonterminal.
+ 
0 
1 
2 
... 
9 
$ 

Exp 
–  Exp > Int Add' 
Exp > Int Add' 
Exp > Int Add' 
... 
Exp > Int Add' 
– 
Add' 
Add' > + Int Add' 
–  –  –  ... 
–  Add' > epsilon 
Int 
–  Int > 0 
Int > 1 
Int > 2 
... 
Int > 9 
– 
Kicking off with the first line of the table, First(Exp)
results in terminals from 0
to 9
, thus we take the intersections of the line Exp
with the columns from 0
to 9
(the values of First(Exp)
) and fill them with the production rule of Exp
: Exp > Int Add'
.
First(Add')
results in an epsilon
or +
. Since there’s no column for epsilon
in the parser table (epsilon
isn’t a terminal), the proper value to use is Follow(Add')
: $
.
Proceeding that way will lead to the table above. There’s a software called ParsingEmu that you can use to check your grammar definition, the correctness of a parser table, the values of First and Follow functions and more.
Derivation Revisited
The table below shows how the derivation of the sentence 9 + 2 + 3 takes place inside a LL(1) tabledriven parser.
Input stream 
Stack 
Derivation 
9 + 2 + 3 $  Exp 
Exp > Int Add' 
9 + 2 + 3 $  Add' Int 
Int > 9 
9 + 2 + 3 $  Add' 9 

+ 2 + 3 $  Add' 
Add' > + Int Add' 
+ 2 + 3 $  Add' Int + 

2 + 3 $  Add' Int 
Int > 2 
2 + 3 $  Add' 2 

+ 3 $  Add' 
Add' > + Int Add' 
+ 3 $  Add' Int + 

3 $  Add' Int 
Int > 3 
3 $  Add' 3 

$  Add' 
Add’ > epsilon 
$  Yesss!! Bananas! 
Implementation
You can check an example of a parser implementation and its spec. Ptolomeu web page shows the current grammar supported by this parser.
Notes
* A more strict definition of LL(1) grammars found here is: “A grammar is said to be an LL(1) grammar if for every pair (A,a), of a nonterminal symbol A and a lookahead symbol a, there is at most one choice of a production rule A → α to be made.”
References
 Dick Grune, Ceriel Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood Ltd., 1990. ISBN: 0136514316.
 Formal grammar. (2011, February 3). In Wikipedia, The Free Encyclopedia. Retrieved 01:45, February 8, 2011, from http://en.wikipedia.org/w/index.php?title=Formal_grammar&oldid=411737777
 Terminal and nonterminal symbols. (2011, January 20). In Wikipedia, The Free Encyclopedia. Retrieved 03:59, February 8, 2011, from http://en.wikipedia.org/w/index.php?title=Terminal_and_nonterminal_symbols&oldid=408996680
 LR parser. (2010, November 28). In Wikipedia, The Free Encyclopedia. Retrieved 20:25, February 11, 2011, from http://en.wikipedia.org/w/index.php?title=LR_parser&oldid=399248916
 LL parser. (2010, November 27). In Wikipedia, The Free Encyclopedia. Retrieved 21:01, February 11, 2011, from http://en.wikipedia.org/w/index.php?title=LL_parser&oldid=399215667
 Predictive Parser. (2005, September 9). In Worcester Polytechnic Institute. Retrieved 21:27, February 11, 2011, from http://web.cs.wpi.edu/~gpollice/cs544f05/CourseNotes/maps/Class5/PredictiveParsers.html
 LL (1) Parsing. In Microsoft Research. Retrieved o4:44, February 12, 2011, from http://research.microsoft.com/enus/um/people/abegel/cs164/ll1.html