What is a Compiler?
Compiler is a software which can make a executable byte code with a given high level program. Every program has keywords, variables, statements and expressions. What a compiler does is evaluating the meaning of such constructs and making an executable out of it.
Compiling process can be broken into several steps in a very clear way. A typical compiler has two modules
1. Front end - which does the lexical scanning (linear scanning) , syntax analysis (hierarchical scanning) and semantic analysis. It also make intermediate code (Eg: 3AC code), which is a systematic representation of the intended program. Generally this is optimized using heuristics collected so far.
2. Back end - which take the intermediate code and make the target code for the specific processor. Here is another optimization process done. The output is the byte code which is executable.
What is ANTLR (ANother Tool for Language Recognition)?
You do not need to write a lexer and a parser on your own. There are well known parser and lexer generators you can use!
ANTLR (http://www.antlr.org/) is such a recursive descent parser and lexer generator. The specialty with Antlr is that it generates both lexer and parser which are integrated together automatically.
Different plugins for ANTRL is available at
I would like to recommend a video lecture series on this here.
In this show I show how to generate a parser for C MINUS language which is a much reduced form of well-known language C. There are various plugins for ANTLR for different IDEs. One of them is the plugin for Eclipse.
Actually CMinus is a hypothetical programming language for academic purposes. Here is the set of Grammar rules which is used there in BNF form.
To make it a LL (K) grammar which can be parsed by a typical top down parser, it is required to do modifications.
- Remove left recursion
- Remove left factoring
· Remove left recursion:
A grammar rule like A Ã Ac|b is said to have left recursion. The general way to remove it is modifying rules as,
A Ã bA’
A’ Ã cA’
But ANTLR provide a special notation to overcome this problem. It is the cleen's *. Using that notation it is possible to eliminate left recursion.
The above grammar rule means ‘b’ followed by one or more ‘c’ s. It Can be written as
A Ã bC*
Here is a place where that concept has been used.
param_list
:param (COMMAR param)*
;
Removing dangling else problem:
Removing dangling else problem:
To get rid of the famous dangling else problem ‘Syntactic Predicates’ can be used in ANTLR. What Syntactic Predicates does is that it goes to the token level and check if there is a possibility to match the specified keyword.
selection_stmt
: IF OPENSIMPLEBRAKET expression CLOSESIMPLEBRACKET (statement)
( (ELSE)=> ELSE (statement)
| () //nothing
)
;
A Ã Bb | Bcd | b
Here compiler cannot choose between the production rules AÃ Ab and AÃ Acd. When the rule is left factored it looks like,
AÃ BA’ |b
A’Ã b|cd
In ANTLR it is possible to do 3 kinds of grammars.
1. Lexer grammar: input program is tokenized according to these rules.
2. Parser grammar: The output of lexer is fed into the parser by connecting parser grammar with lexer via headers.
3. Combined grammar: The lexer grammar and the parser grammar are written together on the same file.
Following is the lexer grammar I have developed to parse tokens in above proposed C minus language.
---------------------------------------------------------------
lexer grammar C_Minus_Lexer_Grammar;
options {
language = Java;
}
@header {
package CMinus.Core;
}
//special keywords used in C Minus
ELSE : 'else';
IF : 'if';
INT : 'int';
RETURN : 'return';
VOID : 'void';
WHILE : 'while';
//mathematical operations used in C minus
ADDITION : '+';
SUBTRACTION : '-';
MUTIP : '*';
DIVISION : '/';
//Comparators used in C Minus
LESSTHAN : '<';
LESSTHANOEQUAL : '<=';
GREATERTHAN : '>';
GREATERTHANOEQUAL : '>=';
EQUALSEQUALS : '==';
NOTEQUALS : '!=';
//special characters defined
EQUAL : '=';
SEMICOLON : ';';
COMMAR : ',';
//various types of brackets supported by C Minus
OPENSIMPLEBRAKET : '(';
CLOSESIMPLEBRACKET : ')';
OPENSQUAREBRAKET : '[';
CLOSESQUAREBRAKET : ']';
OPENBRACKETS : '{';
CLOSEBRACKETS : '}';
OPENCOMMENT : '/*';
CLOSECOMMENT : '*/';
// Letters and degits used in C Minus
fragment LETTER : 'a'..'z'|'A'..'Z';
fragment DIGIT : '0'..'9';
NUM : (DIGIT)(DIGIT)*;
IDENT : (LETTER)(LETTER)*;
// Windows uses \r\n as new line. UNIX and Mac OS X use \n.
fragment NEWLINE: ('\r'| '\n')+;
// Send runs of space and tab characters to the hidden channel. Parser will neglect them
WHITESPACE: (' ' | '\t' | NEWLINE )+ { $channel = HIDDEN; };
// Multi-line comments are delimited by /* and */
// and are optionally followed by newline characters.
MULTI_COMMENT options { greedy = false; }
: OPENCOMMENT .* CLOSECOMMENT NEWLINE? { $channel = HIDDEN;};
-----------------------------------------------------------------------
The parser grammar is depicted below. It uses syntactic predicates described above. Thus if one is checking the generated parse tree with Antlr he or she should use debug mode in Antlr works. That is because eclipse plugin interpreter does not support parsing syntactic predicates. In Antlr works red branches shows where the syntactic predicate was not matched and green branches shows where they have matched.
-----------------------------------------------------------------------
parser grammar C_Minus_Parser_Grammar;
options {
language = Java;
tokenVocab = C_Minus_Lexer_Grammar;
output=AST;
ASTLabelType=CommonTree;
}
@header {
package CMinus.Core;
}
program
:declaration_list
;
declaration_list
: (declaration)+
;
declaration
:var_declaration | fun_declaration
;
var_declaration
:type_specifier IDENT SEMICOLON | type_specifier IDENT OPENSQUAREBRAKET NUM CLOSESQUAREBRAKET SEMICOLON
;
type_specifier
: INT | VOID
;
fun_declaration
: type_specifier IDENT OPENSIMPLEBRAKET params CLOSESIMPLEBRACKET compound_stmt
;
params
:param_list | VOID
;
param_list
:param (COMMAR param)*
;
param
: type_specifier IDENT | type_specifier IDENT OPENSQUAREBRAKET CLOSESQUAREBRAKET
;
compound_stmt
: OPENBRACKETS local_declarations statement_list CLOSEBRACKETS
;
local_declarations
:(var_declaration)*
;
statement_list
: (statement)*
;
statement
: expression_stmt | compound_stmt | selection_stmt | iteration_stmt | return_stmt
;
expression_stmt
: expression SEMICOLON | SEMICOLON
;
// dangling else problem can be solved using left factoring, applying syntatic predicates
selection_stmt
: IF OPENSIMPLEBRAKET expression CLOSESIMPLEBRACKET (statement)
( (ELSE)=> ELSE (statement)
| () //nothing
)
;
iteration_stmt
: WHILE OPENSIMPLEBRAKET expression CLOSESIMPLEBRACKET statement
;
return_stmt
: RETURN SEMICOLON | RETURN expression SEMICOLON
;
//expression
// : var EQUAL expression | simple_expression
// ;
expression
: (var EQUAL ) => var EQUAL expression | simple_expression
;
// expression
// : (simple_expression) (EQUAL simple_expression)*
// ;
var
: IDENT | IDENT OPENSQUAREBRAKET expression CLOSESQUAREBRAKET
;
simple_expression
: additive_expression (relop additive_expression)?
;
relop
:LESSTHANOEQUAL | LESSTHAN | GREATERTHAN | GREATERTHANOEQUAL | EQUALSEQUALS |NOTEQUALS
;
additive_expression
:term (addop term)*
;
addop
: ADDITION | SUBTRACTION
;
term
:factor(mulop factor)*
;
mulop
: MUTIP | DIVISION
;
factor
: OPENSIMPLEBRAKET expression CLOSESIMPLEBRACKET | var | call | NUM
;
call
: IDENT OPENSIMPLEBRAKET args CLOSESIMPLEBRACKET
;
args
: arg_list | () //nothing
;
arg_list
:expression(COMMAR expression)*
;
----------------------------------------------------------------------
Using following code one can optionally check the output of the lexer
CharStream charStream = new ANTLRStringStream(input);
C_Minus_Lexer_Grammar lexer = new C_Minus_Lexer_Grammar(charStream);
TokenStream tokenInputStream = new CommonTokenStream(lexer);