Compiler is a software which can make a executable byte code with a given high level program. Every program has keywords, variables, statements and expressions. What a compiler does is evaluating the meaning of such constructs and making an executable out of it.
Compiling process can be broken into several steps in a very clear way. A typical compiler has two modules
· Remove left recursion:
A grammar rule like A Ã Ac|b is said to have left recursion. The general way to remove it is modifying rules as,
A Ã bA’
A’ Ã cA’
But ANTLR provide a special notation to overcome this problem. It is the cleen's *. Using that notation it is possible to eliminate left recursion.
The above grammar rule means ‘b’ followed by one or more ‘c’ s. It Can be written as
A Ã bC*
Here is a place where that concept has been used.
param_list
:param (COMMAR param)*
;
Removing dangling else problem:
To get rid of the famous dangling else problem ‘Syntactic Predicates’ can be used in ANTLR. What Syntactic Predicates does is that it goes to the token level and check if there is a possibility to match the specified keyword.
selection_stmt
: IF OPENSIMPLEBRAKET expression CLOSESIMPLEBRACKET (statement)
( (ELSE)=> ELSE (statement)
| () //nothing
)
;
Removing left factoring
A Ã Bb | Bcd | b
Here compiler cannot choose between the production rules AÃ Ab and AÃ Acd. When the rule is left factored it looks like,
AÃ BA’ |b
A’Ã b|cd
In ANTLR it is possible to do 3 kinds of grammars.
1. Lexer grammar: input program is tokenized according to these rules.
2. Parser grammar: The output of lexer is fed into the parser by connecting parser grammar with lexer via headers.
3. Combined grammar: The lexer grammar and the parser grammar are written together on the same file.
Following is the lexer grammar I have developed to parse tokens in above proposed C minus language.
---------------------------------------------------------------
lexer grammar C_Minus_Lexer_Grammar;
options {
language = Java;
}
@header {
package CMinus.Core;
}
//special keywords used in C Minus
ELSE : 'else';
IF : 'if';
INT : 'int';
RETURN : 'return';
VOID : 'void';
WHILE : 'while';
//mathematical operations used in C minus
ADDITION : '+';
SUBTRACTION : '-';
MUTIP : '*';
DIVISION : '/';
//Comparators used in C Minus
LESSTHAN : '<';
LESSTHANOEQUAL : '<=';
GREATERTHAN : '>';
GREATERTHANOEQUAL : '>=';
EQUALSEQUALS : '==';
NOTEQUALS : '!=';
//special characters defined
EQUAL : '=';
SEMICOLON : ';';
COMMAR : ',';
//various types of brackets supported by C Minus
OPENSIMPLEBRAKET : '(';
CLOSESIMPLEBRACKET : ')';
OPENSQUAREBRAKET : '[';
CLOSESQUAREBRAKET : ']';
OPENBRACKETS : '{';
CLOSEBRACKETS : '}';
OPENCOMMENT : '/*';
CLOSECOMMENT : '*/';
// Letters and degits used in C Minus
fragment LETTER : 'a'..'z'|'A'..'Z';
fragment DIGIT : '0'..'9';
NUM : (DIGIT)(DIGIT)*;
IDENT : (LETTER)(LETTER)*;
// Windows uses \r\n as new line. UNIX and Mac OS X use \n.
fragment NEWLINE: ('\r'| '\n')+;
// Send runs of space and tab characters to the hidden channel. Parser will neglect them
WHITESPACE: (' ' | '\t' | NEWLINE )+ { $channel = HIDDEN; };
// Multi-line comments are delimited by /* and */
// and are optionally followed by newline characters.
MULTI_COMMENT options { greedy = false; }
: OPENCOMMENT .* CLOSECOMMENT NEWLINE? { $channel = HIDDEN;};
-----------------------------------------------------------------------
The parser grammar is depicted below. It uses syntactic predicates described above. Thus if one is checking the generated parse tree with Antlr he or she should use debug mode in Antlr works. That is because eclipse plugin interpreter does not support parsing syntactic predicates. In Antlr works red branches shows where the syntactic predicate was not matched and green branches shows where they have matched.
-----------------------------------------------------------------------
parser grammar C_Minus_Parser_Grammar;
options {
language = Java;
tokenVocab = C_Minus_Lexer_Grammar;
output=AST;
ASTLabelType=CommonTree;
}
@header {
package CMinus.Core;
}
program
:declaration_list
;
declaration_list
: (declaration)+
;
declaration
:var_declaration | fun_declaration
;
var_declaration
:type_specifier IDENT SEMICOLON | type_specifier IDENT OPENSQUAREBRAKET NUM CLOSESQUAREBRAKET SEMICOLON
;
type_specifier
: INT | VOID
;
fun_declaration
: type_specifier IDENT OPENSIMPLEBRAKET params CLOSESIMPLEBRACKET compound_stmt
;
params
:param_list | VOID
;
param_list
:param (COMMAR param)*
;
param
: type_specifier IDENT | type_specifier IDENT OPENSQUAREBRAKET CLOSESQUAREBRAKET
;
compound_stmt
: OPENBRACKETS local_declarations statement_list CLOSEBRACKETS
;
local_declarations
:(var_declaration)*
;
statement_list
: (statement)*
;
statement
: expression_stmt | compound_stmt | selection_stmt | iteration_stmt | return_stmt
;
expression_stmt
: expression SEMICOLON | SEMICOLON
;
// dangling else problem can be solved using left factoring, applying syntatic predicates
selection_stmt
: IF OPENSIMPLEBRAKET expression CLOSESIMPLEBRACKET (statement)
( (ELSE)=> ELSE (statement)
| () //nothing
)
;
iteration_stmt
: WHILE OPENSIMPLEBRAKET expression CLOSESIMPLEBRACKET statement
;
return_stmt
: RETURN SEMICOLON | RETURN expression SEMICOLON
;
//expression
// : var EQUAL expression | simple_expression
// ;
expression
: (var EQUAL ) => var EQUAL expression | simple_expression
;
// expression
// : (simple_expression) (EQUAL simple_expression)*
// ;
var
: IDENT | IDENT OPENSQUAREBRAKET expression CLOSESQUAREBRAKET
;
simple_expression
: additive_expression (relop additive_expression)?
;
relop
:LESSTHANOEQUAL | LESSTHAN | GREATERTHAN | GREATERTHANOEQUAL | EQUALSEQUALS |NOTEQUALS
;
additive_expression
:term (addop term)*
;
addop
: ADDITION | SUBTRACTION
;
term
:factor(mulop factor)*
;
mulop
: MUTIP | DIVISION
;
factor
: OPENSIMPLEBRAKET expression CLOSESIMPLEBRACKET | var | call | NUM
;
call
: IDENT OPENSIMPLEBRAKET args CLOSESIMPLEBRACKET
;
args
: arg_list | () //nothing
;
arg_list
:expression(COMMAR expression)*
;
----------------------------------------------------------------------
Using following code one can optionally check the output of the lexer
CharStream charStream = new ANTLRStringStream(input);
C_Minus_Lexer_Grammar lexer = new C_Minus_Lexer_Grammar(charStream);
TokenStream tokenInputStream = new CommonTokenStream(lexer);