Nov 22, 2010 at 5:40 PM
Edited Nov 24, 2010 at 2:10 AM
I am writing a language service for Java. Since I want to do syntax highlighting, I need a parser for the Java language. After much searching and researching, I absolutely love Irony. My problem is writing a grammar that can handle Java
Here's what I've done so far:
I originally started modifying the sample C# grammar since, I thought, C# and Java are so similar. I quickly found out that they aren't that similar at all. I then thought that I can't be the first person to need to parse Java so I went in search
of what others have done. I have found a few different grammar specifications for Java in BNF form. The first ones I found were for ANTLR. At first I tried transcribing the rules by hand but that was WAY too monotonous and error
prone so I wrote an Irony grammar for ANTLR grammar files. Easy enough. Then I used Irony to parse the Java grammar file and use the tree to create Irony grammar code for Java. I played with that for a long time (about 2 weeks) with little
success when I noticed that ANTLR makes LL(k) parsers whereas Irony is an LALR(1) parser.
Luckily I found that SableCC had a Java grammar and it generates LALR(1) parsers so I was hoping to have more luck with this grammar. It also seemed more accurate with regards to the rules of Java. So I did the same process as above (i.e. Make
SableCC grammar, generate Irony code).
Also, as a sanity check, I created a parser (in C#) using SableCC and the Java grammar. It worked fine and created a nice parse tree that I verified was accurate according the test file I'm using. So I believe the grammar is correct.
My problem is that the Irony parser doesn't work. Specifically it often goes down the wrong path and ends up with a syntax error. I'm not overly surprised because I have numerous reduce-reduce conflicts and shift-reduce conflicts.
Lets start with the first reduce-reduce conflict in S0 (there are 1580 states). Here is the state:
State S0 (Inadequate)
compilation_unit' -> ·compilation_unit EOF
compilation_unit -> ·package_declaration_opt import_declarations_star type_declarations_star
package_declaration_opt -> ·package_declaration
package_declaration -> ·modifiers_star package identifier additional_identifiers_star semi
modifiers_star -> ·modifiers_star modifier
package_declaration_opt -> · [import semi class_token abstract at final native private protected public static strictfp synchronized transient volatile enum interface EOF]
modifiers_star -> · [package abstract at final native private protected public static strictfp synchronized transient volatile]
Transitions: compilation_unit->S1, package_declaration_opt->S2, package_declaration->S3, modifiers_star->S4
The conflict is:
Reduce-reduce conflict. State S0, lookaheads: abstract at final native private protected public static strictfp synchronized transient volatile. Selected reduce on first production in conflict set.
As I understand it, this means that if the next token is any one of the lookaheads then the parser doesn't know if it should reduce to a package_declaration_opt of a modifiers_star. In this case the lookaheads are all possible tokens in the production
"modifier". Why does it not reduce to a modifier, which could then reduce to a modifiers_star, which would then shift in package_declaration? I know how to specify whether to shift or reduce, but how do I push the parser towards one reduction
or another? Of course, I'm not sure that would help me because I'm not sure which one it should go to.
I'm a complete noob with regard to parsing theory, but I have done my research so I'm not clueless. However in this case I am stumped. Below are links to any files that may be helpful in understanding my dilema.
Any help is greatly appreciated. I can also share my ANTLR and SableCC grammars for Irony and my generation code if that is helpful.