My Irony grammar class for parsing Java is not working... yet.

Nov 22, 2010 at 6:40 PM
Edited Nov 24, 2010 at 3:10 AM

I am writing a language service for Java.  Since I want to do syntax highlighting, I need a parser for the Java language.  After much searching and researching, I absolutely love Irony.  My problem is writing a grammar that can handle Java source code.

Here's what I've done so far:

I originally started modifying the sample C# grammar since, I thought, C# and Java are so similar.  I quickly found out that they aren't that similar at all.  I then thought that I can't be the first person to need to parse Java so I went in search of what others have done.  I have found a few different grammar specifications for Java in BNF form.  The first ones I found were for ANTLR.  At first I tried transcribing the rules by hand but that was WAY too monotonous and error prone so I wrote an Irony grammar for ANTLR grammar files.  Easy enough.  Then I used Irony to parse the Java grammar file and use the tree to create Irony grammar code for Java.  I played with that for a long time (about 2 weeks) with little success when I noticed that ANTLR makes LL(k) parsers whereas Irony is an LALR(1) parser.

Luckily I found that SableCC had a Java grammar and it generates LALR(1) parsers so I was hoping to have more luck with this grammar.  It also seemed more accurate with regards to the rules of Java.  So I did the same process as above (i.e. Make SableCC grammar, generate Irony code).

Also, as a sanity check, I created a parser (in C#) using SableCC and the Java grammar.  It worked fine and created a nice parse tree that I verified was accurate according the test file I'm using.  So I believe the grammar is correct.

My problem is that the Irony parser doesn't work.  Specifically it often goes down the wrong path and ends up with a syntax error.  I'm not overly surprised because I have numerous reduce-reduce conflicts and shift-reduce conflicts.

Lets start with the first reduce-reduce conflict in S0 (there are 1580 states).  Here is the state:

State S0 (Inadequate)
  Shift items:
    compilation_unit' -> ·compilation_unit EOF
     compilation_unit -> ·package_declaration_opt import_declarations_star type_declarations_star
     package_declaration_opt -> ·package_declaration
     package_declaration -> ·modifiers_star package identifier additional_identifiers_star semi
     modifiers_star -> ·modifiers_star modifier
   Reduce items:
    package_declaration_opt -> · [import semi class_token abstract at final native private protected public static strictfp synchronized transient volatile enum interface EOF]
    modifiers_star -> · [package abstract at final native private protected public static strictfp synchronized transient volatile]
  Transitions: compilation_unit->S1, package_declaration_opt->S2, package_declaration->S3, modifiers_star->S4

The conflict is:

Reduce-reduce conflict. State S0, lookaheads: abstract at final native private protected public static strictfp synchronized transient volatile. Selected reduce on first production in conflict set.

As I understand it, this means that if the next token is any one of the lookaheads then the parser doesn't know if it should reduce to a package_declaration_opt of a modifiers_star.  In this case the lookaheads are all possible tokens in the production "modifier".  Why does it not reduce to a modifier, which could then reduce to a modifiers_star, which would then shift in package_declaration? I know how to specify whether to shift or reduce, but how do I push the parser towards one reduction or another?  Of course, I'm not sure that would help me because I'm not sure which one it should go to.

I'm a complete noob with regard to parsing theory, but I have done my research so I'm not clueless.  However in this case I am stumped.  Below are links to any files that may be helpful in understanding my dilema.

Any help is greatly appreciated.  I can also share my ANTLR and SableCC grammars for Irony and my generation code if that is helpful.

Jere

Coordinator
Nov 24, 2010 at 12:46 AM
Edited Nov 24, 2010 at 12:49 AM

Well, the main advise I can give - don't reuse grammars for other tools, that wouldn't work, unfortunately. Tried it myself, that does not work.

Find java spec, with BNF forms, and reconstruct Irony grammar following it section by section. Keep in mind, the official specs are not perfect, c# has several mistakes there. SableCC - hard to guess why the grammar works for Sable and not with Irony. Conflicts Irony displays are LALR(1), algorithm-based conflicts, not faults of an implementation. It might be SableCC implements GLR (parallel GLR, following all conflicting alternatives until one path wins). 

Next - don't try to run/parse anything until you resolve the conflicts - either by restructuring the grammar, or by using hints with explicit resolution instruction. You must have clear grammar, free of reported conflicts before you start parsing.

Looking at the grammar - it certainly might be improved and simplified. Keep in mind, SableCC does not support operator precedence, and some other advanced tricks (like list construction); so SableCC grammar is usually much more complex than equivalent in Irony. Look here for some comparison:

http://intellect.dk/post/Writing-a-calculator-in-C-using-Irony.aspx

One thing I see is the way you construct lists with delimiters: Irony's MakePlus/StarRule methods support delimiters in lists, and even lists with optional ending delimiters.

So go ahead and start rewriting it using native Irony facilities. My advice - don't "overdeclare" things, Irony supports using string literals directly in expressions. Simplify, simplify, simplify - you have enough complexity ahead. 

About some code in your grammar, setting up terminals - all these huge sequences of char intervals. Something wrong is here. My guess is that situation here is similar to c# identifier, where (as spec says) - valid characters must belong to certain UNICODE categories. Look at java spec, compare with c# identifier setup in TerminalFactory - I guess you can do something similar here.

Good luck

Roman

 

Nov 24, 2010 at 2:02 AM

Thanks for the advice Roman.  I was going the other way.  Trying to declare things that might be similar as separate rules.

I have a couple comments:

1) Replacing one of the more common constructs (that of identifier additional_identifiers*) with the following solved almost all of my reduce-reduce conflicts:

qualified_name.Rule = MakePlusRule(qualified_name, DOT, identifier)
2) Regarding the list of characters, those are accurate.  I even made a blog post about it: http://vsadt.squarespace.com/blog/2010/11/16/java-real-numbers-do-the-funky-chicken-in-hex.html

And a couple specific questions:

1) Is it possible to have a MakeStarRule in the middle of another rule or must it be its own rule?  Something like this: 

class_body.Rule = L_BRC + MakeStarRule(**What would go here?**, class_body_declaration) + R_BRC;

2) How would I solve the the conflict in my first post?  The problem, as I see it, is that both a "package_declaration" and a "type_declaration" can start with a "modifier".  Since "package_declaration" and "import declarations" are optional, the parser doesn't know whether it should go down the "package_declaration" rabbit hole or the "type_declaration" rabbit hole.  But if the parser would just parser the modifiers then the next symbol would tell it whether it was looking at a "package_declaration" or a "type declaration".  What am I missing?

3) Should I even care about operator precedence right now?  My first goal is to get a tree, any tree.  Then I can worry about whether the tree splits in the right spots.

4) If I'm going to write it from scratch, I'll have to have some partial successes.  Toward that end, how do I specify a rule that says, essentially, suck up everything until you find a semicolon?  I tried using a FreeTextLiteral, but it caused Irony to grab memory at an unbelievable rate.

Thanks!

Jere

Coordinator
Nov 24, 2010 at 2:50 AM

Hi 

answering your questions:

1) "Replacing one of the more common construc...."    That's what I would expect: Irony's implementation for list construction is conflict-free and well tested.

2) List of characters - I understand that they are accurate, I just think there is some underlying principle - like certain Unicode categories are allowed. In this case these restrictions can be expressed more easily. Irony's identifier supports this concept for c#; given many similarities between c# and java I would expect very similar thing here. I looked at Jave spec (from your blog), and found quite surprising definition: java identifier character is a character for which the function Character.isJavaIdentifierStart(int) returns true... hmm... strange way to define language basic artifacts - through reference to its runtime... and she calls me stupid.. well, I hope there's some better structured idea behind this

3) Make star rule - no, make star rule must be standalone. In fact, inside MakeStarRule, it already assigns list.Rule value; so writing it as " list.Rule = MakeStarRule(list, elem); " is just for clarity, to show that list.Rule is also assigned along with other non-terminals

4) Initial conflict in S0 - will look at it, please post updated grammar 

5) Operator precedence is for automatic conflict resolution. Having it in Irony allows you to greatly simplify grammars for expressions, bundle them all into one binary expression; if you don't have prec support, you have to do very painful construction of various operators in their prec order - again, see the link about writing calculator, it all explained there

6) Partial successes - certainly good thing; but then start with simplified grammar, top-level elements, make grammar conflict-free, parse samples; then add more detailed elements to grammar, again fix the conflicts, parse samples etc

It is just grammar with conflicts has no practical use - you see, it fails even to parse your samples correctly

About Irony sucking memory - that's alarming sign.. can you please give more details?

 

Nov 24, 2010 at 4:56 AM

Here are the updated files.  The rest of the files are the same:

Thanks for your expert eye,

Jere

Nov 24, 2010 at 6:34 AM
Edited Nov 24, 2010 at 6:35 AM

As per your suggestion, I started with a clean slate and am starting with just the top level elements.  I just can't get the reduce-reduce conflict to go away on S0.

This is the my Java grammar project as I have it now (downloadable here):

My only rules (as a reminder, ALL CAPS are Terminals):

 

compilation_unit.Rule = package_declaration + import_declarations + type_declarations;
qualified_name.Rule = MakePlusRule(qualified_name, DOT, identifier);
package_declaration.Rule = (Empty | (modifiers + PACKAGE + qualified_name + SEMI));
import_declarations.Rule = MakeStarRule(import_declarations, import_declaration);
import_declaration.Rule = IMPORT + qualified_name + import_wildcard + SEMI;
import_wildcard.Rule = Empty | (DOT + STAR);
type_declarations.Rule = MakeStarRule(type_declarations, type_declaration);
type_declaration.Rule = modifiers + CLASS_TOKEN + identifier + L_BRC + R_BRC;
modifiers.Rule = MakeStarRule(modifiers, modifier);
modifier.Rule = ABSTRACT | FINAL | NATIVE | PRIVATE | PROTECTED | PUBLIC | STATIC | STRICTFP | SYNCHRONIZED | TRANSIENT | VOLATILE;

S0 looks almost identical to the one in my original post.

 

I know you said don't try to parse anything until I have a clean grammar, but I just can't help it. :-)  The following sample *should* parse cleanly with only the above rules:

 

public package com.test.jere;

import com.test.roman.Irony;
import com.test.cristina.*;

public class GrammarTest {

}

 

But there is, unsurprisingly, a syntax error at (0:7) because it chooses to reduce the entire package_declaration.  

I seem to be stuck on a very basic concept. What am I missing?

Thanks,

Jere

Coordinator
Nov 24, 2010 at 7:37 AM

well, why it happens - is clear; when parser sees a modifier like "public", he cannot know what is that - a first token of class declaration (in which case it should create empty package declaration and imports), or it is the first "public" of package declaration.

The trouble with star lists is that the list element is created as empty list, before the first element is consumed - but this is some technicalities.

How to fix it - let me think a bit

 

Nov 24, 2010 at 10:14 AM

I have found one way to eliminate the conflict although it "feels" wrong.  Here is the new grammar:

compilation_unit.Rule = Empty;
compilation_unit.Rule |= modifiers + package_declaration + import_declarations + modifiers + type_declaration + type_declarations;
compilation_unit.Rule |=             package_declaration + import_declarations + modifiers + type_declaration + type_declarations;
compilation_unit.Rule |=                                   import_declarations + modifiers + type_declaration + type_declarations;
compilation_unit.Rule |=                                   import_declarations +             type_declaration + type_declarations;
compilation_unit.Rule |= modifiers + package_declaration +                       modifiers + type_declaration + type_declarations;
compilation_unit.Rule |=             package_declaration +                                   type_declaration + type_declarations;
compilation_unit.Rule |=                                                         modifiers + type_declaration + type_declarations;
compilation_unit.Rule |=                                                                     type_declaration + type_declarations;
compilation_unit.Rule |= modifiers + package_declaration + import_declarations;
compilation_unit.Rule |=             package_declaration + import_declarations;
compilation_unit.Rule |=                                   import_declarations;
compilation_unit.Rule |= modifiers + package_declaration;
compilation_unit.Rule |=             package_declaration;
			
qualified_name.Rule = MakePlusRule(qualified_name, DOT, identifier);
package_declaration.Rule = PACKAGE + qualified_name + SEMI;
import_declarations.Rule = MakePlusRule(import_declarations, import_declaration);
import_declaration.Rule = normal_import_declaration | static_import_declaration;
normal_import_declaration.Rule = IMPORT + qualified_name + import_wildcard + SEMI;
static_import_declaration.Rule = IMPORT + STATIC + qualified_name + import_wildcard + SEMI;
import_wildcard.Rule = Empty | (DOT + STAR);
type_declarations.Rule = MakeStarRule(type_declarations, type_declaration_with_modifiers);
type_declaration_with_modifiers.Rule = modifiers.Q() + type_declaration;
type_declaration.Rule = CLASS_TOKEN + identifier + L_BRC + R_BRC;
modifiers.Rule = MakePlusRule(modifiers, modifier);
modifier.Rule = ABSTRACT | FINAL | NATIVE | PRIVATE | PROTECTED | PUBLIC | STATIC | STRICTFP | SYNCHRONIZED | TRANSIENT | VOLATILE;

Does this look reasonable to you?

Jere

Nov 24, 2010 at 2:25 PM
rivantsov wrote:

The trouble with star lists is that the list element is created as empty list, before the first element is consumed - but this is some technicalities.

This actually could be the source of all of my conflicts. I have noticed that all of my conflicts have dealt with star lists. Is this something that you can fix?

Coordinator
Nov 24, 2010 at 4:11 PM
Edited Nov 24, 2010 at 4:12 PM

I suggest an alternative:

 

            var declaration = new NonTerminal("declaration");
            var declarations = new NonTerminal("declarations"); 
            declarations.Rule = MakeStarRule(declarations, declaration); 
            declaration.Rule = package_declaration | import_declaration | type_declaration; 
	    compilation_unit.Rule = declarations;

 

this allows arbitrary order of package/import/type declarations in grammar and during parsing; you can simply verify the proper order in code after parsing, walking over the tree

Nov 25, 2010 at 2:43 PM

Thanks for the help.  I am making progress now.  Would you have any interest in including my finished Java grammar with Irony?

Jere

Coordinator
Nov 25, 2010 at 4:09 PM

Absolutely, i can include it in samples with proper credits

Good luck