Help disambiguating a grammar

Aug 20, 2014 at 9:39 AM
I have a grammar defined by a legacy scripting language for a CAD system, which largely parses correctly using the following Irony grammar:
simple_statement.Rule = keyword + argument_list + semicolon;
argument_list.Rule = MakeListRule(argument_list, comma, named_argument, TermListOptions.AllowEmpty);
named_argument.Rule = identifier + "=" + expression;
The problem I'm trying to resolve arises from the use of the comma both to separate these "named arguments" as shown above, and also to separate the triple values used for x,y,z coordinate values.

So for example, a valid statement in the language is like this:
plc_line end1=0,0,0,end2=1,1,1;
I've tried adding the following (including in various other combinations), but I'm not knowledgeable enough to figure out how to resolve the conflicts in the grammar produced.
named_argument.Rule = identifier + "=" + (expression | coordinate);
coordinate.Rule = expression + comma + expression + comma + expression;
I'd appreciate any ideas on how I can resolve this parsing problem?
Aug 21, 2014 at 5:27 AM
I've made a small amount of progress on this problem, by modifying the coordinate grammar and adding the following custom action handler (largely copied from the included CSharpGrammar example:
coordinate.Rule = expression + CustomActionHere(this.ResolveCommaConflict) + comma + expression + comma + expression;

private void ResolveCommaConflict(ParsingContext context, CustomParserAction customAction)
    var scanner = context.Parser.Scanner;
    ParserAction action;

    /* After a comma, we could have either a named argument, or the "y" component of a coordinate value. We
     * disambiguate the two by skipping ahead two tokens and looking for an equal sign. If we find one, we
     * assume that we have found a named argument, otherwise that it's a coordinate. */
    if (context.CurrentParserInput.Term.Name == "comma")
        var preview = scanner.GetToken();

        if (preview.Terminal.Name == "=")
            action = customAction.ReduceActions.First();
            action = customAction.ShiftActions.First(a => a.Term.Name == "comma");

Unfortunately, while my coordinate values are now parsing correctly, it's making a mess of the non-coordinate argument_list parsing when there's more than one parameter. I'm having trouble understanding what's going wrong, as it appears to me (via debugging at runtime) that the ReduceActions.First() item is precisely the reduction that I want made, but it doesn't seem to be achieving the result I'm expecting?

Any advice would be much appreciated.
Aug 21, 2014 at 6:35 AM
I think the simplest way to handle it is to 'ignore' coordinate lists during parsing, and then rearrange parse tree after parsing. Here's the grammar:
        Terminal number = new NumberLiteral("number");
        IdentifierTerminal identifier = new IdentifierTerminal("Identifier");
        var simple_statement = new NonTerminal("simple_statement");
        var argument_list = new NonTerminal("argument_list");
        var named_argument = new NonTerminal("named_argument");
        var expression = new NonTerminal("expression");
        var argument = new NonTerminal("argument");
        var keyword = ToTerm("plc_line");
        var semicolon = ToTerm(";");
        var comma = ToTerm(",");

        simple_statement.Rule = keyword + argument_list + semicolon;
        argument_list.Rule = MakeListRule(argument_list, comma, argument, TermListOptions.AllowEmpty);
        argument.Rule = named_argument | expression; 
        named_argument.Rule = identifier + "=" + expression;
        expression.Rule = identifier | number;

        this.Root = simple_statement;      
Aug 21, 2014 at 6:39 AM
If you parse a string:
plc_line end1=0,0,0,end2=1,1,1;
  • you'll get plain arg list, with named arguments and non-named args. Then you can use a simple visitor to run through the tree and rearrange it - find all 'simple_statement' nodes, iterate thru args, and for each found named_argument, move all simple args that follow it to 'child' list of named argument.
    This stuff is quite difficult to express in LALR(1) grammars - theoretically possible, but quite tricky
Aug 21, 2014 at 6:52 AM
Thanks for the feedback Roman, but it seems like I was on the right track; the solution above works correctly, and the problem I encountered was being caused by a problem with the scanner. Although it does require the custom hint, I'm pretty satisfied with the result, so unless I find a real issue with it, I'll go with my own solution.

Regarding the bug I think I've found, I have two kinds of identifiers allowed in my grammar: regular identifiers (which are purely alphabetic), and "built in" variables, which must begin with a %% prefix. I've defined these as follows:
var identifier = new IdentifierTerminal("identifier");
var builtin_variable = new IdentifierTerminal("builtin_variable");
builtin_variable.AddPrefix("%%", IdOptions.NameIncludesPrefix);
This works fine normally (i.e. expressions are correctly parsed with references to identifier or builtin_variable as appropriate), but it seems that Scanner.GetToken() doesn't work correctly with the prefixed tokens when running in preview mode?

If I debug my custom action handler code above, the first call to scanner.GetToken() returns different results depending on whether the call to scanner.BeginPreview() is executed or not. If preview mode is inactive, it correctly returns an identifier token; when active, it returns a builtin_variable token.

This behaviour is also apparent if I remove my custom action handler from the grammar altogether, in which case the identifiers are parsed correctly - as soon as I add my handler in, it incorrectly matches identifiers using the builtin_variable rule, which causes reduction problems.

This seems to be a bug, but I can't see any obvious reason from a quick inspection of the scanner code why this might be happening?

PS: I also thought that if I used scanner.EndPreview(false), it might discard the incorrectly scanned tokens, and then correctly re-scan them when it continued after the handler, but instead it appears to completely discard them from the input queue? Is there any way to force the scanner to re-scan them?
Sep 5, 2014 at 6:01 AM
Any suggestions regarding the bug I've identified here?
Sep 5, 2014 at 6:00 PM
I think the problem is your misunderstanding of prefixes for identifiers. Prefix is an optional (!) thing, which if found is interpreted as additional option for identifier. Scanner assumes that if you defined prefix, there might be identifiers of this kind without prefixes. Then your grammar probably has ambiguity - any identifier without '%%' can be either of two id terminals you defined. Note that this kind of ambiguity does not result in grammar conflict reported in grammar explorer - Irony does not know that in fact two identifiers are internally ambiguous. However, it might parse correctly because during parsing the scanner gets assistance from LALR parser - when scanner sees a string of letters, he has two identifier terminals that are candidates (based on FirstChars() list from both). So scanner asks LALR parser - based on current parser state, which one is allowed? If your grammar/language is structured in such a way that each id type can appear in distinct, its own places, the response from parser allows to choose one correct terminal.
However, in scanner preview mode, the parser is frozen in a current state while scanner runs ahead trying to scan and find certain tokens. In this mode it cannot ask parser for assistance, so the scanner left on its own makes wrong decisions.
Here's how to fix it. Allow builtin_var to be scanned as identifier initially, but intercept the scanner and fix it if necessary. So add '%' as allowed first (and internal to account for second %) char in identifier. Hook to identifier.ValidateToken event. In the event handler check if the name starts with '%%' - if yes, replace the Terminal field in resulting token with buitin_var terminal. And remove adding prefix '%'. That should fix it, even in preview mode
Sep 6, 2014 at 6:46 AM
Thanks Roman; I follow your logic and have resolved the problem.

When I tried to apply the approach you outlined, I found an issue where the scanner was sometimes incorrectly matching my builtin_variable terminal when it should have been an identifier. I could fix this by adding another ValidateToken event handler to the builtin_variable, but I thought it was better to alter the approach slightly:
  1. Defined the builtin_variable terminal to allow % characters in the first and following set (the identifier terminal does not allow either).
  2. Added a ValidateTerminal() event handler for the builtin_variable terminal, which rejects any tokens which do not begin with %% (and validates that no additional % characters appear in the token).
This means the scanner automatically matches _identifier_s in many cases, and only sometimes tries to match a builtin_variable and has to try again if the event handler forces it to reject that match.

I've noticed that there seem to be two equally valid methods of rejecting a token, as follows; which is more correct?
e.Context.CurrentToken = null;
Also, is there any particular advantage or disadvantage of explicitly telling the scanner which token to match versus just rejecting it and allowing it to try again?
Sep 7, 2014 at 5:31 AM
the trouble with builtin_variable - you can fix it by simply making it lower priority (assign Termina.Priority value), lower than identifier. This will force scanner to always first try identifier. Currently, because both have equal priorities, it is a random choice in different parser states.
As for RejectToken and direct null assignment. Directly assigning CurrentToken is one possibility to replace the token with whatever you need in some tricky logic if you ever need to. If you assign null, you are rejecting it. As far as I recall, RejectToken is doing the same - to give a clear way to express to programmer how to do it. (and avoid extra questions on this forum)
I don't quite understand your last question about 'explicitly telling which token to match'. You mean 'terminal to use to match'? I don't think there's any difference in general in most situations, in many cases just like with RejectToken(), the extra 'different' ways of doing the same thing is an attempt to explicitly express certain common actions.