Is there a way to resolve this conflict with a hint?

Sep 10, 2011 at 3:37 AM

Hi!

I'm having a problem with an ambiguous grammar. I think it can be solved with ReduceIf/ShiftIf, but I don't know how to use them. I hope you guys can help me out.

Here is my sample grammar

 

var expr = new NonTerminal("expr");
var amount = new NonTerminal("amount");
var itemsLeft = new NonTerminal("itemsLeft");

var money = new RegexBasedTerminal("money", @"(\d*((\,*)\d*)*\.?\d{1,2})");
var moneyIdentifier = new RegexBasedTerminal("identifier", @"USD|dollar|dollars");

var numericValue = new RegexBasedTerminal("numericValue", @"\d{1,3}");
var separator = new RegexBasedTerminal("separator", @"out of|left from");

expr.Rule =
	amount
	| itemsLeft;

amount.Rule =
	money
	| money + moneyIdentifier;

itemsLeft.Rule =
	numericValue + separator + numericValue;

this.Root = expr;

 

The following two inputs are parsed the same way, even though they should yield different expressions.

Input 1: 1 USD

Input 2: 1 out of 3

Input 1 is parsed correctly as money (1) + identifier (USD). For input 2 the parser tries the same thing: it takes 1 as money and then fails because no identifier follows. The correct parse would be numericValue (1) + separator (out of) + numericValue (3).

From reading the code, some docs, and discussions here, I was led to think it would work if the parser could look at the next token before making a decision, or start over when it hits a mistake. In this case it should look at the second token before taking the first one as money. If the second token is a separator and not an identifier, it can select the correct path.

I tried using ReduceIf() and ShiftIf() in numerous combinations but my knowledge failed me and I couldn't make it work :(

Hope you guys can help.

Thanks a lot!

Sep 10, 2011 at 4:16 PM

There is no ambiguity in your grammar, and it is normal that it can only parse the amount rule. The parser cannot tell money and numericValue apart: both can accept 1, and because they are regex terminals the parser knows nothing about what they accept. So the parser tries the money terminal first, and that succeeds every time the input starts with a number.
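To see the overlap outside of Irony, here is a minimal sketch that runs the two patterns from the question against both inputs using plain .NET regular expressions. The class name TerminalOverlapDemo and the helper MatchAtStart are made up for this illustration and are not part of Irony; the sketch only checks what each pattern accepts at the start of the input and does not reproduce how Irony's scanner picks a terminal.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Irony.
public static class TerminalOverlapDemo
{
    // Patterns copied from the grammar in the question.
    public const string MoneyPattern = @"(\d*((\,*)\d*)*\.?\d{1,2})";
    public const string NumericValuePattern = @"\d{1,3}";

    // Returns the text the pattern matches at position 0 of the input,
    // or null when the pattern does not match there.
    public static string MatchAtStart(string pattern, string input)
    {
        // \G anchors the match at the position where the search starts.
        var match = Regex.Match(input, @"\G(?:" + pattern + ")");
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        foreach (var input in new[] { "1 USD", "1 out of 3" })
        {
            Console.WriteLine("input: " + input);
            Console.WriteLine("  money:        " + MatchAtStart(MoneyPattern, input));
            Console.WriteLine("  numericValue: " + MatchAtStart(NumericValuePattern, input));
        }
    }
}
```

Both patterns accept the leading 1 in both inputs, so the text of the token alone cannot decide which terminal it belongs to.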

But if you use the same terminal for both money and numericValue, then you will be able to parse it:

            var expr = new NonTerminal("expr");
            var amount = new NonTerminal("amount");
            var itemsLeft = new NonTerminal("itemsLeft");

            var money = TerminalFactory.CreateCSharpNumber("number");
            var moneyIdentifier = new RegexBasedTerminal("identifier", @"USD|dollar|dollars");

            var numericValue = money;
            var separator = new RegexBasedTerminal("separator", @"out of|left from");

            expr.Rule =
                amount
                | itemsLeft;

            amount.Rule =
                money
                | money + moneyIdentifier;

            itemsLeft.Rule =
                numericValue + separator + numericValue;

            this.Root = expr;
Then you could attach a validator to the itemsLeft rule to verify the numbers.

Sep 16, 2011 at 2:24 AM

Great Alexandre!

It worked like a charm, and I didn't even have to use validation to make it work in my case. This suggestion also helped me with some other cases that I had been resolving with a more complex solution.

Even though I didn't have to use validation, I think I could use some info about it.

Should I plug it into the ValidateToken event of a Terminal? If so, how do I tell the parser that a token is not valid and should be ignored, so that it tries other terminals?

Thanks a lot!

Coordinator
Sep 16, 2011 at 4:34 PM

The ValidateToken event is a bit obscure; I will refactor it to expose these methods more explicitly in the eventArgs object. For now, to report an error (invalid token) you should replace Context.CurrentToken with an Error token (containing an error message). Find examples in the samples; there are some, I'm sure.

I want to point out that use of RegexBasedTerminal should really be discouraged; it should be used ONLY when there's no other option. The moneyIdentifier is not such a case: you could instead define a non-terminal and set its rule to:

currToken.Rule = ToTerm("USD") | "dollar" | "dollars"; 

This definition is much easier for the scanner/parser to digest and optimize, unlike a regex, which is a black box.
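For context, here is a rough sketch of how that rule might sit inside a complete grammar class. The class name AmountGrammar and the non-terminal name currency are made up for this illustration, the number terminal is borrowed from the earlier reply, and only the amount rule from the thread is covered. It needs a reference to the Irony assembly and has not been checked against any particular Irony version.

```csharp
using Irony.Parsing;

// Hypothetical grammar class; the name is an assumption for this sketch.
public class AmountGrammar : Grammar
{
    public AmountGrammar()
    {
        var expr = new NonTerminal("expr");
        var amount = new NonTerminal("amount");
        // Replaces the regex-based moneyIdentifier terminal from the question.
        var currency = new NonTerminal("currency");

        // Number terminal as used in the earlier reply.
        var number = TerminalFactory.CreateCSharpNumber("number");

        // Plain key terms instead of a regex, as suggested above.
        currency.Rule = ToTerm("USD") | "dollar" | "dollars";

        amount.Rule = number | number + currency;
        expr.Rule = amount;

        this.Root = expr;
    }
}
```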

Roman

 

Sep 16, 2011 at 6:17 PM

Hi Roman,

Thanks for your reply :)

I ended up needing validation and tried to use the 

e.Context.AddParserError

but with it the parser would stop and I would get an error, which wasn't what I needed. After poking around a bit I ended up doing

e.Context.CurrentToken = null

which worked out fine, because if the current token is null then the Scanner keeps trying.

I will try what you suggested.

Also, to follow your advice, I would like to ask you what's the best way to deal with cases like the following. I'm using only RegexBasedTerminals all over. My DSL needs to parse regular phrases (in Portuguese), so I can have prepositions, nouns, etc. That's why I thought that RegexBasedTerminal was the right fit. Here is the sample:

var preposicao = new RegexBasedTerminal("preposicao", @"\b(n[ao]|numa)\b"); //some prepositions in Portuguese

var data1 = new RegexBasedTerminal("data", @"\b\d{1,2}\/\d{1,2}(\/(\d{4}|\d{2}))?\b"); //date format

var data3 = new RegexBasedTerminal("data", @"\b(hoje|amanh[aã]|ontem|s[aá]bado|domingo|(segunda|ter[cç]a|quarta|quinta|sexta)((\-feira)|(\s?feira))?)\b"); //also date, but based on day of week names, like monday, thursday, etc

var data4 = new RegexBasedTerminal("data", @"\b\d{1,2}\b de \b\w+\b( de \b(\d{2}|\d{4})\b)?"); //and finally a date described with full string like "September 1 of 2011"

//And some NonTerminals using them

datas.Rule =
                data1 | data2 | data3 | data4;

local.Rule =
                preposicao + localToken
                | prefixoGenerico + localToken
                | localToken;

Thanks a lot for your close support!

Coordinator
Sep 18, 2011 at 4:17 PM

About these regex terminals: it is hard to say whether there's a way to replace them with more "direct" terminals. Generally, I would say that Irony and similar DFA-based parsers are not a good fit for parsing natural-language-like texts. The LALR method that Irony uses expects language constructs with strict, unambiguous rules, with certain restrictions.

Other parsing technologies and tools (Grammatica?) might be more appropriate in this case.

Roman