Need help fixing grammar conflicts for AngelScript

Jun 23, 2015 at 4:55 AM
Edited Jun 23, 2015 at 5:29 AM
I have been working with a friend on creating a grammar for parsing AngelScript using Irony. As usual there were various ambiguities in the grammar specification that cause conflicts. I have attempted to fix these however I have not been 100% successful. I think that I must be using the hints and ReduceIf calls incorrectly.

Below this are some specific questions I have, then below the questions is the full grammar spec for AngelScript. I have also uploaded my current sourcecode to https://gist.github.com/anonymous/f2186d671b8cbe71c91e.
Can you please point out if I am doing things incorrectly in my sourcecode?

1) Does ReduceIf() only operate on a single character rather than strings of characters?

2) Does TokenPreviewHint() only operate on a single character rather than strings of characters?

3) I tried to differentiate between "modifiers" (such as private, public, shared, etc) for various things like functions and classes based on example #4 from "Irony.Tests\TokenPreviewResolution\ConflictGrammars.cs". However when I look at the Parse Trace in "Irony Grammer Explorer" it appears that my TokenPreviewHint is not working properly.

I do a test parse of:
shared enum MyEnum { eValue0, eValue2 = 2, eValue3 = 3 }
Using the hint rule (from my sourcecode):
var enum_mod_hint = new TokenPreviewHint(PreferredActionType.Reduce, k_ENUM,  k_CLASS, op_L_PAR, k_INTERFACE);
And the Parse Trace reports:
Conditional Parser Action.
Checking condition: Reduce if enum comes before class l_par interface.
All conditions failed, executing default action. Reduce on 'enum_modifier->shared'
Popped state from stack, pushing enum_modifier.
I would expect the condition to be TRUE here based on my sourcecode and the text being parsed. What am I doing wrong?

4) I am still having grammar conflict errors in the rule for INTERFACE_IV, where it should choose INTFMTHD if it sees a "{" character before "(" or choose VIRTPROP if it sees a "(" before a "{". I have tried inserting hints in various ways, but every way I do it I still get the grammar conflicts. My sourcecode shows some of the various ways I have tried to do so in commented out code lines are the line:
INTERFACE_IV.Rule = ReduceIf("{", "(") + INTFMTHD | ReduceIf("(", "{") + VIRTPROP;
What am I doing wrong here? Why does this not work when it seems to me that it should work?



This is the full AngelScript grammar:
ARGLIST ::= '(' [IDENTIFIER ':'] ASSIGN {',' [IDENTIFIER ':'] ASSIGN} ')'
ASSIGN ::= CONDITION [ ASSIGNOP ASSIGN ]
ASSIGNOP ::= '=' | '+=' | '-=' | '*=' | '/=' | '|=' | '&=' | '^=' | '%=' | '**=' | '<<=' | '>>=' | '>>>='
BITOP ::= '&' | '|' | '^' | '<<' | '>>' | '>>>'
BITS ::= single token: binary 0b or 0B, octal 0o or 0O, decimal 0d or 0D, hexadecimal 0x or 0X
BREAK ::= 'break' ';'
CASE ::= (('case' EXPR) | 'default') ':' {STATEMENT}
CAST ::= 'cast' '<' TYPE '>' '(' ASSIGN ')'
CLASS ::= {'shared' | 'abstract' | 'final'} 'class' IDENTIFIER [':' IDENTIFIER {',' IDENTIFIER}] '{' {VIRTPROP | FUNC | VAR} '}'
COMPOP ::= '==' | '!=' | '<' | '<=' | '>' | '>=' | 'is' | '!is'
CONDITION ::= EXPR ['?' ASSIGN ':' ASSIGN]
CONSTRUCTCALL ::= TYPE ARGLIST
CONTINUE ::= 'continue' ';'
DATATYPE ::= IDENTIFIER | PRIMTYPE | '?' | 'auto'
DOWHILE ::= 'do' STATEMENT 'while' '(' ASSIGN ')' ';'
ENUM ::= ['shared'] 'enum' IDENTIFIER '{' IDENTIFIER ['=' EXPR] {',' IDENTIFIER ['=' EXPR]} '}'
EXPR ::= (TYPE '=' INILIST) | (EXPRTERM {EXPROP EXPRTERM})
EXPR_OP ::= MATHOP | COMPOP | LOGICOP | BITOP
EXPR_PREOP ::= '-' | '+' | '!' | '++' | '--' | '~' | '@'
EXPR_POSTOP ::= ('.' (FUNCCALL | IDENTIFIER)) | ('[' [IDENTIFIER ':'] ASSIGN {',' [IDENTIFIER ':' ASSIGN} ']') | ARGLIST | '++' | '--'
EXPR_STAT ::= [ASSIGN] ';'
EXPR_TERM ::= {EXPRPREOP} EXPRVALUE {EXPRPOSTOP}
EXPR_VALUE ::= 'void' | CONSTRUCTCALL | FUNCCALL | VARACCESS | CAST | LITERAL | '(' ASSIGN ')'
FOR ::= 'for' '(' (VAR | EXPR_STAT) EXPR_STAT [ASSIGN {',' ASSIGN}] ')' STATEMENT
FUNC ::= ['private' | 'protected' | 'shared'] [((TYPE ['&']) | '~')] IDENTIFIER PARAMLIST ['const'] {'override' | 'final'} STATBLOCK
FUNCCALL ::= SCOPE IDENTIFIER ARGLIST
FUNCDEF ::= 'funcdef' TYPE ['&'] IDENTIFIER PARAMLIST ';'
IDENTIFIER ::= single token: starts with letter or _, can include any letter and digit, same as in C++
IF ::= 'if' '(' ASSIGN ')' STATEMENT ['else' STATEMENT]
IMPORT ::= 'import' TYPE ['&'] IDENTIFIER PARAMLIST 'from' STRING ';'
INITLIST ::= '{' [ASSIGN | INITLIST] {',' [ASSIGN | INITLIST]} '}'
INTERFACE ::= ['shared'] 'interface' IDENTIFIER [':' IDENTIFIER {',' IDENTIFIER}] '{' {VIRTPROP | INTFMTHD} '}'
INTFMTHD ::= TYPE ['&'] IDENTIFIER PARAMLIST ['const'] ';'
LITERAL ::= NUMBER | STRING | BITS | 'true' | 'false' | 'null'
LOGICOP ::= '&&' | '||' | '^^' | 'and' | 'or' | 'xor'
MATHOP ::= '+' | '-' | '' | '/' | '%' | '*'
MIXIN ::= 'mixin' CLASS
NAMESPACE ::= 'namespace' IDENTIFIER '{' SCRIPT '}'
PARAMLIST ::= '(' ('void' | (TYPE TYPEMOD [IDENTIFIER] ['=' EXPR] {',' TYPE TYPEMOD [IDENTIFIER] ['=' EXPR]}) ')'
PRIMTYPE ::= 'void' | 'int' | 'int8' | 'int16' | 'int32' | 'int64' | 'uint' | 'uint8' | 'uint16' | 'uint32' | 'uint64' | 'float' | 'double' | 'bool'
NUMBER ::= single token: includes integers and real numbers, same as C++
RETURN ::= 'return' [ASSIGN] ';'
SCOPE ::= [[IDENTIFIER] '::' {IDENTIFIER '::'}]
SCRIPT ::= {IMPORT | ENUM | TYPEDEF | CLASS | MIXIN | INTERFACE | FUNCDEF | VIRTPROP | VAR | FUNC | NAMESPACE | ';'}
STATBLOCK ::= '{' {VAR | STATEMENT} '}'
STATEMENT ::= (IF | FOR | WHILE | RETURN | STATBLOCK | BREAK | CONTINUE | DOWHILE | SWITCH | EXPR_STAT)
STRING ::= single token: single quoted ', double quoted ", or heredoc multi-line string """
SWITCH ::= 'switch' '(' ASSIGN ')' '{' {CASE} '}'
TYPE ::= ['const'] SCOPE DATATYPE ['<' TYPE {',' TYPE} '>'] { ('[' ']') | '@' }
TYPEDEF ::= 'typedef' PRIMTYPE IDENTIFIER ';'
TYPEMOD ::= ['&' ['in' | 'out' | 'inout']]
VAR ::= ['private'|'protected'] TYPE IDENTIFIER [( '=' (INITLIST | EXPR)) | ARGLIST] {',' IDENTIFIER [( '=' (INITLIST | EXPR)) | ARGLIST]} ';'
VARACCESS ::= SCOPE IDENTIFIER
VIRTPROP ::= ['private' | 'protected'] TYPE ['&'] IDENTIFIER '{' {('get' | 'set') ['const'] [('override' | 'final')] (STATBLOCK | ';')} '}'
WHILE ::= 'while' '(' ASSIGN ')' STATEMENT
Jun 23, 2015 at 4:59 AM
Edited Jun 23, 2015 at 5:22 AM
edit: removed sourcecode chunks as it is now available on gist.guthub
Jun 23, 2015 at 4:59 AM
Edited Jun 23, 2015 at 5:23 AM
edit: removed sourcecode chunks as it is now available on gist.guthub
Jun 23, 2015 at 5:02 AM
Edited Jun 23, 2015 at 5:23 AM
edit: removed sourcecode chunks as it is now available on gist.guthub
Jun 23, 2015 at 5:03 AM
Edited Jun 23, 2015 at 5:23 AM
edit: removed sourcecode chunks as it is now available on gist.guthub
Jun 23, 2015 at 5:04 AM
Edited Jun 23, 2015 at 5:23 AM
edit: removed sourcecode chunks as it is now available on gist.guthub
Coordinator
Jun 23, 2015 at 5:09 AM
Please use some code share site like gist or smth for your grammar.
https://gist.github.com/\
Jun 23, 2015 at 5:20 AM
Edited Jun 23, 2015 at 5:29 AM
Ok I have added the grammar to gist.github, I will edit the above posts to remove the huge sourcecode chunks to make the thread more readable.

https://gist.github.com/anonymous/f2186d671b8cbe71c91e

(I edited the link to also include the grammar in a file)
Coordinator
Jun 23, 2015 at 5:42 PM
Oh man, sorry to tell you this, but this grammar is really messed up. Just launch GrammarExplorer and look at Non-Terminals tab. The list opens with 4 (!!!) identical ** non-terminals. Start scrolling and you'll find many other duplicates. This is really bad, and should be cleaned up first. After looking at grammar code, I guess the reason is that you use all of these helper methods that construct non-terminals on the fly. This is a bad thing; while in general in .NET code there's nothing wrong with having multiple copies of the same thing, in LALR grammar it is VERY BAD THING - a primary source of conflicts - which I see you tried to fix with hints and custom actions.
So start with removing all duplicates. General rule/recommendation - do not use non-terminals constructed on-the-fly, everything should be declared explicitly with appropriate name. One example - you have Optional() method, it does the same as ".Q()" method already in Irony (Q for Question mark) - do not use it, use of Q() is strongly discouraged too. Declaring everything explicitly helps you keep grammar slim and avoid bogus conflicts. This is the reality of LALR algorithm, not Irony's quirks. Comment out all these custom parser actions, hints, etc - these may come only at the very end, when you clearly understand the remaining conflicts.
One thing - you have a Number and BITS terminals, identical by definition, and used inside one rule with OR between them. This is another type of duplicate that Irony cannot detect, but which can become trouble. Since they are identical by definition (call to the same CreateAngelNumber function), they are different only semantically, not syntactically - and therefore you should use just one terminal for both - and probably re-classify some Numbers as Bits in semantic analysis (after parsing)
Having so many dups with the same name makes it impossible to debug/analyse the grammar.
In general, throw away the original BNF describing the language. From my own experience, these descriptive grammar are good for describing the language to humans but unfit for LALR parser. Start from scratch, designing grammar from you own understanding of the language. A good starting point is C grammar.
With C-style language, a common problem is the lists of modifiers (public, private, static etc) preceding element definition. These lists are slightly different for each element type (class or function or prop). The solution is to use the same (one and the only, all-inclusive) list term for all element types, and later detect improper use of modifier in after-parse analysis (by running through parse tree).
I would also advise, at least initially, to remove all this fancy stuff (like char ranges which look suspiciously like Unicode categories/ language pages), slim down to very basics and try to build an initial grammar with minimal amount of conflicts. Start using hints ONLY after you tried everything else, and you clearly understand the nature of the conflict to fix, and you know that this is language ambiguity (like dangling else), so it MUST be solved by hint. Minimize number of terminals and non-terminals. By the way, there's nothing wrong with using direct string literals inside expressions, without explicitly declaring them - they are converted to term objects automatically.
So my advise - cleanup first.
Roman
Jun 24, 2015 at 1:19 AM
Thank you, that already helps a lot. I was treating it like it required a strict grammar match to exactly fit the grammar specification. But I see now that more of a loose match is done to set up the parse tree where the various nodes are then visited in a later (semantics) pass to perform the stricter match.

I am rewriting it now to simplify it. You say not to use Optional() or Q(), but is it ok to use MakeStarRule() if it has a named NonTerminal (and optional delimiter)?
Or should I replace MakeStarRule() calls with my own handcoded rules for doing star rule?
Coordinator
Jun 24, 2015 at 4:31 AM
why you need your own hard-coded code for making lists?! I would recommend to stick with standard Irony-implemented methods - just inspect source code of all variations, see what they do. Pretty sure they cover most if not all cases - at least what I expect to be in c-like language like AngelScript. If MakeStar/PlusRule do not work for you, be very careful inventing your own - again, LALR parsing is very sensitive to seemingly little things. Note that these methods are just helper/shortcuts over LALR, abiding to LALR limitations. So if these methods don't work for you, you can just fall back to traditional definition of lists thru a couple of recursive bnf rules.
Jun 24, 2015 at 4:55 AM
What he meant by that was if it was okay to use the MakeStar/PlusRule methods. Or if he should make his own NonTerminal that does what MakeStar/PlusRule do for us before. So instead of calling the MakeStar rule he defines a non terminal in the constructor and does the same thing that the methods do for us which is creates a new terminal and set its flags and such. Since you mentioned we shouldnt be using the Optional or Q methods because they create an extra nonterminal so he just wanted to be sure if this applies for the StarPlus methods as well since they also create a new one. The make star rule function that he created was just a wrapper to help. It would name the list item so it didnt appear as unnamed when the MakeStar rule creates the new nonterminal. Because when you pass it a BnfExpression they arent named so this gives us the ability to set that name for it. Thanks for all the help, you really have clarified a lot of things.
Jun 24, 2015 at 7:33 AM
Yeh what Anth0ny said. Just wondering if it is ok to use the built in MakeStarRule and MakePlusRule methods.


Also is there any preference as to how this should be handled:
A ::= (B | C) + (D | E)

Which of the following should I do?

1) A.Rule = (B | C) + (D | E)

2) A.Rule = B + D | B + E | C + D | C + E

3) BC.Rule = B | C
DE.Rule = D | E
A.Rule = BC + DE
Coordinator
Jun 24, 2015 at 4:35 PM
Edited Jun 24, 2015 at 10:17 PM
1 - nothing wrong with it. The only trouble will come if you start defining AST nodes (for interpreter), then AST node for A would have to deal with a possible variety of child node types
Jun 24, 2015 at 10:15 PM
It's been rewritten as you suggested without use of custom actions and by not creating terminals at runtime and it seems like most of it is correct besides the 3 reduce-reduce errors which I cant seem to spot where its coming from.

http://pastebin.com/8qyXJrx7
Jun 27, 2015 at 12:31 AM
Thanks for you help Roman, by loosening the restrictions and doing various rule rewrites I was able to get it to parse correctly (finally).
Coordinator
Jun 27, 2015 at 4:04 AM
Great! and good luck to you guys