Keywords vs Identifiers

Apr 4, 2013 at 10:59 AM
Hi there,

I'm new to Irony and am trying to create a relatively simple BASIC interpreter/parser, but I've hit a roadblock.

How can I make sure that CLS2 is rejected as an identifier, and instead, the parser finds the cls_stmt which consists of "cls" + number?

I've tried a few random things but they haven't worked so I'm hoping for a point in the right direction.

Many thanks.
Apr 4, 2013 at 12:44 PM
I seem to be making progress by using a RegexBasedTerminal to define 'CLS'. Not sure if this is the best way of doing it, but right now it looks successful...
Apr 4, 2013 at 1:40 PM
Edited Apr 4, 2013 at 1:41 PM
Hey Rinkwraith,

If you could provide a small sample of your grammar, it would help out with determining the problem.

From what you described, it appears that you have a couple of rules where a Non-Terminal and an Identifier are similar and the Identifier is returned when you are expecting the Non-Terminal.

In this situation, it's best to simplify the rules further and possibly use hints or token priority so that the parser knows which path to take. The RegexBasedTerminal will work, but it should be a last resort because of its performance cost.

Regards,
Kevin
Apr 4, 2013 at 1:55 PM
Sure thing. Here's what I've been working on - it only supports very simple assignment statements and the CLS instruction at present.

Using the RegexBasedTerminals - which seems to be working ok at present (I used the same option to deal with the optional space after a linenumber too):
using System;
using Irony.Parsing;

namespace xBASIC
{
    [Language("xBASIC", "1.0", "")]
    public class XBASIC : Grammar
    {
        public XBASIC()
            : base(false)
        {
            this.GrammarComments = "";

            var CLS_KW = new RegexBasedTerminal("cls");
            CLS_KW.Name = "KEYWORD";

            // Terminals
            var lineNumber = new RegexBasedTerminal("[0-9]+");
            lineNumber.Name = "LINE_NUMBER";

            var number = new NumberLiteral("NUMBER", NumberOptions.IntOnly);
            var variable = new IdentifierTerminal("Identifier");
            variable.ValidateToken += new EventHandler<ValidateTokenEventArgs>(variable_ValidateToken);

            var stringLiteral = new StringLiteral("STRING", "\"", StringOptions.None);
            var userFunctionName = variable;
            var comment = new CommentTerminal("Comment", "REM", "\n");
            var short_comment = new CommentTerminal("ShortComment", "'", "\n");
            var comma = ToTerm(",", "comma");
            var colon = ToTerm(":", "colon");

            // Non-terminals
            var ASSIGN_STMT = new NonTerminal("ASSIGN_STMT");
            var CLS_STMT = new NonTerminal("CLS_STMT");

            var COMMENT_OPT = new NonTerminal("COMMENT_OPT");
            var EXPR = new NonTerminal("EXPRESSION");
            var EXPR_LIST = new NonTerminal("EXPRESSION_LIST");
            var LINE = new NonTerminal("LINE");           
            var PROGRAM = new NonTerminal("PROGRAM");
            var STATEMENT_LIST = new NonTerminal("STATEMENT_LIST");
            var STATEMENT = new NonTerminal("STATEMENT");

            // set the PROGRAM to be the root node of BASIC programs.
            this.Root = PROGRAM;

            // BNF Rules
            PROGRAM.Rule = MakePlusRule(PROGRAM, LINE);

            LINE.Rule = NewLine | lineNumber + COMMENT_OPT + NewLine | lineNumber + STATEMENT_LIST + NewLine | lineNumber + STATEMENT_LIST + COMMENT_OPT + NewLine | SyntaxError + NewLine;
            LINE.NodeCaptionTemplate = "Line #{0}";

            STATEMENT_LIST.Rule = MakePlusRule(STATEMENT_LIST, colon, STATEMENT);
            COMMENT_OPT.Rule = short_comment | comment | Empty;

            STATEMENT.Rule = CLS_STMT | ASSIGN_STMT;
            ASSIGN_STMT.Rule = (Empty | "let") + variable + "=" + EXPR;
            CLS_STMT.Rule = CLS_KW + PreferShiftHere() + (Empty | number);

            EXPR.Rule = number | variable | stringLiteral;
        }

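        // Reject any identifier that starts with "cls" so the CLS keyword terminal can match instead.
        // The grammar is case-insensitive (base(false)), so the prefix check ignores case as well.
        void variable_ValidateToken(object sender, ValidateTokenEventArgs e)
        {
            if (e.Context.CurrentToken.ValueString.StartsWith("cls", StringComparison.OrdinalIgnoreCase))
            {
                e.RejectToken();
            }
        }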

    }
}
And this is how it was when I first posted - i.e. a grammar that detects CLS5 as an identifier and then fails expecting an = sign in an assignment statement.
using System;
using Irony.Parsing;

namespace xBASIC
{
    [Language("XBASIC", "1.0", "")]
    public class XBASIC : Grammar
    {
        public XBASIC()
            : base(false)
        {
            this.GrammarComments = "";

            // Terminals
            var lineNumber = new NumberLiteral("LINE_NUMBER", NumberOptions.IntOnly);

            var number = new NumberLiteral("NUMBER", NumberOptions.IntOnly);
            var variable = new IdentifierTerminal("Identifier");

            var stringLiteral = new StringLiteral("STRING", "\"", StringOptions.None);
            var userFunctionName = variable;
            var comment = new CommentTerminal("Comment", "REM", "\n");
            var short_comment = new CommentTerminal("ShortComment", "'", "\n");
            var comma = ToTerm(",", "comma");
            var colon = ToTerm(":", "colon");

            // Non-terminals
            var ASSIGN_STMT = new NonTerminal("ASSIGN_STMT");
            var CLS_STMT = new NonTerminal("CLS_STMT");

            var COMMENT_OPT = new NonTerminal("COMMENT_OPT");
            var EXPR = new NonTerminal("EXPRESSION");
            var EXPR_LIST = new NonTerminal("EXPRESSION_LIST");
            var LINE = new NonTerminal("LINE");           
            var PROGRAM = new NonTerminal("PROGRAM");
            var STATEMENT_LIST = new NonTerminal("STATEMENT_LIST");
            var STATEMENT = new NonTerminal("STATEMENT");

            // set the PROGRAM to be the root node of BASIC programs.
            this.Root = PROGRAM;

            // BNF Rules
            PROGRAM.Rule = MakePlusRule(PROGRAM, LINE);

            LINE.Rule = NewLine | lineNumber + COMMENT_OPT + NewLine | lineNumber + STATEMENT_LIST + NewLine | lineNumber + STATEMENT_LIST + COMMENT_OPT + NewLine | SyntaxError + NewLine;
            LINE.NodeCaptionTemplate = "Line #{0}";

            STATEMENT_LIST.Rule = MakePlusRule(STATEMENT_LIST, colon, STATEMENT);
            COMMENT_OPT.Rule = short_comment | comment | Empty;

            STATEMENT.Rule = CLS_STMT | ASSIGN_STMT;
            ASSIGN_STMT.Rule = (Empty | "let") + variable + "=" + EXPR;
            CLS_STMT.Rule = "cls" + (Empty | number);

            EXPR.Rule = number | variable | stringLiteral;
        }
    }
}
There's still a shift-reduce warning on the LINE rule which I need to look at.
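One likely source of that warning, offered as an untested guess: COMMENT_OPT can already be Empty, so the alternative "lineNumber + STATEMENT_LIST + NewLine" duplicates "lineNumber + STATEMENT_LIST + COMMENT_OPT + NewLine". After a statement list, the parser can either shift NewLine or reduce an empty COMMENT_OPT. A minimal sketch that drops the redundant alternative follows; the class name LineRuleSketch and the stub STATEMENT rule are hypothetical and only the shape of the LINE rule matters:

```csharp
using Irony.Parsing;

// Hypothetical cut-down grammar; it only illustrates the LINE rule.
public class LineRuleSketch : Grammar
{
    public LineRuleSketch() : base(false)
    {
        var lineNumber = new NumberLiteral("LINE_NUMBER", NumberOptions.IntOnly);
        var number = new NumberLiteral("NUMBER", NumberOptions.IntOnly);
        var variable = new IdentifierTerminal("Identifier");
        var shortComment = new CommentTerminal("ShortComment", "'", "\n");
        var colon = ToTerm(":", "colon");

        var program = new NonTerminal("PROGRAM");
        var line = new NonTerminal("LINE");
        var statementList = new NonTerminal("STATEMENT_LIST");
        var statement = new NonTerminal("STATEMENT");
        var commentOpt = new NonTerminal("COMMENT_OPT");

        Root = program;
        program.Rule = MakePlusRule(program, line);
        statementList.Rule = MakePlusRule(statementList, colon, statement);
        statement.Rule = variable + "=" + number; // stub statement
        commentOpt.Rule = shortComment | Empty;

        // COMMENT_OPT covers the "no comment" case, so the alternative
        // without COMMENT_OPT is not repeated here.
        line.Rule = NewLine
                  | lineNumber + commentOpt + NewLine
                  | lineNumber + statementList + commentOpt + NewLine
                  | SyntaxError + NewLine;
    }
}
```

Loading the grammar in Grammar Explorer would show whether the conflict report actually goes away.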
Apr 4, 2013 at 4:37 PM
I haven't had a chance to experiment with your code yet, but the issue is with these two lines:

ASSIGN_STMT.Rule = (Empty | "let") + variable + "=" + EXPR;
CLS_STMT.Rule = "cls" + (Empty | number);

The issue is that CLS_STMT can be just "cls", since you have "cls" + Empty as a path. Because the variable terminal is an unrestricted Identifier, the parser doesn't know whether variable in ASSIGN_STMT should resolve to an Identifier or follow the CLS_STMT rule.

I believe the quickest solution is to make "cls" a reserved keyword. You do this by using the MarkReservedKeyword method (http://irony.codeplex.com/discussions/403654). This may not give you the desired result if you want to allow for cls to be a variable name though.

The only other solution is to rewrite CLS_STMT to not have the path "cls" + Empty.

It's actually a good habit to Register Braces, Register Operators, and Mark Reserved Keywords to help the parser. It also helps the syntax colorization in the Grammar Explorer.
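A minimal sketch of those three registrations, for illustration only. The class name RegistrationSketch and the small expression rule are hypothetical, and the method names used here (MarkReservedWords, RegisterOperators, RegisterBracePair) are the ones believed to exist on Irony's Grammar class, so check them against your version:

```csharp
using Irony.Parsing;

// Hypothetical grammar showing where the registrations go.
public class RegistrationSketch : Grammar
{
    public RegistrationSketch() : base(false)
    {
        var number = new NumberLiteral("NUMBER", NumberOptions.IntOnly);
        var variable = new IdentifierTerminal("Identifier");
        var expr = new NonTerminal("EXPRESSION");
        var statement = new NonTerminal("STATEMENT");

        expr.Rule = number | variable
                  | "(" + expr + ")"
                  | expr + "+" + expr
                  | expr + "-" + expr;
        statement.Rule = ToTerm("cls") | ToTerm("let") + variable + "=" + expr;
        Root = statement;

        // Keywords that should never be scanned as identifiers.
        MarkReservedWords("cls", "let");
        // Same precedence level for + and -, which resolves the ambiguity in expr.
        RegisterOperators(10, "+", "-");
        // Lets the parser (and Grammar Explorer) pair up parentheses.
        RegisterBracePair("(", ")");
    }
}
```

Note that reserving "cls" on its own only stops cls from being used as a variable name; it does not change how a run of characters such as cls5 is split into tokens.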

I hope this helps you out.

Regards,
Kevin
Apr 5, 2013 at 7:38 AM
Unfortunately, that didn't work - but your explanation as to why there's a problem makes perfect sense, so thank you for posting.

It stops cls being used as a variable but still interprets cls5 as an identifier instead of preferring the combination cls+Number.

Does Irony have a general problem working with languages which don't have clear rules for how to divide tokens?

The regular expression approach got messy very quickly when trying to deal with other occurrences of the same issue - so I've had to abandon that.
e.g. 10x=5mod6
which is perfectly valid in the old BASIC interpreters - presumably because they didn't use the same kinds of parsing techniques everyone employs now.
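One plausible way to picture that older style of scanning is a keyword-first tokenizer: at every position it tries the keyword table before anything else, so keywords are recognised even when glued to digits or letters. The sketch below is plain C# with no Irony involved; the class name, the three-entry keyword table, and the exact rules are assumptions made for illustration, not a description of any particular BASIC interpreter:

```csharp
using System;
using System.Collections.Generic;

public static class KeywordFirstScanner
{
    // Hypothetical keyword table; a real BASIC would list many more.
    static readonly string[] Keywords = { "cls", "let", "mod" };

    // Returns the keyword that starts at position i (ignoring case), or null if none does.
    static string KeywordAt(string text, int i)
    {
        return Array.Find(Keywords, k =>
            string.Compare(text, i, k, 0, k.Length, StringComparison.OrdinalIgnoreCase) == 0);
    }

    public static List<string> Scan(string text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            char c = text[i];
            string keyword = KeywordAt(text, i);
            int start = i;

            if (char.IsWhiteSpace(c))
            {
                i++; // skip blanks
            }
            else if (keyword != null)
            {
                tokens.Add(keyword.ToUpperInvariant()); // keywords win at every position
                i += keyword.Length;
            }
            else if (char.IsDigit(c))
            {
                while (i < text.Length && char.IsDigit(text[i])) i++;
                tokens.Add(text.Substring(start, i - start));
            }
            else if (char.IsLetter(c))
            {
                // An identifier ends at a non-alphanumeric character or where a keyword begins.
                do { i++; }
                while (i < text.Length && char.IsLetterOrDigit(text[i]) && KeywordAt(text, i) == null);
                tokens.Add(text.Substring(start, i - start));
            }
            else
            {
                tokens.Add(c.ToString()); // single-character operator or punctuation
                i++;
            }
        }
        return tokens;
    }
}
```

With this sketch, Scan("10x=5mod6") yields the tokens 10, x, =, 5, MOD, 6, and Scan("cls5") yields CLS, 5. The price is that no identifier can contain a keyword: "model" comes out as MOD followed by el.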
Coordinator
Apr 6, 2013 at 6:10 PM
There's a flag, KeyTerm.AllowAlphaAfterKeyword. Set it to true for the 'CLS' keyword/reserved word; this allows 'cls5' to be parsed as two separate tokens.
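A minimal sketch of that suggestion, assuming the flag is spelled AllowAlphaAfterKeyword on KeyTerm in your Irony build. The class name ClsKeywordSketch and the trimmed-down rules are hypothetical, and this has not been run against the full grammar posted earlier in the thread:

```csharp
using Irony.Parsing;

// Hypothetical cut-down grammar: only the CLS statement and a simple assignment.
public class ClsKeywordSketch : Grammar
{
    public ClsKeywordSketch() : base(false) // false = case-insensitive
    {
        var number = new NumberLiteral("NUMBER", NumberOptions.IntOnly);
        var variable = new IdentifierTerminal("Identifier");

        // Create the keyword explicitly so the flag can be set on it.
        KeyTerm cls = ToTerm("cls");
        cls.AllowAlphaAfterKeyword = true; // intended to let "cls5" scan as "cls" then "5"
        MarkReservedWords("cls");          // keyword takes precedence over identifiers

        var clsStmt = new NonTerminal("CLS_STMT");
        var assignStmt = new NonTerminal("ASSIGN_STMT");
        var statement = new NonTerminal("STATEMENT");

        clsStmt.Rule = cls + number | cls;
        assignStmt.Rule = variable + "=" + number;
        statement.Rule = clsStmt | assignStmt;
        Root = statement;
    }
}
```

A likely side effect worth testing: with the flag set, an identifier that merely begins with "cls" (for example clsx) would presumably also be split after the keyword.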