Parsing the contents of *some* string literals

Sep 25, 2009 at 3:44 PM

I've got a grammar that 90% of the time just treats string literals as string literals. However, if a string starts with a particular prefix (let's say #), then I want to parse the contents of that string. For example:

"{foo:bar}"

should evaluate to just a string literal token, but

#"{foo:bar}"

should have its contents parsed.

When I try to parse the second example with my grammar, Irony complains that it was expecting a " (at the point where there IS a "), but I can see that the string literal has already been tokenised. No amount of playing with precedence or priority seems to help. If I remove all string literals from my grammar, then the contents of the string are parsed.

Is this one of the cases mentioned at http://irony-roman.blogspot.com/2009/09/time-to-talk-about-future.html? A sub-grammar? Or am I just doing it wrong?

If it's not implemented yet, can I help? If this feature is implemented, how do I use it?

P.S. awesome awesome project. 5 stars!

Sep 25, 2009 at 9:15 PM

You may have to create a custom terminal/literal to handle the functionality that you are describing. I believe Irony examines each character that hasn't been tokenized. Then it creates a list of potential terminals, which it iterates through until one is successful. However, if it's not successful, it moves on to try and determine what you meant.

In your first scenario, Irony comes to a quotation mark and says, "this is a StringLiteral", so it returns a StringLiteral and jumps past the closing quotation mark to continue parsing. In the second scenario, however, Irony comes to the # and determines that there is no matching terminal. Then it looks at the next character and says, "I think you meant this to be a StringLiteral".

Coordinator
Sep 25, 2009 at 9:43 PM

Thanks for the suggestion, MindCore, but I hope in this case there may be a solution without a custom terminal. I don't know how you arranged your grammar, but here's my guess. You created a StringLiteral terminal as usual. You also created a non-terminal for the fancy #-tagged string:

var strLit = new StringLiteral("String", "\"");

var exprStr = new NonTerminal("ExprString");

exprStr.Rule = "#\"" + Expr + "\"";

where Expr is the expression non-terminal you already use for other, normal expressions in your grammar. I guess Expr also allows expressions involving normal literal strings (strLit), right? Then here's the problem. The parser encounters the ending quote and cannot decide what it is: the beginning of a strLit literal, or the ending quote of the exprStr structure. I bet the logic of your language does not actually allow a strLit there! So what you need to do is define a separate expression non-terminal describing what can actually appear inside the #-tagged string (which excludes strLit):

exprStr.Rule = "#\"" + embeddedExpr + "\"";

embeddedExpr.Rule = (....);

In this case the Irony scanner will stop at the ending quote and try to decide which terminal to match. It will have two candidates (strLit and the stand-alone double-quote symbol ending exprStr). It will then ask the parser which of these two can be expected here, and the parser will respond: only the closing double quote!
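The hand-off described above can be sketched as a toy model. This is not Irony code: the `TERMINALS` table, the `scan` function and the `expected` argument are all made up for illustration, and Python is used only to keep the sketch self-contained. The scanner tries every candidate terminal at the current position and takes the longest match, unless the parser narrows the candidates to what it can accept at that point.

```python
import re

# Toy candidate terminals (hypothetical names, not Irony types).
TERMINALS = {
    "strLit": re.compile(r'"[^"]*"'),  # a whole quoted string
    "quote": re.compile(r'"'),         # a stand-alone double-quote symbol
}


def scan(text, pos, expected):
    """Return (name, lexeme) for the longest match among the expected terminals.

    `expected` stands in for the parser's answer to "which terminals are
    acceptable in the current state?". Returns None if nothing matches.
    """
    best = None
    for name in expected:
        match = TERMINALS[name].match(text, pos)
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best


text = 'label="FooBar(string s)"'
pos = text.index('"')

# No help from the parser: the longer strLit match wins.
unassisted = scan(text, pos, ["strLit", "quote"])

# The parser only accepts a quote symbol here, so strLit is never tried.
assisted = scan(text, pos, ["quote"])
```

In the unassisted call the whole quoted text is swallowed as one string literal, which is the symptom reported at the top of the thread; in the assisted call only the opening quote is consumed and the contents remain available for parsing.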

Try this, and let me know if it works.  

 

Coordinator
Sep 25, 2009 at 10:23 PM

One more comment. robfe, to answer your question about whether subgrammars are needed in this case: in general, yes, that would require a subgrammar if you had free-form string content with embedded expressions, like in Ruby. But if there is ONLY an expression inside, without anything else (which is your case, as far as I understood), then it might work as I described.

 

Sep 28, 2009 at 9:11 AM

Hi Roman

Thanks for your help. My "embeddedExpr" is already a nonterminal that cannot contain quotes. I think I will have to bite the bullet and post my whole grammar here. If you have time to look at it, I'd really appreciate it.

And here's the constructor for my grammar:

        public ReproGrammar()
        {
            var identifier = new IdentifierTerminal("identifier");
            var stringLiteral = new StringLiteral("stringLiteral", "\"");

            //nonterminals
            var graph = new NonTerminal("graph");
            var ruleList = new NonTerminal("ruleList");
            var rule = new NonTerminal("rule");
            var state = new NonTerminal("state");
            var transition = new NonTerminal("transition");
            var attributes = new NonTerminal("attributes");
            var attributeList = new NonTerminal("attributeList");
            var attribute = new NonTerminal("attribute");
            var labelAttribute = new NonTerminal("labelAttribute");
            var parameterList = new NonTerminal("parameterList");
            var parameters = new NonTerminal("parameters");
            var parameter = new NonTerminal("parameter");
            var genericAttribute = new NonTerminal("genericAttribute");
            var value = new NonTerminal("value");

            //rules
            graph.Rule = "digraph" + identifier + "{" + ruleList + "}";
            ruleList.Rule = MakeStarRule(ruleList, rule);
            rule.Rule = (state | transition) + (Empty | ";");
            state.Rule = identifier + attributes;
            transition.Rule = identifier + "->" + identifier + attributes;
            attributes.Rule = Empty | ("[" + attributeList + "]");
            attributeList.Rule = MakeStarRule(attributeList, Symbol(","), attribute);
            attribute.Rule = labelAttribute | genericAttribute;
            labelAttribute.Rule = Symbol("label") + "=" + "\"" + identifier + parameters + Symbol("\"");
            parameters.Rule = Empty | ("(" + parameterList + ")");
            parameterList.Rule = MakeStarRule(parameterList, Symbol(","), parameter);

            parameter.Rule = identifier + identifier;

            genericAttribute.Rule = identifier + "=" + value;

            value.Rule = identifier | stringLiteral;


            RegisterPunctuation("{", "}", ";", "[", "]", "->", ",", "=", "\n");
            MarkTransient(ruleList, rule, attributes, value, parameters);

            NonGrammarTerminals.Add(new CommentTerminal("SingleLineComment", "//", "\r", "\n", "\u0085", "\u2028", "\u2029"));
            NonGrammarTerminals.Add(new CommentTerminal("DelimitedComment", "/*", "*/"));

            Root = graph;
        }

I'm trying to treat any node or edge attribute (the comma-separated key-value pairs within the square brackets) with a name of "label" differently to the others, parsing its contents as a formal C# method definition. I'm using the latest release of Irony, since there's the comment "All AST/Interpreter stuff is a work in progress, don't try to use it." in trunk.

I'm trying to parse the following string:

digraph g{x[foo="bar",label="FooBar(string s, int i)"] y; z x->z}

(it's an extension of DOT), but the scanner is picking up a string literal when it gets to the value of the label.

I'll be using this in a new codeplex project, which I'm yet to publish. Cheers - Rob

Coordinator
Sep 28, 2009 at 3:42 PM

OK, several things. First of all, forget about the downloadable release version and move to the latest sources. The interpreter part already works to some extent, but I bet you don't need it at all. AST node construction works all the way.

Now, how to fix it. The easiest thing to try is to mark the "label" symbol as a reserved word in your language, because it is one, as far as I understand: having "label" as an attribute name signals that its content string should be parsed. Use the MarkReservedWords() method in the Grammar class. I think it will then work, because with the parser-scanner link the parser will assist the scanner, but again, only in the latest sources.

Finally, if you control the language (as you say, it is a new CodePlex project), then why don't you slightly change the rules and simply use single quotes for strings that should be parsed? It would be easier for the parser, and I think easier for programmers. Or use some prefix for double-quoted strings, like "@"; that's how C# does it, only the opposite way around: string literals without @ trigger some interpretation, such as processing escape sequences, while strings preceded by "@" are accepted "as is".
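The prefix idea can be sketched as a toy decoder. This is not Irony or C# code: the `read_string` helper and its small escape table are hypothetical, it is written in Python only to stay self-contained, and it ignores details such as the doubled quotes of real C# verbatim strings. An unprefixed string has its escape sequences interpreted, while an "@"-prefixed one is taken as is.

```python
import re

# A deliberately small escape table, just enough for the illustration.
_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def read_string(token):
    """Decode a quoted token; an '@' prefix means 'take the contents as is'."""
    if len(token) >= 3 and token.startswith('@"') and token.endswith('"'):
        return token[2:-1]  # verbatim: no interpretation at all
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        body = token[1:-1]
        # Replace each backslash escape; unknown escapes keep the bare character.
        return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), body)
    raise ValueError("not a quoted string token: %r" % token)


interpreted = read_string(r'"a\tb"')   # the \t becomes a real tab
verbatim = read_string(r'@"a\tb"')     # backslash and 't' are kept
```

The same switch could just as well select "parse the contents" versus "keep as a plain literal", which is the distinction asked about in this thread.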

Let me know what works for you...

Roman

Sep 28, 2009 at 5:07 PM

Hi Roman

Thanks for all your suggestions, I will try them all out. AST construction is all I need.

I am in fact piggybacking off an existing DSL: "Dot" (http://www.graphviz.org/), which is already quite a loose language, and I want to be able to parse all Dot files as well as the files with my extra label markup.

Will let you know how it goes

Cheers - Rob

Sep 29, 2009 at 4:00 PM

The parser works perfectly even without having to register a keyword. All I had to do was upgrade to trunk.

It was a lot of work to convert over to the new way of building ASTs, though! You should do a release as soon as you can, so that people who are new to the project don't start working against a codebase that has a lot of pending breaking changes.

P.S. Love Irony. Will be linking to you from my new project :)

 

Oct 23, 2009 at 8:51 AM

Hi Roman, thanks for all your help. If you want to see how I use Irony it's over at http://flit.codeplex.com/

Coordinator
Oct 23, 2009 at 6:20 PM

Thanks for the link. As a suggestion, it might be helpful to provide some real-life example of Flit use. After browsing the Flit wiki pages for some time, I still couldn't get the idea of where and how it can be used. I'm actually in business software development (Irony is just an aberration), so I guess Flit is aimed at solving the problems we have in business application development, which are huge problems in fact. But I could not get the idea, sorry. Maybe provide a scenario, starting with a simple business process/action description such as "Customer X buys gadget Y" or "Invoice must be reviewed by relevant approvers", and then proceed with the technical details of how you can do this with Flit.

Thanks, and good luck with your project!

Roman