An easier way to create AST nodes?

Feb 20, 2011 at 3:04 PM

First, congratulations on Irony, which makes writing scanners and parsers a breeze. It would be even better if the documentation were more thorough, but that's something the users should address (I plan on updating the wikibooks based on what I've understood and on the surprisingly helpful comments inside Irony itself).


However, I feel there should be an easier, less redundant way to create AST nodes (disclaimer: it's possible I'm entirely missing something here, feel free to correct me).

Right now you associate with a NonTerminal either a node type that will be instantiated while constructing the parse tree, or a delegate that will do custom instantiation.
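
Concretely, that association looks something like this (just a sketch: the exact constructor/delegate signatures, and whether the delegate is passed to the constructor or set via a property, have moved around between Irony versions; MyAstNode is a placeholder class of mine):

    public class MyAstNode { }   // placeholder node class

    // node type: Irony instantiates it while building the tree
    var Stmt = new NonTerminal("Stmt", typeof(MyAstNode));

    // delegate: do the instantiation yourself
    var Stmt2 = new NonTerminal("Stmt", (context, parseNode) =>
    {
        parseNode.AstNode = new MyAstNode();   // attach whatever node you build
    });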

That's fine, but inside the node you have to walk the parse tree, essentially duplicating the work already done in the grammar itself.


For instance:

            Try.Rule = TRY + Block
                       | TRY + Block + Catch
                       | TRY + Block + FINALLY + Block
                       | TRY + Block + Catch + FINALLY + Block;

Associating TryNode with this rule, I have to walk the node tree inside the Init method, essentially restating the 4 possible productions. I could split these productions across 4 different non-terminals and specify a different delegate for each, but that's tedious.
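
To make the redundancy concrete, the node ends up doing something like this (a sketch: the exact Init signature depends on the Irony version, and it assumes the TRY/FINALLY keywords are not marked as punctuation, so they still appear as children):

    using Irony.Parsing;

    public class TryNode
    {
        public ParseTreeNode TryBlock, CatchClause, FinallyBlock;

        public void Init(ParseTreeNode treeNode)
        {
            var c = treeNode.ChildNodes;   // shape depends on which production matched
            TryBlock = c[1];               // child 0 is the TRY keyword
            switch (c.Count)
            {
                case 3: CatchClause = c[2]; break;                        // TRY Block Catch
                case 4: FinallyBlock = c[3]; break;                       // TRY Block FINALLY Block
                case 5: CatchClause = c[2]; FinallyBlock = c[4]; break;   // TRY Block Catch FINALLY Block
            }
        }
    }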

The ideal thing would be to have actions inside the production rule (like Bison does), together with some kind of matching construct.

jb evain suggested using something like:

            Try.Rule = TRY + Block   > (a,b) => delegate taking the two items
                       | TRY + Block + Catch > (a,b,c) => ...
                       | TRY + Block + FINALLY + Block > (a,b,c,d) => ...
                       | TRY + Block + Catch + FINALLY + Block > (a,b,c,d,e) => ... ;

But unfortunately this syntax isn't possible in C#, since you can't overload operators in a generic way (and a bare lambda has no type until it is converted to a specific delegate type).

In a more classic way,

Try.Rule = ...

Try.Rule.Actions = new []{(a,b) => ..., (a,b,c) => ..., (a,b,c,d) => ..., (a,b,c,d,e) => ...};

where the entries of the Actions array map, in order, to the |'ed alternatives.
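
Since the delegates would need a uniform signature to share one array in C#, a workable shape might look like this (purely hypothetical: nothing like Rule.Actions exists in Irony today, and these TryNode constructors are made up):

    // Hypothetical only: Rule.Actions is not part of Irony, and the TryNode
    // constructors are invented for the illustration. Each entry corresponds,
    // in order, to one |'ed alternative and receives that alternative's children.
    Try.Rule.Actions = new Func<ParseTreeNodeList, object>[]
    {
        c => new TryNode(c[1]),                        // TRY Block
        c => new TryNode(c[1], catchClause: c[2]),     // TRY Block Catch
        c => new TryNode(c[1], finallyBlock: c[3]),    // TRY Block FINALLY Block
        c => new TryNode(c[1], c[2], c[4]),            // TRY Block Catch FINALLY Block
    };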

I'm currently porting the CoffeeScript compiler to .NET (using Irony); the original CoffeeScript parser uses a very neat grammar DSL over jison, jison being a JavaScript port of Bison.


Feb 21, 2011 at 12:25 AM

The idea is to have a single "+" expression for non-terminals like Try, with optional elements. In that case the list of child nodes passed to Init always has the same length, but some of the children can be empty (have zero child nodes, meaning they are non-existent in the source code).

So modify the rules for CATCH and FINALLY to allow them to be empty. As for the ending block, define a block_opt non-terminal. You may then need to add one check that the grammar no longer enforces: two blocks with no CATCH or FINALLY between them are not allowed; put this check in the AST node's Init method.
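
In grammar code that would look roughly like this (a sketch inside a Grammar subclass; TRY, FINALLY, Block and Catch stand for terms your grammar already defines, and the *Opt names are just illustrative):

    using Irony.Parsing;

    public class TryGrammarSketch : Grammar
    {
        public TryGrammarSketch()
        {
            var TRY     = ToTerm("try");
            var FINALLY = ToTerm("finally");
            var Block   = new NonTerminal("Block");
            var Catch   = new NonTerminal("Catch");
            // rules for Block and Catch omitted here

            var Try        = new NonTerminal("Try");
            var CatchOpt   = new NonTerminal("CatchOpt");
            var FinallyOpt = new NonTerminal("FinallyOpt");
            var BlockOpt   = new NonTerminal("BlockOpt");

            CatchOpt.Rule   = Catch | Empty;     // empty node when there is no catch clause
            FinallyOpt.Rule = FINALLY | Empty;   // the keyword may be absent
            BlockOpt.Rule   = Block | Empty;     // the block that follows FINALLY

            Try.Rule = TRY + Block + CatchOpt + FinallyOpt + BlockOpt;

            // What the grammar no longer enforces: the trailing block is only legal
            // when FINALLY is present; that check moves into the AST node's Init.
        }
    }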

Hope this helps

Feb 21, 2011 at 3:54 PM

Thanks for replying.

So basically, if I understand correctly, I have to come up with a single production rule for each of my potential AST nodes. I'm not sure that makes the rules very readable, and some of the Irony samples don't go that way (the C# grammar or the Java one, for instance).

Moreover, going this way makes using the default Irony AST nodes very tedious, since the Init methods of those nodes assume a specific structure in the ParseTreeNode they are passed.

So if your production rule is even slightly different from the one hard-coded in the node, you have to either specify a delegate that acts as an adapter between the two structures, or provide your own homebrew AST node, which then has to do the manual matching.

I have the impression that some way to embed the matching rules inside the grammar itself (and not in the node) would make the code a lot less redundant, and would also make the nodes less dependent on the exact parse-tree structure produced by the rules.

I'm not saying it's easy, or even feasible, but the otherwise very nice experience of writing grammars in Irony becomes harsher once it gets to building an AST.





Feb 23, 2011 at 6:02 AM

Believe it or not, I've gone through a similar path of thinking, and finally decided to settle on the solution/recommendation I explained before. Here's the reasoning.

The parser/scanner for a language might be just a verifier (for syntactic correctness) or a token-type recognizer (for a code colorizer). In these simple cases you don't need an AST; the parse tree is enough.

If you build an interpreter, you need an AST. Next, an AST node of a specific type represents the most generic form of a language construct. Let's look at TRY, but the C# version; it's easier.

Your AstTryNode would probably have properties containing the sub-elements of the try: TryBody, CatchBlockList (a list of nodes), FinalBlock (optional).

CatchBlockList might be an empty list, or FinalBlock might be empty, but AstTryNode must still have these properties. Now, the main function of the parser becomes not just describing a language in some "readable" way, but MAPPING the input text of the Try-block to the elements/properties of AstTryNode. You have to have this mapping somewhere. You may describe the BNF rule for a node in several language-equivalent ways, but if it is a set of variations, you'll have to do extra mapping work. I reasoned that the best and easiest way to do this is to build one generic definition of the construct: a Try-Catch-Finally block is a TryBlock followed by zero or more CatchBlocks followed by an optional FinallyBlock. Then the mapping inside the AST node initializer becomes trivial.
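
For instance, with a fixed child layout of [TRY, TryBlock, CatchList, FinallyClauseOpt] the Init body is just a few positional assignments (a sketch: the property names come from the description above, and the Init signature differs between Irony versions):

    using System.Collections.Generic;
    using Irony.Parsing;

    public class AstTryNode
    {
        public ParseTreeNode TryBody;
        public List<ParseTreeNode> CatchBlockList = new List<ParseTreeNode>();
        public ParseTreeNode FinalBlock;                    // stays null when absent

        public void Init(ParseTreeNode treeNode)
        {
            var c = treeNode.ChildNodes;
            TryBody = c[1];
            foreach (var catchClause in c[2].ChildNodes)    // zero or more catch clauses
                CatchBlockList.Add(catchClause);
            if (c[3].ChildNodes.Count > 0)                  // empty node = absent in source
                FinalBlock = c[3];
        }
    }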

That seems kinda like giving up some freedom in grammar expressions, but you have to remember that the goal is AST construction, not nice-looking rules.

There is one big problem with this approach, and I still do not have a good solution for it. Writing a rule as a sequence of optional sub-clauses works OK if each clause has a distinctive starting keyword, like the TRY construct. If there are no such keywords, you get shift/reduce conflicts (this is a limitation of the LALR parsing algorithm, not of Irony per se).
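
For example, two adjacent keyword-less optional clauses are already enough to conflict: on seeing an identifier the parser cannot tell whether to shift it into the first clause or reduce that clause to empty so the second one can take it. A made-up fragment (inside a Grammar subclass) shows it:

    var AOpt  = new NonTerminal("AOpt");
    var BOpt  = new NonTerminal("BOpt");
    var Stmt  = new NonTerminal("Stmt");
    var Ident = new IdentifierTerminal("Ident");

    AOpt.Rule = Ident | Empty;               // optional, but no leading keyword
    BOpt.Rule = Ident | Empty;               // same shape immediately after it
    Stmt.Rule = AOpt + BOpt + ToTerm(";");   // LALR(1) reports a shift/reduce conflict here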

That's what happened with the C# and Java grammars; that's why you see all those expanded expressions with variations instead of a single sum of optional elements.

But for these grammars we do not need ASTs; we just needed recognition of constructs, for colorizing the text.

Not sure how satisfying my explanation is, but that's all I have. In any case, I do not have a better solution; I tried, but could not find anything more satisfying.