Jan 28, 2008 at 7:12 PM
Edited Jan 28, 2008 at 7:24 PM
I’ve struggled to reliably identify a cell reference type by defining a specific terminal class. The problem is with cell references of this type:


It’s a problem because, to parse this terminal, it must be allowed to include + and –, which are also operators. The challenge is working out when the terminal has finished. Of course, complex rules can be written to terminate parsing the terminal at the final +1, but this is awkward and not declarative; if I wanted to write this difficult-to-maintain code I’d not use Irony. This has forced me back to a grammar-based approach, and to understanding why my earlier use of a grammar-based approach failed. In investigating the reason, I think I’ve uncovered a weakness in Irony: the rationale for terminal selection can have more to do with side effects of the algorithms used than with explicit grammar rules.

Let me provide an example to work through as it’s trivial. Here’s an expression:


An Excel expression is made up of many parts (though only one is used in this case). Part of the expression is defined as:


Where, of course, all these NTs are defined. For example:


A1COLUMNREFERENCE is a sequence of characters (A..IV) and ROWREFERENCE is an integer number with or without a unary operator. I used custom terminal instances to define both of these terminals.

The problem is that Irony parses the “1” in the A1 as a FLOAT. At one level you can see that “1” can be a float, but the grammar rules should dictate that what follows A1COLUMNREFERENCE is ROWREFERENCE, not FLOAT. So why is Irony finding a float? Custom terminals return null from GetPrefixes(). Because of this, they are added to all the entries in _data.TerminalsLookup at the end of the BuildTerminalsLookupTable() method. Because TerminalsLookup is a Dictionary<> and because these entries are added after the fact, they appear at the end of the terminals list for any given character. This might not matter, but the token returned by ReadToken is the first token found in TerminalsLookup for a character (the character "1" in this case). Of course, if another terminal matches more characters it will be taken, but that isn't the case here.

The first token for “1” will always be FLOAT because it’s the only terminal with a prefix for numerals. Both FLOAT and ROWREFERENCE will read the same number of characters, but because FLOAT comes first in the list of terminals for "1" it will always “win”.
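The tie-break being described can be illustrated with a minimal maximal-munch scanner sketch (Python; the terminal names are from this thread, but the code is an illustration of the behaviour, not Irony's actual implementation): when two terminals match the same number of characters, the one earlier in the lookup list wins.

```python
import re

# Minimal maximal-munch scanner sketch (an illustration, not Irony's code).
# Each "terminal" is (name, regex). Order in the list is the tie-breaker.
TERMINALS = [
    ("FLOAT", re.compile(r"[+-]?\d+(\.\d+)?")),
    ("ROWREFERENCE", re.compile(r"[+-]?\d+")),  # added after FLOAT, so it loses ties
]

def read_token(text, pos):
    best = None
    for name, rx in TERMINALS:
        m = rx.match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            # Only a strictly longer match displaces the current best,
            # so on an equal-length match the earlier terminal wins.
            best = (name, m.group())
    return best
```

For the input "1", both terminals match exactly one character, so FLOAT wins purely by list position - the effect described above.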

My hack is to create a specific Terminal instance for ROWREFERENCE so that it can return a valid prefix list. However, even this is not enough. If my terminal name for FLOAT is “float” and for ROWREFERENCE is “rowreference”, then FLOAT will still win because the terminals in the list data.Terminals are held in alphabetical order. The only way to have ROWREFERENCE take precedence is to set its name to, say, “arowreference”. This is a hack and seems like the wrong thing to do. Anyway, it will backfire when reading a genuine numeric constant, where I do want FLOAT to win.

It seems to me that the scanner is the wrong place to be making judgements about the suitability of one terminal over another, because the scanner (or at least the Terminal implementations) doesn’t have access to the context information needed to decide which terminal is more appropriate. This information is in the grammar, and surely the decision about which is the “correct” token ought to be deferred until the parser has a chance to look at all the tokens in context.

I guess this is somewhat like method overloading: just as a compiler must choose the correct method based on its argument signature, Irony could work out the best terminal to use based on the best fit with the grammar.

An alternative is to provide the terminal implementations with the previous token so they can "see" their own context. In this case FLOAT would not see a preceding operator or open paren, so it would fail; ROWREFERENCE would see a preceding A1COLUMNREFERENCE and would succeed.
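A sketch of that idea (Python, with made-up function names and token types; this is not Irony's API): each terminal's match function also receives the previous token's type and can decline even when the characters themselves would match.

```python
import re

# Sketch of context-aware terminals (hypothetical, not Irony's API):
# each matcher receives the previous token's type and may decline.
def match_float(text, pos, prev_type):
    # A numeric constant is only plausible after an operator, "(", ","
    # or at the start of the formula.
    if prev_type not in (None, "OPERATOR", "LPAREN", "COMMA"):
        return None
    m = re.match(r"\d+(\.\d+)?", text[pos:])
    return ("FLOAT", m.group()) if m else None

def match_row_reference(text, pos, prev_type):
    # A row reference only makes sense right after a column reference.
    if prev_type != "A1COLUMNREFERENCE":
        return None
    m = re.match(r"[+-]?\d+", text[pos:])
    return ("ROWREFERENCE", m.group()) if m else None
```

With this arrangement, after an A1COLUMNREFERENCE token only match_row_reference accepts the "1", which is exactly the disambiguation described above.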

Jan 29, 2008 at 5:16 AM
First, the answer to your problem of explicit ordering of terminals is already there, in changeset 6381. For each terminal you can explicitly set the Priority property (0 by default), and terminals with higher priority (a greater number) will come first in the lists. I recognized from the beginning that this explicit prioritizing would be needed; I simply didn't get to this code until the last source update. What is not there yet is functionality in the Scanner to IGNORE all terminals with lower priority when a terminal with higher priority yields a token. Expect it in the next version.
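The effect of such a Priority property can be sketched like this (Python; a conceptual illustration of the ordering rule described, not Irony's code): sorting by descending priority puts the prioritized terminal ahead of FLOAT regardless of alphabetical name order, removing the need for the "arowreference" naming hack.

```python
# Sketch of explicit terminal priorities (mirroring the Priority
# property described above; not Irony's actual code). Higher priority
# sorts first, so it wins equal-length ties regardless of name order.
terminals = [
    {"name": "float", "priority": 0},
    {"name": "rowreference", "priority": 10},
]
ordered = sorted(terminals, key=lambda t: -t["priority"])
```

Here "rowreference" ends up first in the list even though "float" sorts earlier alphabetically.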
However, I still think that this approach of treating a cell reference as a non-terminal with multiple tokens inside is not the right thing to do, and I advise you to try again writing a custom terminal.
First, there's one basic thing about tokens: they can be separated by an arbitrary number of whitespace characters, and the Scanner simply ignores whitespace. Considering this, representing a cell reference with multiple tokens might be incorrect; as far as I understand, it is the absence of spaces in "R1C+1" that makes it a cell reference and not an addition expression.
You don't like a custom terminal because "complex rules can be written to terminate parsing the terminal at the final +1 but this is awkward and not declarative. If I want to write this difficult-to-maintain code I’d not use Irony." However, please understand that you have to write some tricky thing anyway (in C# code or in BNF rules), and not because of Irony's limitations: it is because a cell reference looks so similar to an arithmetic expression. It's a very unusual arrangement compared to the expression syntax you find in other languages, but the Excel designers had their reasons: to provide a flexible and simple way for users to reference cells in formulas.
To make the code simpler, try the RegEx terminal: build a regular expression for a cell reference and use that Terminal. Note, I've never tested this terminal, so you may have to debug it (sorry!), but I envisioned it for cases like this, when a terminal has some intrinsic regular structure, so it may fit well for your case.
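As one illustration of the regex route, an A1-style reference along the lines of the grammar described earlier (one or two column letters, then an optionally signed row number) could be approximated like this (Python; the exact pattern is my assumption, not the poster's grammar, and it does not reject columns beyond IV):

```python
import re

# Rough regex for an A1-style cell reference: 1-2 column letters
# (A..IV covers the classic 256 columns, though this pattern accepts
# any two letters) followed by an optionally signed row number.
CELL_REF = re.compile(r"[A-Z]{1,2}[+-]?\d+", re.IGNORECASE)

def is_cell_ref(s):
    # fullmatch ensures the whole string is the reference,
    # with no trailing operators or other characters.
    return CELL_REF.fullmatch(s) is not None
```

A single pattern like this keeps the "+1" inside one token, which is the whole point of handling the reference in the scanner rather than the parser.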
thanks, and let me know how it goes
Jan 29, 2008 at 2:50 PM
Thanks for the update, I'll take a look. For the time being I'm experimenting with passing the last token to the TryMatch() function so that a terminal implementation can decide, from a little bit of context, whether it should be used, returning null even if it could produce a valid token. Seems to work.

Bear in mind that I'm using the Excel formula as a test case to learn Irony, work out whether it's something I really can use and, if so, understand its flaws (all software has flaws and it's better the devil you know). I'm focused on parsing cell references with a grammar because it's a good experiment. It's easy to see a cell reference as a terminal, but they are really little expressions in their own right. The BNF grammars are defined, and if cell references are created from grammar definitions, the details of a cell definition can be learned by walking the resulting node tree. The alternative is complex little terminals that parse the characters and expose properties to access row and column values. Also, cells (a rowoffset plus a columnoffset reference) form into more complex structures:

range: cell1:cell2 | column:column | row:row
intersection: range1 range2
union: range1,range2

These are more complex and are also defined with grammars, so the line has to be drawn somewhere. If these are worthy of a grammar, why not a cell? If a cell terminal is to be written, why not a range terminal? If a range is a terminal, why not just write a parser by hand? Again, there's a real benefit to Irony, so I'm going to work out how to adapt Irony to be able to describe a cell using a grammar. Wish me luck.
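The range/intersection/union grammar above can be sketched as a tiny recursive-descent parser (Python, over a pre-tokenized list; token shapes and node names are made up for illustration and are not from the thread):

```python
# Sketch of the reference grammar above, over a pre-tokenized list:
#   union        := intersection ("," intersection)*
#   intersection := range range*        (juxtaposition = intersection)
#   range        := CELL (":" CELL)?
def parse_range(tokens, i):
    cell = tokens[i]
    i += 1
    if i < len(tokens) and tokens[i] == ":":
        return ("range", cell, tokens[i + 1]), i + 2
    return ("cell", cell), i

def parse_intersection(tokens, i):
    node, i = parse_range(tokens, i)
    while i < len(tokens) and tokens[i] != ",":
        right, i = parse_range(tokens, i)
        node = ("intersect", node, right)
    return node, i

def parse_union(tokens, i=0):
    node, i = parse_intersection(tokens, i)
    while i < len(tokens) and tokens[i] == ",":
        right, i = parse_intersection(tokens, i + 1)
        node = ("union", node, right)
    return node, i
```

Walking the resulting tuples then gives access to the individual cells, which is the "walk the node tree" approach argued for above.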

Jan 29, 2008 at 7:52 PM
Believe it or not, Irony's scanner started exactly this way. I thought: as long as I have the infrastructure for declaratively specifying grammars (for the parser), why not use it for the Scanner and specify token structure in the same way? It is the way Lex does it. And I abandoned it. The reason: performance sucked. As nice as it sounds to specify token structure declaratively, it is certainly not as fast as the direct C# code that you have now in the Terminal class. It is well known that the slowest part of a compiler is scanning, so I decided to move to custom-coded terminals. You have to make trade-offs; a declarative but slow scanner would certainly please some people but would definitely make Irony unusable for a bigger crowd. And writing custom C# code in a terminal isn't that bad - you have the power of a real programming language!
Another note: Token now has an Attributes dictionary - this is the place to store row/column numbers once you've parsed the token. Alternatively, you can define a mini-class with row and col properties and stick an instance into the Value property of the Token.
As for providing context information to the scanner (or parser, or token filter), in case you have to do it: the suggested mechanism in Irony is the CompilerContext.Values dictionary. The CompilerContext parameter passed everywhere is the place for storing context, so you can set some value in the parser or some terminal and then check it in another place. But I hope in your case you can live without it.
Excel formulas are a tough choice for testing Irony. I know I may sound too protective about my baby project, but please don't forget that it is the "unusual nature" of Excel formulas that gives you trouble; you would have the same problems with other parsers.
Jan 30, 2008 at 1:08 AM
Edited Jan 30, 2008 at 12:27 PM
You're OK being protective about your baby. You've more experience at this than me and you know the code. There's no way to get away from custom terminals, and I now have a core set of six: row ref (A1 and RC), col ref (A1 and RC), and a float and an integer that understand when they should NOT be used. The float and integer are re-implementations of NumberTerminal, but the others inherit from IdentifierTerminal and use RegEx to work out whether a part is valid. I take your point about performance. My new terminals are evaluated repeatedly - most often to rule out their use. Hand-scanning the source would be faster - but more (and probably less modular) code. It's great to be able to define an option for the formula to include a function with an argument list using a simple declaration like:

NAME + "(" + ARGUMENT + WithStar(COMMA + ARGUMENT) | Empty + ")"

How great is that! Just a little bit easier than writing it by hand - and easier to comprehend. Fortunately, in the scenario I have in mind, scanning/parsing performance is not an issue, so keeping the code easy to maintain is why I'm so obsessed with using the grammar as much as possible.
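For comparison, the hand-written equivalent of that one declarative rule would look roughly like this (a Python sketch over a pre-tokenized list, with made-up names) - exactly the boilerplate the grammar expression saves:

```python
# Hand-rolled equivalent of NAME "(" [ARGUMENT ("," ARGUMENT)*] ")"
# over a pre-tokenized list - the boilerplate the one-line rule avoids.
def parse_call(tokens):
    name = tokens[0]
    assert tokens[1] == "("
    args, i = [], 2
    if tokens[i] != ")":
        args.append(tokens[i])
        i += 1
        while tokens[i] == ",":
            args.append(tokens[i + 1])
            i += 2
    assert tokens[i] == ")"
    return name, args
```

Even this toy version has to handle the empty argument list and the comma loop explicitly, which the declarative rule expresses in one line.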

Thanks for the heads-up about CompilerContext.Values.

Yes, Excel formulas are "unusual", but they're a good proxy for a formula-oriented syntax I have to use in a project. Irony seems to be able to cope. The real challenge is for me to use Irony correctly. It's good that it can be adapted to handle these formulas, simple expressions, or whole syntaxes like Python.

Thanks for your help.
Jan 30, 2008 at 5:43 AM
by the way, the function's formula can be written even simpler:

NAME + "(" + ARGUMENT.Star(",") + ")"

take care
Jan 30, 2008 at 12:33 PM
So I've updated to 6381 and there are a lot of changes! The CompilerContext is especially useful, as the previous token is available in context.Compiler.Parser.CurrentToken while the next token is being evaluated.

I have come across one change that's a problem for me. Because I know the grammar I'm aiming at, I've created placeholder NTs for each NT that will be used - some of them don't have a Rule yet. These NTs cause a problem in CreateProductions() on line 172, because it assumes Rule will always be valid. In a final version that will be true, but perhaps not during development. Adding a null check at the beginning of this line works for me.