Disambiguating tokens

Nov 8, 2012 at 8:32 PM
Edited Nov 8, 2012 at 8:32 PM

Hi Roman,

How do I get the Scanner to disambiguate tokens based on greatest matching length?

Thanks very much.

Pieter

Coordinator
Nov 9, 2012 at 5:59 PM

Look at keywords handling, KeyTerm. The trick there is terminal priority that is computed by adding keyword length to base value, so longer keywords automatically get higher priority

Nov 9, 2012 at 7:01 PM
Edited Nov 9, 2012 at 7:05 PM

Not keyword length, or pattern length. How do I disambiguate tokens based on matching length, as for different types of NumberLiterals? Lex always disambiguated based on matching length, and broke ties based on an prioirty implied from definition order.

Coordinator
Nov 9, 2012 at 7:06 PM

I think it is done automatically to a certain degree. Look at Scanner.MatchTerminals method. If there are more than one terminals to match with the same priority, scanner selects one with the largest length

Nov 9, 2012 at 7:15 PM

Doesn't work. Consider the input

a := 4 + 5.0

in a grammar that need to define IntegerLiteral and FloatLiteral separately. If IntegerLiteral.Prioirty > FloatLiteral.Prioity, then "5" is incorectly recognized as a token and FloatLiteral is never tested, causing a parse failure on "."; otherwise, "4" is identifed as either a FloatLiteral or an IntegerLiteral in an uncontrolled fashion.

Coordinator
Nov 9, 2012 at 7:16 PM

yes, that's the idea - Priority comes first. Make the priority the same, then it should work

Nov 9, 2012 at 7:20 PM

I attempted this:

public class NumberLiteralX : NumberLiteral {
   public NumberLiteralX(string name) : base(name) {}
   public NumberLiteralX(string name, NumberOptions options) : base(name, options) {}
   public NumberLiteralX(string name, NumberOptions options, Type astNodeType)  
      : base(name, options, astNodeType) {}

   public override Token TryMatch(ParsingContext context, ISourceStream source) {
      var token = base.TryMatch(context, source);
      if (token != null) {
         base.Priority = this.Priority + 2*token.Length;
      }
      return token;
   }
   public new int Priority { get; set; }
}

but you have statically sorted the search list based on Priority at Init time, causing the idea to fail.

Nov 9, 2012 at 7:21 PM
Edited Nov 9, 2012 at 7:26 PM

The whole question arose when, with prioirty left at default, I saw "4" consistently [edit] being recognized as a FloatLiteral instea of as an IntegerLiteral. If it had worked by default, there would have been no question.

 

Nov 9, 2012 at 7:28 PM
rivantsov wrote:

yes, that's the idea - Priority comes first. Make the priority the same, then it should work


Think about that statement a bit.

If prioirty is the same, and matching length is the same, how does this provide the ability to predictably disambiguate between the two token definitions.

Nov 9, 2012 at 7:47 PM

Where are you sorting the token definitions? Perhaps use of an  unstable sort is causing the problem, since an implied prioirty based on definition order would be suitable.

Nov 9, 2012 at 8:02 PM
Edited Nov 9, 2012 at 8:03 PM

Ah-ha!

From the documenttion for List<T>.Sort() (my emphasis):

This method uses Array.Sort, which uses the QuickSort algorithm. This implementation performs an unstable sort; that is, if two elements are equal, their order might not be preserved. In contrast, a stable sort preserves the order of elements that are equal.

Coordinator
Nov 9, 2012 at 8:55 PM

Well, the idea is that you should never do this - at least that was basic thought in Scanner design. Do not declare FloatTerminal and IntTerminal in one grammar. Use float terminal and detect improper number after parsing. One exception is when the correct choice can be helped by parser, based on current parser state. In GwBasic grammar, the Open File statement allows file number which is integer only, so I define a separate int literal for this place, and parser gives scanner a hint to use it in this place (through back link Parser->Scanner).

Otherwise, my original thinking - do not do it. Scanner should be simple (stupid), and not have to decide about different kind of numbers you might have

Roman

PS one workaround is to define IntLiteral in grammar, but use only FloatLiteral in grammar rules; then hook to FloatLiteral.ValidateToken event, and for integers replace the Terminal value in the output token. 

Nov 9, 2012 at 9:26 PM
Edited Nov 9, 2012 at 9:33 PM

Yes, this a rare and uncommon situation for well designed grammars; but not an impossible one. I am working an a grammar that allows IntegerLiterals in a large number of places, and only in one particular construct allows either a Float or an Integer. The elegant solution for this grammar is two tokens, providing a clean, simple design. No, it is not a "programming language" in the sense you are used to, but that doesn't make it invalid.

Would it be too much to ask you to use a stable sort when sorting Terminals? I found two likely candidates on the web,  either using LINQ "orderby" or an Extension to the Array clas as described here:
http://stackoverflow.com/questions/148074/is-the-sorting-algorithm-used-by-nets-array-sort-method-a-stable-algorithm

As an additional argument, I believe it is incumbent on basic tools such as Irony.NET to internally always use deterministic algorithms when possible, which mandates a stable sort over an unstable one.

Pieter

Coordinator
Nov 9, 2012 at 9:30 PM

well, nonstable does not mean non-deterministic by the way. 

For your case, why don't you use Float in this one particular construct, and make Integer higher priority? then integers will always parse first, and in this one case parser will force to use Float

Nov 9, 2012 at 9:54 PM
Edited Nov 9, 2012 at 9:54 PM
rivantsov wrote:

well, nonstable does not mean non-deterministic by the way. 

Are you claiming that the MSDN documentation guarantees that all DOT NET implementations of Quicksort, for all hardware and software platforms, for all versions of the DOT NET framework past, present, and future, will produce the identical sorted list when presented with a given input list with duplicates. I find it highly doubtful that guarantee exists, and even more doubtful that it would be published even if internally intended. Even if the algorithm is deterministic to someone with access to the relevant source code, when the reliability of my parser is unpredictably dependent on variables outside my control, that is effectively non-deterministic where I stand.

As an aside:
For the designer of such an elegant tool, you are strangely persistent in forcing your 'clients' to adopt inelegant usages. Domain Specific Languages have a variety far wider than even that encompassed by COBOL-Python-Ruby-APL-Prolog-Lisp.  Being too focused on "programming languages" will only drive potential users back to the old Lex/Yaccc tools, which would be a shame.

Coordinator
Nov 9, 2012 at 10:13 PM

easy man, pleaase... before accusing me of 'forcing to adopt inelegant usages', think of what you are suggesting. Here's you initial post about sorting that started this issue on stable sort:

Where are you sorting the token definitions? Perhaps use of an  unstable sort is causing the problem, since an implied prioirty based on definition order would be suitable.

 

'..based on definition order'. WHAT is definition order?! the order in source code lines that declare terminals?! This does not exist for parser builder. Parser builder collects terminals from the rules trees starting from the root. The initial order of terminal list is therefore unpredictable. To force certain order, I provide Priority property. There is no 'initial' or definition order, period. And using 'stable' sort does not have any meaning. 

Believe me, spent a lot of time thinking and tweaking all these issues. 

Regards

Roman

 

Nov 9, 2012 at 10:24 PM

Okay! I take your point. Nobody ever accused me of being overly passive when I think I am right. (A subtly different subset from when I am right.) I had not been able to penetrate the code to that depth yet, and all I had seen was so far was TerminalList.Sort().

I cut my teeth on Lex & Yacc > 20 years ago, and old thought patterns sometimes die hard.  Let me think on it some more. I will also check out the back-link you described above.

Thank you for taking the time to work through this with me.

Pieter

Nov 10, 2012 at 12:52 AM

Roman,

Thank you again for your patience and thick skin this afternoon.

My mistake led from thinking that I needed the NonTerminal numeric defined as
numeric.Rule = IntegerLiteral | FloatLiteral;
with the situation simply growing worse as I simplified the test case to that up in the forum.

And from my determination to use NumberLiteral instead of falling back on safe, comfortable, well understood RegexBasedTerminal that you keeps trying to talk me out of.
;-)

You keep wondering why I use RegexBasedTerminal so frequently and by preference; I think much of it lies in this simple ratio:
RegexBasedTerminal: 72 lines with 4 conditionals.
NumberLiteral: 540 lines with 69 conditionals (if’s and switches)

After a 10 or 20 minute perusal, I was completely comfortable that I know how RegexBasedTerminal worked. After a few hours of studying NumberLiteral, I still don’t understand how or why it works.

Regards,

Pieter

From: rivantsov [email removed]
Sent: Friday, November 09, 2012 5:14 PM
To: pgeerkens@hotmail.com
Subject: Re: Disambiguating tokens [irony:402457]

From: rivantsov

easy man, pleaase... before accusing me of 'forcing to adopt inelegant usages', think of what you are suggesting. Here's you initial post about sorting that started this issue on stable sort:

Where are you sorting the token definitions? Perhaps use of an unstable sort is causing the problem, since an implied prioirty based on definition order would be suitable.

'..based on definition order'. WHAT is definition order?! the order in source code lines that declare terminals?! This does not exist for parser builder. Parser builder collects terminals from the rules trees starting from the root. The initial order of terminal list is therefore unpredictable. To force certain order, I provide Priority property. There is no 'initial' or definition order, period. And using 'stable' sort does not have any meaning.

Believe me, spent a lot of time thinking and tweaking all these issues.

Regards

Roman