Parser slow to initialize, stuck in `Transition.IncludeTransition`

Sep 27, 2012 at 3:30 PM

I'm writing a parser for the PowerShell grammar. Creating the Parser object is taking a long time. Breaking in to the debugger, I see we're spending a lot of time in Transition.IncludeTransition. 

It takes about 18 seconds in a Release build running not under the debugger, on my relatively slow computer. ParserData.States has about 600 states.

What options are available to me to improve this performance?

I wonder if Irony's approach of building the parser at runtime is a bad fit for what I'm doing?

If you need to look at my code, the Rules are here: https://github.com/JayBazuzi/Pash2/blob/master/Source/Pash.System.Management/Pash/ParserIntrinsics/PowerShellGrammar.cs

Coordinator
Sep 27, 2012 at 5:49 PM

oh man, sorry, but this stuff is quite messed up. I downloaded the entire thing, and after a few troubles and workarounds finally managed to load and compile the solution. I added GrammarExplorer to the solution, built all, launched it and tried to load your Pash grammar into GrammarExplorer. First thing - you binary assembly is call System.Management.Automation (System?!), while project is Pash.System.Management - why? had to go back and look at project properties to see what's going on. 

Attempt to load Powershell grammar blew up with exception - no public parameterless constructor found. Looked in the grammar - there is constructor but it is private. Apparently, you never loaded and analyzed the grammar in Grammar Explorer - do this first!!! 

When I finally loaded all grammars in your assembly, it showed me an error on all but one of them: "NonTerminal X has uninitialized Rule property". All except InteractiveINput grammar - it spend about 4 seconds thinking, then threw up a whole bunch of conflicts. So my guess is: it's not really parser construction that takes long time, it is parser builder going around and reporting all errors. 

So the conclusion is - do the Grammar analysis with GrammarExplorer first!

Clean it up, fix the conflicts, try parsing samples.

As a side note: what's this all stuff with static class constructor and assigning nonterminals thru reflection?! I'm afraid you are too fancy here, and adding many more troubles in debugging and maintenance. The general recommendation for static constructor is - use it only if you really need it, and do as little as possible there. And keep in mind that Regex terminal is really slow, you should explore other options and use it only if nothing else really works - including custom terminal (your own custom class). 

Roman

Sep 27, 2012 at 6:11 PM

Thanks for the quick response, Roman.

OK, I will try GrammarExplorer and see where that gets me. I've never done that before, as you guessed.

To answer some of your questions:

I am writing an Open Source clone of PowerShell, which is why you see things in the System namespace.

I'm not sure why you weren't able to load the solution and compile. Which version of Visual Studio do you use? (I use VS2012 Express). What issues did you see / what workaround did you apply? I want to fix this. I should probably set up a Virtual Machine with a clean OS to test that on. 

The static stuff is going away - already fixed in my development branch, but not merged to master yet.

The reflection is to avoid the duplication with the name of the rule & the variable containing the rule:

    NonTerminal Foo = new NonTerminal("Foo"); // duplication

I use a lot of RegexTerminal because I receive the lexical grammar as a series of production rules in the language specification, which are easy to translate to regular expressions. However, the resulting parse tree is hard to consume. The more complex RegexTerminals will become NonTerminals.

I understand that the RegexTerminal is slow, and I'm OK with that for now. Startup performance matters to me much more than parsing performance. It's an interpreted language: users will spend far more time waiting for commands to run than waiting for parsing. But I'll get to that eventually, too.

Sep 27, 2012 at 7:23 PM

re:

The reflection is to avoid the duplication with the name of the rule & the variable containing the rule:

    NonTerminal Foo = new NonTerminal("Foo"); // duplication

C# is a great language, but I do sometime miss old fashioned C macros.
;-)

Coordinator
Sep 27, 2012 at 8:21 PM

you gotta be kidding... to go on reflection mess and huge static constructor for THIS?!!!

Sep 27, 2012 at 9:05 PM
pgeerkens wrote:

re:

The reflection is to avoid the duplication with the name of the rule & the variable containing the rule:

    NonTerminal Foo = new NonTerminal("Foo"); // duplication

C# is a great language, but I do sometime miss old fashioned C macros.
;-)

Yes, C macros are almost ideally suited to parser generation, and highly troublesome almost everywhere else!

Sep 27, 2012 at 9:14 PM
rivantsov wrote:

you gotta be kidding... to go on reflection mess and huge static constructor for THIS?!!!

As I mentioned, the static constructor is gone in my working branch.

The reflection is straightforward if you like reflection, and ugly if you don't like reflection. I guess you're in that second camp, and there's nothing wrong with that. 

I had considered suggesting an enhancement to Irony where it reflects over the user's grammar to discover the names of the rules, instead of requirement to be hard-coded, but I'm guessing you wouldn't go for that!

I'm open to other suggestions as to how I can cleanly meet the requirements of the Irony API without that duplication and the risks that carries.

Coordinator
Sep 27, 2012 at 9:31 PM

sorry for the outcry, didn't mean to offend anyone. I do like reflection and use it a lot. But i don't think it's use is justified here. Local var name matching to term name - does not matter at all. You can name your variables x1, x2, x3, while keeping descriptive terminal names in constructors - Irony does not care. Maintaining grammar construction code would be a problem - for you as a programmer, but parser does not care. Local variable names disappear at runtime, even reflection metadata and IL does not keep them. And you can even name the terminals with whatever name you want - it does not matter. Irony does not look at these names, they are used only for displaying information (state table) in grammar explorer. So eventual mismatch is no problem at all, and certainly not worth doing reflection. 

Roman

Coordinator
Sep 27, 2012 at 9:42 PM

... that is if you are concerned about eventual mismatch.

If you're just 'lazy' :) to type "foo" twice... Well, I think it's not that big a deal, provided that mistakes do not do any harm.

To be honest, I was annoyed by this too in the early times, and tried to think up some way to avoid it, but did not come up with anything simple and decided to put it to rest.

Roman

Sep 27, 2012 at 10:28 PM
Edited Sep 27, 2012 at 10:28 PM

About the third time one loses 40 minutes  chasing down the wrong (mis-named due to an incomplete Copy/Paste) malformed non-terminal, one seriously contemplates using Reflection to eliminate the double typing. I was actualy considering what the effort in building a pre-processor would be.
;-)

More seriously: It is a wart on a beautiful face; but not worth cutting the nose off for.

Sep 27, 2012 at 11:08 PM

Roman, I want to be clear that I am deeply grateful that you have devoted your energy to this project, giving the world the gift of parsing. Thank you! It sure saves me a ton of work, which means I can give a gift of a Free PowerShell that works on Linux & Mac.

We're getting a little off topic from my original question (startup performance; the ball is my court to do the steps you laid out for me), but while we're here:

How do you inspect a parse tree?

if (parseTreeNode.ChildNodes[0] ???)

One way is to use the name field:

 

if (parseTreeNode.ChildNodes[0].Term.Name == "Foo")

But that requires duplicating the name in the grammar and the client code, and that is of course error-prone. I'm already good at making bugs without extra help!

 

My approach is to make the BnfTerms fields, not locals, so clients can compare them directly:

 

class MyGrammar : Grammar
{
    public readonly NonTerminal Foo;
}

/// ------------------

MyGrammar grammar = ...
ParseTreeNode parseTreeNode = ...

if (parseTreeNode.ChildNodes[0].Term = grammar.Foo)
{
    ...
}

So now the Name field of a BnfTerm is not important to me, but the name of the field containing it is important. One unfortunate side effective of this approach is that the Rule definition and the field declaration are far apart from each other.

Maybe there's an Irony idiom that I've missed. 

 

Sep 28, 2012 at 1:05 AM

I started looking at my grammar under GrammarExplorer, which I've never looked at before. Yes, it prompts with 5 grammar types, but only 1 of them legitimate - InteractiveInput.

80% or so of the grammar errors have to do with optional newlines. The PowerShell grammar is very particular with newline & whitespace treatment. Newlines are part of the grammar, but have no semantic purpose. When faced with a conflict around newlines, it doesn't matter what Irony picks. The non-newline conflicts are more interesting, but I guess I will have to work through them all to get GrammarExplorer to give me a green light.

Eventually I'll have to insert regular whitespace in to the grammar, as well, since there are a few places where whitespace is prohibited between tokens, e.g.:

PS> [System.Int32]::MaxValue       # OK
2147483647
PS> [System.Int32] ::MaxValue      # ERROR
Unexpected token
But I'm quietly ignoring that problem for now.

Sep 28, 2012 at 9:47 AM

You can take over explicit handling of all whitespace by adding this override to your Grammar definition:

public override void SkipWhitespace(ISourceStream source) { return; }

I have had to do this with my parser for ABC notation.

Sep 28, 2012 at 3:27 PM
pgeerkens wrote:

You can take over explicit handling of all whitespace by adding this override to your Grammar definition:

public override void SkipWhitespace(ISourceStream source) { return; }

I have had to do this with my parser for ABC notation.

I have overridden SkipWhitespce already, because PowerShell has some slightly unusual whitespace ideas.

whitespace:
    Any character with Unicode class Zs, Zl, or Zp
    Horizontal tab character (U+0009)
    Vertical tab character (U+000B)
    Form feed character (U+000C)
    `   (The backtick character U+0060) followed by new_line_character

new_line_character:
    Carriage return character (U+000D)
    Line feed character (U+000A)
    Carriage return character (U+000D) followed by line feed character (U+000A)

Normally newlines are *not* treated as whitespace, unless they're escaped, in which case a single whitespace can be 3 characters long (`\r\n).

However, there are 4 points where whitespace is prohibited in a production rule. 

member_access: Note no whitespace is allowed between terms in these productions.
    primary_expression   .   member_name
    primary_expression   ::   member_name

So I need a way to express that. The only way I know of right now is to add whitespace between every term in the entire grammar, except for the 4 locations where it's prohibited. That seems like a lot of trouble, so I'm putting it off until some more useful things get done.

Sep 28, 2012 at 7:57 PM
JayBazuzi wrote:

However, there are 4 points where whitespace is prohibited in a production rule. 

member_access: Note no whitespace is allowed between terms in these productions.
    primary_expression   .   member_name
    primary_expression   ::   member_name

That is surprising, as most C-style languages allow white-space there. If you èxpanded` the grammar to allow this, does it break something elseÉ

Coordinator
Sep 28, 2012 at 8:05 PM

about these 2 places where whitespace is prohibited. I suspect it's just the way the MS powershell parser works, just an accident. I think it's OK to allow whitespace in your parser (so it is more 'forgiving'). However, if you want to disallow it, I think the best way is to allow it in the grammar rules, but to verify it after parsing member_access. Just hook to 'Reduced' event of member_access, and check child nodes of the parsed node - verify that there's no gaps between. 

Roman

Sep 28, 2012 at 8:53 PM

That text about "no whitespace is allowed" is a quote from Microsoft's language spec, not something I invented. But maybe you're right - maybe I could be forgiving, at least for now.

I said there are 4 points with whitespace prohibition, but I only gave one of them. 

    generic_token_with_subexpr:
        No whitespace is allowed between ) and command_name.
        generic_token_with_subexpr_start   statement_list_opt   )   command_name

    member_access: Note no whitespace is allowed between terms in these productions.
        primary_expression   .   member_name
        primary_expression   ::   member_name

    element_access: Note no whitespace is allowed between primary_expression and [.
        primary_expression   [  new_lines_opt   expression   new_lines_opt   ]

    invocation_expression: Note no whitespace is allowed between terms in these productions.
        primary_expression   .   member_name   argument_list
        primary_expression   ::   member_name   argument_list

And I think I found a bug in the grammar which leads to a 5th, but I don't need to bother you with it right now. If you're really curious, see: https://github.com/JayBazuzi/Pash2/issues/8

Oct 3, 2012 at 12:30 AM

Just an FYI: it should now be much easier to compile Pash on a clean machine, mostly because I added NUnit in to the repository. You still have to install gtk#, or unload or ignore the project that depends on it.

Also, the static ctor is now gone. 

I haven't resolved all the performance issues, shift/reduce conflicts, reduce/reduce conflicts, and whitespace handling trickiness I raised earlier, because I've been busy, but I'm grateful for the guidance offered here, which will help when I work on those things.

Apr 7, 2013 at 10:07 PM
Edited Apr 8, 2013 at 1:15 AM
I worked on implementing whitespace prohibition as you suggested. As an experiment, I wrote this code:
 _member_access_or_invocation_expression.Reduced += delegate(object sender, ReducedEventArgs e)
{
    var primaryExpressionSpan = e.ResultNode.ChildNodes[0].Span;
    var operatorSpan = e.ResultNode.ChildNodes[1].Span;

    if (SourceLocation.Compare(primaryExpressionSpan.Location + primaryExpressionSpan.Length, operatorSpan.Location) != 0)
        throw new Exception("No whitespace allowed");
};
Sure enough, the exception does throw when there is a whitespace where it's not allowed. Great.

Is there a mechanism I should be using to turn this in to a parse error, instead of throwing an exception?
Coordinator
Apr 8, 2013 at 3:15 AM
e.Context.AddParserError(...)
Apr 8, 2013 at 3:33 AM
Thanks!