Error recovery and partial AST?

Oct 22, 2011 at 9:38 AM

Hi Roman,

I have recently been working on a full parser for the HLSL/GLSL shader languages using Irony (with GOLD Parser for the lexer), with an output AST, semantic analysis (type matching, method overload resolution, etc.), code rewriting, and more. So far, results are great. It allows me, for example, to share a core abstract grammar/AST construction between both languages (grammar inheritance is fantastic in this case!), which makes everything downstream a lot easier.

But now that I would like to build a nice code-editing plugin for Visual Studio, with intellisense, completion, error reporting, etc., I'm having some trouble figuring out how to access a partial tree when parsing errors have been successfully recovered (using ErrorRule and the SyntaxError hint). When I look at C# under VS with ReSharper (I heard it was developed with ANTLR), their parser is able to handle errors while still giving completion and access to method definitions, parameters, local variables, etc.

I'm not sure there is any good practical way to handle such a common issue in syntax highlighting with intellisense. Having access to a partial AST would make things easier, but I'm not sure it is possible with Irony, or even that it is the correct way to handle this...

Do you have any idea how to solve this kind of problem?

Coordinator
Oct 23, 2011 at 7:30 PM
Edited Oct 23, 2011 at 7:31 PM

Hi

I understand your concern about error recovery. It is certainly not up to the task for intellisense. Here's some explanation of what's there currently. 

The error recovery was implemented following the classic recipes (the Dragon book and others), which mostly target batch-mode compilation, where the main purpose is to leave the error spot in a more-or-less stable state and move on to the rest of the source file in an attempt to catch more errors. The basic technique is the error rule, which basically says "try to skip forward to the next 'End' symbol that closes the statement, and try starting again". In C#, the key symbols are ";" and "}". That works OK when compiling long files, but not when parsing a live copy being edited by the user.
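For reference, this is roughly how that recovery is wired up in an Irony grammar today (a minimal sketch; it assumes the code sits inside a Grammar subclass with "using Irony.Parsing;", and Stmt/Block are made-up non-terminals for the example):

// Classic panic-mode recovery declared in an Irony grammar.
// SyntaxError is the predefined terminal on Irony's Grammar class;
// Stmt and Block are hypothetical non-terminals for this example.
var Stmt = new NonTerminal("Stmt");
var Block = new NonTerminal("Block");

// On a syntax error inside a statement, skip forward to the next ";".
Stmt.ErrorRule = SyntaxError + ToTerm(";");
// Likewise, recover at the closing "}" of a block.
Block.ErrorRule = SyntaxError + ToTerm("}");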

As an example, let's say a user is typing a C# class, from the beginning:

namespace ns { 

    public class MyClass {   

        public MyClass() {  

             <- user is here  

When the user stops typing here (without typing the closing braces for the method/class/namespace) and tries to invoke intellisense with Ctrl-Space, Irony's recovery would fail. The rule "find the closing } or ;" will not find anything.

So, we need something better - some method that can reasonably well round up the unfinished constructs and produce a parse tree (however mis-shaped), and possibly an AST.

Apparently, we need an algorithm that can inject missing tokens. It looks like that's the way it works in VS for C# - look at the just-published Roslyn project preview document (http://msdn.microsoft.com/en-us/hh500769); they have a special type for these injected "compensating" tokens.

I would like to proceed with this and implement this enhanced recovery mechanism. Here's the way I see it.

First, the task is to produce some parse tree (no AST), even if in really bad shape. The second task is to build the AST from this mis-shaped parse tree. These appear to me to be more-or-less independent tasks.

But before all that, I need to complete the separation of AST construction from parsing. Currently, the AST nodes are built right during the parsing process, which does not seem quite right at the end of the day, for many reasons. So I planned, and have actually started, taking AST production out of parsing and putting it into a separate process. Once this is complete (hopefully very soon, in the next or after-next code update), we can start looking at error recovery. As for the particular mechanism for extra-token injection (deciding which token to insert), I think we can use two facilities in the parser. One is the "expected" tokens list in each parser state (this is used in producing the error report, the list after "expected: ..."). The other is a stack of braces (parentheses) that have been opened but not yet closed - this is what would in fact save us in the C# editor situation described above.
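To make the idea concrete, here is a rough sketch of the injection step (hypothetical code, not current Irony API - the helper name and inputs are made up):

using System.Collections.Generic;

// Hypothetical sketch of token injection - not current Irony API.
// Given the "expected" terminal names of the current parser state and the
// stack of still-open braces, pick a token to fabricate instead of failing.
static string PickCompensatingToken(IList<string> expectedTerminals, Stack<char> openBraces)
{
    // First choice: close the innermost unclosed brace - this is what
    // rescues the half-typed class/method example above.
    if (openBraces.Count > 0)
    {
        switch (openBraces.Peek())
        {
            case '{': return "}";
            case '(': return ")";
            case '[': return "]";
        }
    }
    // Otherwise fall back to the state's expected set (the list shown in
    // error messages after "expected: ..."); a separator is the safest bet.
    if (expectedTerminals.Contains(";")) return ";";
    return expectedTerminals.Count > 0 ? expectedTerminals[0] : null;
}

The fabricated token would be flagged as "missing" (like Roslyn's compensating tokens) so that later phases can tell it apart from real input.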

As for the timing of when this can be done - I can't promise anything. Hopefully soon, this year. And let's start experimenting with facilities to support VS integration - a separate Irony.Integration assembly, perhaps?

Thank you

Roman 

Oct 23, 2011 at 10:26 PM

I haven't played with his project and don't know if it even works, but Ben Cooley started a sub-project at the end of last year to create Visual Studio integration for Irony. The source for his project can be found here: https://github.com/benjcooley/sable-fx/tree/master/Source/VsIntegration. I also doubt that it is based on the most recent release of Irony, but it's probably a good start for an official implementation.

Coordinator
Oct 24, 2011 at 3:29 AM

Thanks for the link. It appears to be non-moving - the last check-in was in January. Will look at it.

Oct 24, 2011 at 9:25 AM

Hi,

That's great that you already have lots of ideas on how to integrate it!

Yep, I knew about Roslyn but hadn't taken the time to look at the doc... and that's exactly the kind of problem we have here. One thing is that they not only have a feature to insert missing tokens, but they also talk about a second step that skips unexpected tokens. Even if you only skip up to a certain number of tokens, it is not the same as skipping to the next ";" or "}", so we probably need something more elaborate here.

For example, in C# if you type:

public void toto()
{
    toto x e z f g
    int x;
}

it will be able to see that "toto x" looks like the start of a new variable declaration with an unknown type "toto", that "e z f g" are garbage tokens to be skipped, and that "int x" then parses as another valid declaration statement - at which point it complains that a variable x is already declared in the scope...
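A bounded skip of that kind might look like this (again just a sketch with made-up names, not Roslyn's or Irony's actual code):

using System.Collections.Generic;

// Discard at most maxSkip unexpected tokens; if none of them lets the
// parse resume, give up and fall back to coarse ";"/"}" recovery.
static bool TrySkipGarbage(Queue<string> input, ISet<string> expected, int maxSkip)
{
    for (int skipped = 0; skipped < maxSkip && input.Count > 0; skipped++)
    {
        if (expected.Contains(input.Peek()))
            return true;          // found a token the parser can accept
        input.Dequeue();          // treat it as garbage, like "e z f g" above
    }
    return false;                 // budget exhausted - use classic recovery
}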

Concerning Visual Studio integration, with the MEF extension points introduced in VS2010 it's fairly easy to plug into the editor... The older API required implementing a language service, which was quite painful to set up correctly. Also, I'm not convinced that Irony should provide an infrastructure for this... it is not the most difficult part of the problem.
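For illustration, the MEF entry point for the first (highlighting) pass is about this small under the VS2010 editor API - the "hlsl" content type and the IronyClassifier class are assumptions for the example:

using System.ComponentModel.Composition;
using Microsoft.VisualStudio.Text;
using Microsoft.VisualStudio.Text.Classification;
using Microsoft.VisualStudio.Utilities;

[Export(typeof(IClassifierProvider))]
[ContentType("hlsl")] // assumed content type for this example
internal sealed class IronyClassifierProvider : IClassifierProvider
{
    public IClassifier GetClassifier(ITextBuffer buffer)
    {
        // One classifier per buffer; IronyClassifier is a hypothetical
        // IClassifier that runs the fast tokenizer pass over the buffer.
        return buffer.Properties.GetOrCreateSingletonProperty(
            () => new IronyClassifier(buffer));
    }
}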

I will also try to set up a repository clone of Irony, as I made a couple of dirty hacks (like abstracting the Scanner) that could otherwise prevent me from contributing to it or merging your future work...

Thanks!

Coordinator
Oct 27, 2011 at 5:27 PM

There's a lot of room for improvement in this area, no doubt. I know that language integration was simplified greatly in VS2010, but I haven't looked into the details yet. Still, I think it would be useful to provide some out-of-the-box support for this in Irony - as a separate project or just a few classes, we'll see.

I would also love to have a look at your scanner, and see if I can provide something similar (a very-fast scanner) in Irony. The trouble with these scanners (with source generated from regexes and the like) is that they work well for "simplified" terminals (like an identifier that is simply a sequence of letters or digits starting with a letter), but become increasingly difficult in real-language cases where you have extras like character escaping, or prefixes and suffixes that alter the behavior - see the C# identifier definition. Still, considering that most identifiers in real-life programs are really "simple", it might be worth creating some super-fast reduced version of the scanner and having it fall back to full-scanning mode when it detects "complications".
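A minimal sketch of that fast-path/fallback split, assuming an ASCII-only fast lane (hypothetical code, not Irony's):

// Fast path for "simple" identifiers: plain ASCII letters/digits/underscore.
// Returns the matched length, or -1 to tell the caller to fall back to the
// full, feature-complete terminal (escapes, @-verbatim ids, Unicode, ...).
static int ScanSimpleIdentifier(string text, int start)
{
    int i = start;
    if (i >= text.Length || !(IsAsciiLetter(text[i]) || text[i] == '_'))
        return -1;                       // not a simple identifier start
    while (i < text.Length)
    {
        char c = text[i];
        if (c == '\\' || c > 127)
            return -1;                   // "complication" - use the full scanner
        if (!(IsAsciiLetter(c) || (c >= '0' && c <= '9') || c == '_'))
            break;                       // end of identifier
        i++;
    }
    return i - start;
}

static bool IsAsciiLetter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}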

The trouble now is that I'm really swamped at work (and busy at home too :), and I don't know when I can spare enough time for seriously digging into this stuff.

thanks

Roman

Oct 30, 2011 at 2:18 PM

The scanner I'm using is simply a DFA scanner driven by the output of a GOLD Parser grammar similar to this one:

{String Char} = {Printable} - ["]
{Quote} = ['']
{IdLetter} = {Letter} + [_$]
{IdAlphaNumeric} = {Alphanumeric} + [_$]
{HexDigit} = {Digit} + [abcdefABCDEF]
{OctalCharSet} = [01234567]
{NoZeroDigit} = [123456789]
{LongTypeSuffix} = [lL]
{FloatTypeSuffix} = [dfDF]
{ExponentPartIndicator} = [eE]
{Sign} = [-+]
{CharSign} = [abtnfr"\] + {Quote}
{CharSign1} = {String Char} - [\]
{HexEscapeSign} = [uUxX]
{WS}        = {Whitespace} - {CR} - {LF}

Identifier     = {IdLetter}{IdAlphaNumeric}*
StringLiteral  = '"'( {String Char} | '\'{Printable} )*'"'
FloatingPointLiteral   = {Digit}+'.'{Digit}*{FloatTypeSuffix}? | {Digit}+{FloatTypeSuffix} | '.'{Digit}+{FloatTypeSuffix}?
FloatingPointLiteralExponent = {Digit}+'.'{Digit}+{ExponentPartIndicator}{Sign}?{Digit}+{FloatTypeSuffix}? | {Digit}+{ExponentPartIndicator}{Sign}?{Digit}+{FloatTypeSuffix}? | '.'{Digit}+{ExponentPartIndicator}{Sign}?{Digit}+{FloatTypeSuffix}?
IndirectCharLiteral = {Quote}{CharSign1}{Quote}
StandardEscapeCharLiteral = {Quote}'\'{CharSign}{Quote}
OctalEscapeCharLiteral = {Quote}'\'{OctalCharSet}+{Quote}
HexEscapeCharLiteral = {Quote}'\'{HexEscapeSign}{HexDigit}+{Quote}
StartWithNoZeroDecimalIntegerLiteral = {NoZeroDigit}{Digit}*{LongTypeSuffix}?
StartWithZeroDecimalIntegerLiteral = '0'{LongTypeSuffix}?
HexIntegerLiteral = '0'('x'|'X'){HexDigit}+{LongTypeSuffix}?
OctalIntegerLiteral = '0'{OctalCharSet}+{LongTypeSuffix}?

The DFA scanner is very efficient; it can scan 40,000 tokens in 10 ms, which is perfect for VS syntax highlighting, as the first tokenizer pass must be really efficient (this is the pass that runs when you load a source file and it is immediately highlighted). The second pass is usually parsing plus semantic analysis (which can also be a pass of its own), and these run in a separate thread once the user stops modifying the source code (or presses special keys like "." to open member completion, etc.). When profiling Irony, I found that most of the time is spent in the scanner, so it is one of the most performance-critical parts.
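For reference, the core of such a table-driven DFA scanner is a tight loop like the one below (the transition and accept tables here are assumptions for the sketch; a GOLD .cgt file essentially just fills them in):

using System;

// Table-driven DFA loop: next[state, charClass] < 0 means "no transition";
// accept[state] is the token kind accepted in that state, or null.
// Implements longest-match by remembering the last accepting state seen.
static bool NextToken(string text, int pos, int[,] next, string[] accept,
                      Func<char, int> classOf, out string kind, out int length)
{
    int state = 0, lastAccept = -1, lastEnd = pos;
    for (int i = pos; i < text.Length; i++)
    {
        int target = next[state, classOf(text[i])];
        if (target < 0) break;           // dead end - stop scanning
        state = target;
        if (accept[state] != null)       // longest match so far
        {
            lastAccept = state;
            lastEnd = i + 1;
        }
    }
    kind = lastAccept >= 0 ? accept[lastAccept] : null;
    length = lastEnd - pos;
    return lastAccept >= 0;
}

The whole scan is two array lookups per character, with no per-terminal trial matching, which is presumably where the speed difference against a hand-coded terminal dispatch comes from.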

Also, as you can see, the way GOLD Parser describes its lexical rules is quite versatile. Basically, it would be great if Irony allowed specifying the lexical grammar in a way similar to how we specify BNF rules for parsing. I'm sure there are a lot of cases that could be handled efficiently that way.

At the very least, the first step is being able to plug any kind of tokenizer into Irony. To support a pluggable scanner, I currently have a custom LanguageData that inherits from Irony.Parsing.LanguageData. I need this to pass special data from my grammar to the GOLD Parser scanner. Then I have a basic GOLD Parser tokenizer that is plugged into a Scanner implementation. I put a virtual method on the language data that returns an instance of a Scanner for this particular language data. This is not ideal, but it was the easiest way to plug it in.
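Roughly, the hook looks like this (LanguageData and Grammar are Irony types; everything else - the GOLD table type, the scanner subclass, the CreateScanner factory - is specific to my hack and assumes my patched Irony where the Scanner is abstracted):

// Sketch of the pluggable-scanner hack described above.
public class GoldLanguageData : Irony.Parsing.LanguageData
{
    public GoldLanguageData(Irony.Parsing.Grammar grammar) : base(grammar) { }

    // Extra tables the GOLD DFA tokenizer needs (hypothetical type).
    public GoldScannerTables ScannerTables;

    // Virtual factory the parser calls instead of creating the default
    // Irony scanner; GoldScanner is a hypothetical Scanner subclass.
    public virtual Irony.Parsing.Scanner CreateScanner()
    {
        return new GoldScanner(this);
    }
}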

Coordinator
Nov 2, 2011 at 5:45 PM

thanks for the info.

From what I see, you can specify escapes in strings, but it does not provide the "meaning" of an escape sequence, so the output string contains the raw (unreplaced) escape sequences, right?

Who's responsible for turning escapes (and Unicode numbers) into chars? Do you have to create some after-parse method yourself?

Another thing - how do you specify more advanced things in identifiers (see the C# identifier), like Unicode groups (instead of simply "alphanumeric")? Escapes? Does this identifier allow the use of non-English characters, as in C#?

Not trying to discredit anything, just thinking out loud about how well this can be used to define the tricky terminal sequences found in languages like C#.

I would like to have a peek at the generated scanner file if possible, maybe at your version adjusted for Irony, to understand why it is in fact faster. If confidentiality is an issue, you can send it to me privately through the contact-user feature - I promise not to publish/disclose it. Just to have a look.

thanks

Roman

Nov 3, 2011 at 8:40 AM
rivantsov wrote:

thanks for the info.

From what I see, you can specify escapes in strings, but it does not provide the "meaning" of an escape sequence, so the output string contains the raw (unreplaced) escape sequences, right?

Who's responsible for turning escapes (and Unicode numbers) into chars? Do you have to create some after-parse method yourself?

Indeed, I'm creating raw terminals (and I removed all the custom terminals defined by Irony, as I don't use them) and decode them in custom AstNodeCreators, though the code is pretty tiny... Escape sequences are in fact not used much in HLSL, as strings only appear in annotations. Also, decoding strings is not really important in my case, as I'm not using the AST to run an interpreter but to perform semantic analysis, code transformation, re-emitting code with the comments of the original code, etc. So I need the raw strings as they were written, though I could also run a semantic analysis on the content of a string to check its consistency...
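For what it's worth, the decoding code in an AstNodeCreator really is tiny - something like this hypothetical helper (standard single-character escapes only):

using System.Text;

// Decode standard C-style escapes in a raw string body; unknown escapes
// are left as-is rather than rejected (hypothetical helper, for illustration).
static string DecodeEscapes(string raw)
{
    var sb = new StringBuilder(raw.Length);
    for (int i = 0; i < raw.Length; i++)
    {
        if (raw[i] != '\\' || i + 1 == raw.Length) { sb.Append(raw[i]); continue; }
        char c = raw[++i];
        switch (c)
        {
            case 'n':  sb.Append('\n'); break;
            case 't':  sb.Append('\t'); break;
            case 'r':  sb.Append('\r'); break;
            case '"':  sb.Append('"');  break;
            case '\\': sb.Append('\\'); break;
            default:   sb.Append('\\').Append(c); break; // leave unknown escapes raw
        }
    }
    return sb.ToString();
}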

rivantsov wrote: Another thing - how do you specify more advanced things in identifiers (see the C# identifier), like Unicode groups (instead of simply "alphanumeric")? Escapes? Does this identifier allow the use of non-English characters, as in C#?

Not trying to discredit anything, just thinking out loud about how well this can be used to define the tricky terminal sequences found in languages like C#.

I'm not sure that GOLD Parser provides support for Unicode groups - it does support Unicode, but I guess you need to specify the ranges yourself.

Also, the concept of a character set would be quite easy to port to Irony and extend to support Unicode groups, etc.

Anyway, the way GOLD Parser defines character sets/rules for tokens is not the "golden way", but I'm pretty sure Irony could build on this concept to write lexer rules for tokens instead of having a custom C# class for matching characters, allowing a more efficient DFA scanner for tokens.

rivantsov wrote:

I would like to have a peek at the generated scanner file if possible, maybe at your version adjusted for Irony, to understand why it is in fact faster. If confidentiality is an issue, you can send it to me privately through the contact-user feature - I promise not to publish/disclose it. Just to have a look.

Sure, I will try to set up a comparison sample next week. I don't remember the exact performance difference, but for tokenizing alone the DFA scanner was much faster (more than 10x), and using it inside Irony - letting Irony match grammar rules and create the AST - I got something like a 4-5x overall speedup, which is really important at this level (having a large source file tokenized in less than 10 ms instead of 100 ms leaves more time for grammar parsing, semantic analysis, etc.).

Coordinator
Nov 3, 2011 at 5:22 PM

About defining terminals Gold-style, using rules - I don't see that it's really necessary. The fact is, there are only a few terminal kinds (identifier, number, quoted string), and Irony provides these out of the box, with many customization options - which I suspect Gold-style cannot match, at least some of them. So what's the point of switching to describing IdentifierTerminal with a relatively complex set of rules when you can just say "new IdentifierTerminal(...)"?

Where we fall behind is scanner performance, which I think we can attack independently, finding some way to speed things up for inputs that don't involve fancy stuff. I'm looking forward to your sample; let's see what we can do to match the speed.

Roman