Suggestion for custom Tokenizer/Scanner and Ast

Sep 9, 2011 at 9:52 PM

While investigating to implement preprocessor handling with Irony, I found that while the default scanner is suitable for most grammars, It is not always as efficient as a pure DFA tokenizer (or even a handwritten tokenizer). For example I was able to achieve a speedup of 50x by writing a highly specialized tokenizers and in certain condition, It is necessary to have this kind of custom scanner.

But I had to make some changes to default irony scanner in order to be able to plug my own scanner. It would be nice also If the Parser class to take an external scanner in the constructors instead of instantiating the default one.

I have implemented a ScannerBase to extract the core methods of scanner like this:

#region License
/* **********************************************************************************
 * Copyright (c) Roman Ivantsov
 * This source code is subject to terms and conditions of the MIT License
 * for Irony. A copy of the license can be found in the License.txt file
 * at the root of this distribution. 
 * By using this source code in any fashion, you are agreeing to be bound by the terms of the 
 * MIT License.
 * You must not remove this notice from this software.
 * **********************************************************************************/

using System.Collections.Generic;

namespace Irony.Parsing
    /// <summary>
    /// Scanner base class. The Scanner's function is to transform a stream of characters into aggregates/words or lexemes, 
    ///   like identifier, number, literal, etc.
    /// </summary>
    public abstract class ScannerBase
        #region Constructors and Destructors

        /// <summary>
        /// Initializes a new instance of the <see cref="ScannerBase"/> class.
        /// </summary>
        /// <param name="parser">The parser.</param>
        public ScannerBase(Parser parser)
            Parser = parser;
            Data = parser.Language.ScannerData;
            Grammar = parser.Language.Grammar;


            // create token streams
            var tokenStream = GetUnfilteredTokens();

            // chain all token filters
            Grammar.CreateTokenFilters(Data.Language, Context.TokenFilters);
            foreach (TokenFilter filter in Context.TokenFilters)
                tokenStream = filter.BeginFiltering(Context, tokenStream);

            Context.FilteredTokens = tokenStream.GetEnumerator();


        #region Public Properties

        /// <summary>
        /// Gets the scannar data.
        /// </summary>
        public ScannerData Data { get; private set; }

        /// <summary>
        /// Gets the parser.
        /// </summary>
        public Parser Parser { get; private set; }


        #region Properties

        /// <summary>
        /// Gets the context.
        /// </summary>
        protected ParsingContext Context
                return Parser.Context;

        /// <summary>
        /// Gets or sets the grammar.
        /// </summary>
        /// <value>
        /// The grammar.
        /// </value>
        protected Grammar Grammar { get; set; }


        #region Public Methods

        /// <summary>
        /// Begins the preview.
        /// </summary>
        public abstract void BeginPreview();

        /// <summary>
        /// Ends the preview.
        /// </summary>
        /// <param name="keepPreviewTokens">if set to <c>true</c> [keep preview tokens].</param>
        public abstract void EndPreview(bool keepPreviewTokens);

        /// <summary>
        /// Gets the next token.
        /// </summary>
        /// <returns>A Token</returns>
        public Token GetToken()
            // get new token from pipeline
            if (!Context.FilteredTokens.MoveNext())
                return null;

            var token = Context.FilteredTokens.Current;
            if (Context.Status == ParserStatus.Previewing)

            return token;


        #region Methods

        /// <summary>
        /// Gets the unfiltered tokens.
        /// </summary>
        /// <returns>An enumeration of unfiltered tokens.</returns>
        protected abstract IEnumerable<Token> GetUnfilteredTokens();

        /// <summary>
        /// Prepares the input.
        /// </summary>
        protected abstract void PrepareInput();


But this is quick patch is probably not ideal (For example, I kept the constructor in order to avoid a larger refactoring, but constructor needs to call virtual method to initialize. Probably It would be better to have an overridable Initialize method on ScannerBase?)

About a c preprocessor implementation in Irony, in order to integrate an efficient preprocessor handler, I tried so far the following to make it work:

  • First I defined two grammars: One for the language to be parsed, one for the preprocessor.
  • The preprocessor grammar is only able to process a single preprocessor line. This grammar is creating a simple Ast for this line (Ast is mostly usefull to evaluate the if (condition) preprocessor directive for example). This parser is using a custom scanner in order to achieve greater performance.
  • The language grammar has a terminal that is able to match a preprocessor line as a whole token (a preprocessor line can be spawned into multiple lines).
  • A specialized preprocessor token filter is used in the language grammar using internaly the preprocessor parser/grammar to decode preprocessing directive and to analyze the flow of tokens in order to perform macro expansions...etc. The token filter is also responsible to map tokens retrieved from the preprocessor grammar and convert them to the language grammar (For example, the preprocessor doesn't care about the keyword "true" and treat it as a pure "text identifier", but this token true must be mapped to the tokens used by the language grammar, like boolean terminal or whatever...)

But I'm not entirely sure that this is the correct way to integrate it in Irony (both for the Scanner and the Preprocessor) , what do you think Roman?

Concerning the Ast, the basic infrastructure AstNodeCreator is mostly what I'm using. It would be nice to move most of the default Ast implementation to an external assembly in order to make Irony more modular. Then Irony core parser assembly would have the minimal infrastructure to build an Ast (mostly the IBrowsableAstNode and probably Ast for literal values).

Btw, Irony is great, I'm a huge fan from the beginning, but It definitely needs more love to code comments and code layout! ;p... don't know if you could add a suitable integrated ReSharper/GhostDoc/StyleCop into your toolchain, but that could be worth! ;)

Sep 12, 2011 at 4:34 AM


About Scanner abstraction - completely agree, we should refactored it into something easier customizable or replaceable. Your base class is a good starting point. I think we should start maybe even lower, with ultimate abstraction - seeing Scanner as just a source of tokens, some ITokenStream interface. Then implement it at different levels, like you suggest. I will give it a serious thought, when I'm back to this area. Will definitely welcome any advice or feedback. 

Separating interpreter/AST from parsing functionality - already done, will be in the next code drop. I agree, it will work better this way. 

c preprocessor - big thing. Had been thinking lately about this as well. Before I comment on your solution, let's separate several distinct sub-types of preprocessor commands, which are really different:    

1. conditional compilation commands like #ifdef (#if in c#) - assume we target logical expressions 

2. macro definitions 

3. macro expansions 

4. everything else  

So regarding particular design solutions, it's better to be precise about which subtype we're talking about. I assume you suggest to have an extra grammar for ALL of the subtypes. I would suggest something to handle subtypes separately. For conditional compilation - let's allow to define an extra grammar for argument expressions, make it "pluggable", with some evaluation engine.

For #if version in c#, I think we don't need real interpreter with AST nodes. We can do it by evaluating directly the tree (iterating the tree) - just like SQL Search grammar does generating the SQL; the only difference is that we need to produce bool value instead of string. So I suggest to have a preprocessor terminal into which you can hook a custom grammar/expr evaluator for conditional compilation. These would be different for c, c# or assembly.

Implementing skip/include source fragments. As far as I understood, you are planning to use a token filter for this, which would exclude token from "inactive" fragment. I don't think this would work. The problem is that the inactive fragment does not necesserily contain parseable/scannable text, it might be any garbage text. So the "#if.." terminal should fast-forward source position beyond any inactive fragments, so that scaner never sees them and never tries to tokenize them. 

The same goes for macro expansions - I do not think token filter may be used here. Macro expansion in c (and alike language) should be done in source text. You cannot tokenize in advance the expanded macro once and then reuse it. First, it might not be possible to tokenize at all the text with "unexpanded" macro in the context of language grammar. Secondly, often scanner asks for parser context (current state) to figure out how to scan the current input. This is quite important facility which is essential in many languages. With pre-tokenized macro you can't use this. 

But expanding macro into plain text poses another problem - how to "embed" this generated fragment into the input text. Modifying the source text itself is obviously not a good idea - we'll start generating a lot of "very long" strings. The alternative can be some "chained" SourceStream, when we create temp SourceStream containing macro expansion. This temp object is backed up by original source stream, so it lets scanner transparently go thru macro expansion, and step beyond it into original stream. This requires rethinking ISourceStream and Source impl. I still don't know how to do it - would welcome ideas. One thing that must be done is removing Text property from the interface and replacing it with some methods that allow terminals to do the same stuff efficiently enough. (For ex: Text is useful for searching some symbol, like end of comment */; so instead of exposing Text directly, we can expose SearchSymbol method or something). So, here's my take on this. 

Having custom scanner for preprocessor - I don't think that would pay off. Even if this custom scanner is really fast, we won't feel this improvement, as it would be used on smth like 1% of the source lines.

About Style cop - not a big fan of all these tools. But I do understand the importance of consistent style and formatting - and I hope you'd agree I'm consistent in my codebase. Just making the tool enforce certain the style is a trouble - when you have to consiously deviate from standard for some reason, or when you prototype or make quick and dirty run of something. That's why I don't use these tools and do not recommend to anybody. Thanks for the input, let's continue and crack this preprocessor finally.

Sep 12, 2011 at 1:35 PM

Concerning the Scanner abstraction, what I did is removing the reference to SourceStream in the ParsingContext. The abstract Scanner doesn't make any reference to a SourceStream. It has a method to retrieve and set the location, set the source text...etc. I didn't use a minimal implementation as the preview and token filters can be useful as a general behavior...

So far, I have been able to plug a GoldParser DFA scanner inside Irony, and It's working very well, with a performance boost of more than x5 while parsing files. When I will have some times, I would love to have a DFA like scanner expressed in the same way the non-terminal rules are expressed in Irony (with the concept of character set as it is used in GoldParser). What I did also is to setup the required scanner inside a grammar (I setup a property Scanner from the grammar), that is later retrieved by the Parser to get the scanner and set the source text.

Concerning the preprocessing, I had to read the C standard (link to ISO/IEC 9899:TC2). They are talking about translations phases (

  1. Physical source file multibyte characters are mapped, in an implementation defined manner, to the source character set [...]. Trigraph sequences are replaced by corresponding single-character internal representations.
  2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such
    splicing takes place.
  3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
  4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation (, the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
  5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation defined member other than the null (wide) character.
  6. Adjacent string literal tokens are concatenated
  7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
  8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

6.4 section defines also how the preprocessor tokens are defined.

Also if you look at the preprocessor grammar (6.10), they are indeed talking about "pp-tokens" which is a sequence of preprocessor tokens A small extract:

    # include pp-tokens new-line
    # define identifier replacement-list new-line
    # define identifier lparen identifier-listopt ) replacement-list new-line
    # define identifier lparen ... ) replacement-list new-line
    # define identifier lparen identifier-list , ... ) replacement-list new-line
    # undef identifier new-line
    # line pp-tokens new-line
    # error pp-tokensopt new-line
    # pragma pp-tokensopt new-line
    # new-line
    pp-tokensopt new-line

So basically, preprocessor tokens are part of the process, even before macro expansion. When a macro expanssion is occuring, tokens are expanded but they are converted back to text for an immediate retokenization (

After all parameters in the replacement list have been substituted and # and ## processing has taken place, all placemarker preprocessing tokens are removed. Then, the resulting preprocessing token sequence is rescanned, along with all subsequent preprocessing tokens of the source file, for more macro names to replace.

So It is important to have a custom scanner for preprocessor, because this scanner is going to tokenize the whole source code. From my experiment, pure DFA scanner is a lot more (>> x10) efficient compare to the default one in Irony, and when parsing large file (or frequently the same file, when using it in a syntax highlighting plugin), It becomes very important.

What I did is that the scanner is using internally preprocessor tokens, but  when a preprocessor token is recognized, it is mapped to the language token and expected terminal. For example, if there is a keyword "public" in the language, the preprocessor token retrieved internally is the identifier "public" but It is given back to Irony parser as a "public" keyword. The mapping between the preprocessor token and the language grammar is partially defined in the grammar and finalized in the parser initialization (when the parser initialize the Scanner with the Initiailize(Parser) method - I removed the constructor with (Parser) argument in order to support this )

Sep 12, 2011 at 1:47 PM

But I agree that most of the process of running the preprocsesor should be done inside the scanner and not inside irony token-filter, in order to work directly with preprocessor tokens (and not language token that are going through irony token-filters...)