About Scanner abstraction - completely agree, we should refactor it into something more easily customizable or replaceable. Your base class is a good starting point. I think we should start maybe even lower, with the ultimate abstraction - seeing the Scanner as just a source of tokens, some ITokenStream interface. Then implement it at different levels, like you suggest. I will give it serious thought when I'm back in this area. Will definitely welcome any advice or feedback.
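To make the idea concrete, here is a rough sketch (in Python, just for illustration - ITokenStream, Token, and read_token are my placeholder names for the shape of the abstraction, not the actual API):

```python
# Sketch of the "Scanner as a token source" idea. ITokenStream, Token,
# and read_token are hypothetical placeholders, not the real types.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    term: str       # terminal name
    text: str       # matched source text
    position: int   # offset in the source

class ITokenStream(ABC):
    """The ultimate abstraction: anything that can yield tokens."""
    @abstractmethod
    def read_token(self) -> Optional[Token]:
        """Return the next token, or None at end of input."""

class ListTokenStream(ITokenStream):
    """Trivial implementation backed by a pre-built token list; a real
    Scanner (or a chain of token filters) would be other implementations
    of the same interface."""
    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0

    def read_token(self):
        if self._pos >= len(self._tokens):
            return None
        tok = self._tokens[self._pos]
        self._pos += 1
        return tok
```

The parser would then depend only on ITokenStream, so a full scanner, a filter chain, or even a hand-made token list could be plugged in interchangeably.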
Separating interpreter/AST from parsing functionality - already done, will be in the next code drop. I agree, it will work better this way.
c preprocessor - big thing. I've been thinking about this lately as well. Before I comment on your solution, let's separate several distinct sub-types of preprocessor commands, which are really quite different:
1. conditional compilation commands like #ifdef (#if in c#) - assume we target logical expressions
2. macro definitions
3. macro expansions
4. everything else
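As a rough illustration of the split (a Python sketch; the directive sets and the classify helper are my own invention, purely to show the taxonomy, not anything from the codebase):

```python
# Hypothetical classifier mirroring the four subtypes listed above.
# Directive sets are the usual c preprocessor commands; illustrative only.
CONDITIONAL = {"#if", "#ifdef", "#ifndef", "#elif", "#else", "#endif"}
MACRO_DEFINITION = {"#define", "#undef"}
OTHER = {"#include", "#pragma", "#error", "#line"}

def classify(item: str) -> str:
    if item in CONDITIONAL:
        return "conditional"
    if item in MACRO_DEFINITION:
        return "macro-definition"
    if item in OTHER:
        return "other"
    # a bare identifier in the source is a candidate for
    # macro expansion (subtype 3)
    return "macro-expansion"
```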
So regarding particular design solutions, it's better to be precise about which subtype we're talking about. I assume you suggest having an extra grammar for ALL of the subtypes. I would suggest handling the subtypes separately. For conditional compilation - let's allow defining an extra grammar for argument expressions, and make it "pluggable", with some evaluation engine.
For the #if version in c#, I think we don't need a real interpreter with AST nodes. We can do it by evaluating the parse tree directly (iterating the tree) - just like the SQL Search grammar does when generating SQL; the only difference is that we need to produce a bool value instead of a string. So I suggest having a preprocessor terminal into which you can hook a custom grammar/expression evaluator for conditional compilation. These would be different for c, c#, or assembly.
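A minimal sketch of this "evaluate the tree directly" idea (Python for illustration; the tuple-based node shape and the eval_node function are my assumptions, not the actual parse-tree API):

```python
# Direct evaluation of a conditional-compilation expression tree into a
# bool, without separate AST/interpreter classes. Each node is a tuple:
# ("sym", name) | ("not", child) | ("and", l, r) | ("or", l, r).
def eval_node(node, defined_symbols):
    kind = node[0]
    if kind == "sym":
        return node[1] in defined_symbols
    if kind == "not":
        return not eval_node(node[1], defined_symbols)
    if kind == "and":
        return eval_node(node[1], defined_symbols) and \
               eval_node(node[2], defined_symbols)
    if kind == "or":
        return eval_node(node[1], defined_symbols) or \
               eval_node(node[2], defined_symbols)
    raise ValueError(f"unknown node kind: {kind}")

# tree for: #if DEBUG && !SILVERLIGHT
tree = ("and", ("sym", "DEBUG"), ("not", ("sym", "SILVERLIGHT")))
```

The same walk works for any language whose condition grammar is plugged in; only the node vocabulary changes.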
Implementing skip/include of source fragments. As far as I understood, you are planning to use a token filter for this, which would exclude tokens from "inactive" fragments. I don't think this would work. The problem is that an inactive fragment does not necessarily contain parseable/scannable text; it might be any garbage text. So the "#if.." terminal should fast-forward the source position beyond any inactive fragments, so that the scanner never sees them and never tries to tokenize them.
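Here is a sketch of that fast-forward behavior (Python; skip_inactive_block is a hypothetical helper, meant only to show that the skipped region is advanced over as raw text, honoring nested #if/#endif, without ever being tokenized):

```python
# Hypothetical fast-forward: advance past an inactive #if block without
# tokenizing its contents, so the scanner never sees the garbage text.
def skip_inactive_block(lines, start):
    """lines: the source lines; start: index of the line right after the
    false #if. Returns the index of the matching #endif, honoring nesting."""
    depth = 1
    i = start
    while i < len(lines):
        stripped = lines[i].strip()
        if stripped.startswith("#if"):
            depth += 1
        elif stripped.startswith("#endif"):
            depth -= 1
            if depth == 0:
                return i  # scanning resumes after this line
        # the skipped lines are treated as opaque text, never scanned
        i += 1
    raise SyntaxError("unterminated #if block")
```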
The same goes for macro expansions - I do not think a token filter can be used here either. Macro expansion in c (and similar languages) should be done in the source text. You cannot tokenize the expanded macro once in advance and then reuse the tokens. First, it might not even be possible to tokenize the text containing an "unexpanded" macro in the context of the language grammar. Secondly, the scanner often asks the parser for context (its current state) to figure out how to scan the current input. This is quite an important facility, essential in many languages. With a pre-tokenized macro you can't use it.
But expanding a macro into plain text poses another problem - how to "embed" the generated fragment into the input text. Modifying the source text itself is obviously not a good idea - we'd start generating a lot of "very long" strings. The alternative could be some "chained" SourceStream, where we create a temp SourceStream containing the macro expansion. This temp object is backed by the original source stream, so it lets the scanner transparently go through the macro expansion and then step beyond it into the original stream. This requires rethinking ISourceStream and the Source implementation. I still don't know how to do it - would welcome ideas. One thing that must be done is removing the Text property from the interface and replacing it with methods that let terminals do the same things efficiently enough. (For example: Text is useful for searching for some symbol, like the end-of-comment */; so instead of exposing Text directly, we can expose a SearchSymbol method or something.) So, here's my take on this.
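A rough sketch of the chained-stream shape (Python for illustration; ChainedSourceStream, read_char, and search_symbol are invented names meant only to show the design, including the SearchSymbol-style method replacing a raw Text property):

```python
# Illustrative chained source stream: a temp buffer holding a macro
# expansion, backed by the original source, so the scanner reads through
# the expansion and transparently continues in the original stream.
class ChainedSourceStream:
    def __init__(self, text, backing=None):
        self._text = text        # this segment's text (e.g. expansion)
        self._backing = backing  # stream to continue in afterwards
        self._pos = 0

    def read_char(self):
        if self._pos < len(self._text):
            ch = self._text[self._pos]
            self._pos += 1
            return ch
        if self._backing is not None:
            return self._backing.read_char()  # step into original stream
        return ""  # end of input

    def search_symbol(self, symbol):
        """Instead of exposing the whole Text, search for a symbol
        (e.g. the end-of-comment '*/') from the current position."""
        return self._text.find(symbol, self._pos)  # -1 if absent here
```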
Having a custom scanner for the preprocessor - I don't think that would pay off. Even if this custom scanner were really fast, we wouldn't feel the improvement, as it would be used on something like 1% of the source lines.
About StyleCop - not a big fan of all these tools. But I do understand the importance of consistent style and formatting - and I hope you'd agree I'm consistent in my codebase. It's just that having the tool enforce a certain style is trouble - when you have to consciously deviate from the standard for some reason, or when you prototype or do a quick-and-dirty run of something. That's why I don't use these tools and don't recommend them to anybody. Thanks for the input; let's continue and crack this preprocessor finally.