P6 - Heredoc Terminals

Aug 20, 2011 at 8:31 PM
Edited Aug 20, 2011 at 8:32 PM

I've implemented Heredoc terminals for Irony.  The implementation works by modifying the source, and follows the Ruby standard shown in the example.  I ran a series of tests through irb to ensure the behavior was correct to the way Ruby handles it.  I found the following set of rules for heredocs in ruby:

1) The start tag (<<TOKEN) counts as a string on the same line, which is why you can start more than one heredoc and intersperse another string between them using concatenation.

2) The text for the heredoc must start of the line following the start tag, it cannot be on the same line as the start tag.

3) When more than one heredoc is started on a line, their ending tokens must follow the same pattern, i.e, <<BEGIN + <<END would need the BEGIN end tag before the END end tag.

This led me to believe that Ruby parses the expression <<BEGIN + "<--- this is the middle --->\n" + <<END

This is the beginning.

BEGIN

And now it's over!

END

Like this:

Encountering the <<BEGIN tag, we first retrieve the token, which is everything after the start symbol and before the first space, and then store the location right after the entire start tag.  We then skip the end of the line, and begin processing the heredoc there.  We check for the presence of the end tag based on whether or not indentation is allowed, and if it is not encountered, we add the entire line to the heredoc and continue processing at the next line.  Errors happen when the end tag is not encountered before EOF, and this includes the case where the END tag is indented on a heredoc that does not allow it.

The implementation can be found at http://bitbucket.org/shadowphoenix/irony in Irony/Parsing/Terminals/HeredocTerminal.cs

 

You create it by calling new Heredoc(string name) which creates a default non-indenting heredoc with the start symbol "<<", or new Heredoc(string name, string startSymbol, HeredocFlags flags) with the specified name and start symbol.  The flags are either HeredocFlags.None or HeredocFlags.AllowIndentedEndTag.  A few tests are included to test it works for simple heredocs including error processing.

Coordinator
Aug 21, 2011 at 12:26 AM

Hi

this is great! I will look at this as soon as I can.

thank you!

Roman 

Coordinator
Aug 21, 2011 at 2:21 AM

Got a look at it, and I'm afraid that wouldn't work. Replacing source text is not a good idea. Aside from issues of efficiency (creating long strings - subcopies of source text is bad enough) - the problem is that replacing source text messes up token positions for all source after the hereDoc. For example, if there's a syntax error in the following text, it's reported position would point to a wrong position in the original text. 

One small thing that scared sh... out of me - file header in HeredocTerminal.cs file. That's a true open source project, no MS-copyrighted stuff please. I hope that's just an oversight, your VS template probably created this header by default. 

Other things

  1. You don't need to inherit from CompoundTerminalBase, you don't use any of the functionality in this class. Inherit from Terminal
  2. Terminal for created token - do not use this static singleton. The whole point of using different OutputTerminal (other than "this") is to make the token seem the same as some other "regular" terminal, like StringLiteral already defined in the grammar. So by default use this.OutputTerminal (it is initialized to the terminal itself). Grammar writer can assign reference to other string literal if he needs. Alternatively, you can add constructor with outputTerminal parameter.
  3. Do not use underscore prefix for local variables (_token)
  4. Change HeredocFlags to HereDocOptions, to be consistent with other terminal option enums. Do not check _flags field for equality (if (_flags == ...), check individual flag using HasFlag method. It works for a "one-value" enum you have now, but will stop working if you add more flags. 
  5. Change casing to HereDocTerminal
  6. As for "allow indent option". As far as I understand, grammar writer would have to define 2 HereDocTerminals for language like Ruby. I would prefer one terminal with a list of startSymbol/options pairs. Look at StringLiteral and StringSubType option class.
  7. Remove Init method, it does not add anything
  8. Good practice: when you have more than one constructor, make only one (main) constructor call "base" constructor; all other constructors should call this "main" constructor.

Anyway, thanks so much, hope to see a new version soon!

Roman

 

 

Aug 21, 2011 at 7:33 PM
Edited Aug 21, 2011 at 7:58 PM

The header in the file was a template and an oversight, I'm terribly sorry for that. One issue I have though is how do we handle the Ruby style heredoc's without replacing the source?  Does irony have a way to mark tokens as consumed?

 

For example, in the text:

 

<<BEGIN + "<--- this is the middle --->\n" + <<END
This is the beginning.
BEGIN
And now it's over!
END

The terminal would consume

<<BEGIN This is the beginning. BEGIN

, but the offset would be set back to the + sign.  The + plus sign, the following literal, and the next plus sign would be treated okay, but our END heredoc would end up like this:

<<END
This is the beginning.
BEGIN
And now it's over!
END

Giving our end heredoc the value:

This is the beginning.
BEGIN
And now it's over!

This isn't so bad when facing simple heredocs, but given that the tokens have already been consumed, how can I tell another terminal to skip that text without removing it?

I have updated the code with your other suggestions though, and appreciate you looking it over for me.

Coordinator
Aug 21, 2011 at 8:59 PM

Well, looks like you should keep some variable that keeps the position of already consumed here doc lines, I guess. That's the thing to figure out. You can keep this variable in ParsingContext, there's a dictionary there for shared variables.

Alternatively, you can try to consume all hereDocs from one line in one hereDocTerminal.TryMatch call - then you have to handle concatenation with the middle part inside this call.

Aug 21, 2011 at 9:36 PM

Thank you for the suggestion.  I now store each line that gets consumed by a heredoc with the text and offset, and during processing I check to see if that line is already consumed, and if not, consume it and add it.  I also store the offset and text of each line containing an ending tag so that the tag is also skipped to avoid the scenario shown in my post above.  The new code is up on my bitbucket if you would like to take a look at it.

Coordinator
Aug 22, 2011 at 3:21 AM
Edited Aug 22, 2011 at 6:39 AM

 

Looks good! But there are some problems:

  1. Don't sort subtypes in TryMatch, do it once in Init method
  2. Don't use Dict.ContainsKey followed by getItem, use TryGetValue method. Refactor reading ConsumedLines dictionary into a separate method.
  3. When matching subtype, instead of "source.Text.IndexOf(s.Start, source.PreviewPosition) == source.PreviewPosition" - use source.MatchSymbol. IndexOf involves search thru the whole text
  4. Rename AddStart method to AddSubType
  5. Remove outputTerminal field, there's already OutputTerminal public field in Terminal, use it. Assume it is set by grammar writer or has default value (==this)
  6. Remove endOfLineMarkers, use context.Grammar.LineTerminators instead
  7. All private fields should start with "_"
  8. When you find end of line, after locating the terminator (\r or \n), take into account that terminator might be in fact both '\r\n' (standard in Windows text files I think) - so you'd need an extra Pos++ to reach the start of the next line.
  9. When you try to find indented endTag, you actually try to find it anywhere inside the line - even if it is inside the sentence and preceeded by some non-space chars. What you should do I think is Left-Trim the line and check if it starts with endTag. 
  10. When you try to search for a substring in text, you should take into account case sensitivity of the language (Grammar.CaseSensitive) - it might be not necessarily Ruby.
  11. Avoid things like "line.Substring(0, tag.Length) == tag" - Substring generates a new string, which is an extra garbage for GC. Use source.MatchSymbol or string methods that check symbol at position without generating new strings.
  12. Method LongerStartFirst - use String.Compare static method.
  13. At the very end of TryMatch you use "newPos" variable to set source.PreviewPosition - and it makes it a bit unclear; you have to scroll way back to the place you set it and then understand what is the value. Suggest to rename newPos to something more explicit. 
  14. General note. I would kindly ask to alter the style and avoid this multiple "if" in TryMatch. There's some repeating code there - either collapse in one fragment or refactor into a separate method. In general, when you have a fragment that does some isolated operation, it is better to refactor it out into a separate method - so the method name would clearly identify what the fragment is doing, and the whole code structure would become more clear. I advocate code with many small methods, rather than few big ones.  

General algorithm question. 

 I do not quite get the logic around using ConsumedLines list. As far as I understand, you have to keep track of all lines you already included in previous HereDocs, and skip them while you're reading lines for current HereDoc. So it seems that instead of HereDocConsumedLine object, you might keep track of all consumed lines in a HashSet<int> that contains all first positions of consumed lines. The remaining issue is the following: when you finish processing a line containing start tags of HereDocs, you must force Scanner to jump over all consumed lines. Note that this line with start tags might not necessarily end with a HereDoc start tag - it might be a string literal for example. So the problem is - how to catch the moment when scanner finishes the line, and then force a jump over all consumed lines. I don't know the answer out of my head, and suggest to do some research around scanner workflow. Either find an event that HereDoc code can hook into, or suggest some new event. Maybe hook to NewLineTerminal? It is singleton terminal in Grammar, it will be used to produce the final NewLine token, so at this moment you can do the jump.

Good job so far, you're on the right track!

thank you

Roman  

Aug 22, 2011 at 4:29 PM

I'm working on refactoring the code to your suggestions, and thanks for the suggestion about NewLineTerminal.  That was an issue I was aware of, the case where we have something like

<<BEGIN + "<--- this is the middle --->\n"
This is the beginning
BEGIN

The scanner would be reset to the + as before, but after parsing the string, it would be at the This, which was already consumed.  I'll take a look at Scanner workflow and hooking an event to reset the preview position after a line containing a heredoc is processed.

Coordinator
Aug 22, 2011 at 4:40 PM

A couple of thoughts. Each terminal has ValidateToken event, fired when token is produced. So you can hook to this event (Grammar.NewLine.ValidateToken += (handler in HereDoc)). The handler will be invoked when NewLine token is produced - this is the time to force jump scanner over all consumed lines. 

Another thing. Looks like you don't even need to keep dictionary of consumed lines or "first positions" of these lines. You can keep just the position of "beyond consumed lines", the start of the first non-consumed line. Just an int, instead of a dictionary.

Good luck

Aug 22, 2011 at 7:25 PM

I have a version up on bitbucket if you'd like to take a look at it.  I hooked into ParsingContext.TokenCreated, if ValidateToken would be better, let me know.  And I will change the code to store the start of the first non-consumed line which would simplify everything quite a bit.

Aug 22, 2011 at 9:06 PM

I changed it to use ValidateToken.  Also, I did some debug runs when the tests were failing, and I did a check for \r\n like you suggested, and it seemed to not matter.  Keep in mind that I developing this on Windows, where \r\n is the standard newline, and I never saw the extra +1 being needed.

Aug 27, 2011 at 6:18 PM

I have what I believe to be a working version up on bitbucket.  It uses a combination of a newline token event, storing the next line after a heredoc, plus the ability to force a preview position on the scanner to work.  The changes to the scanner and source stream were minimal with just the addition and honoring of the property to force a specific preview position.

Coordinator
Aug 27, 2011 at 6:20 PM

thanks, will look at this!

Aug 28, 2011 at 3:33 PM

Okay, I can completely attest to the version now on bitbucket as working.  The following is the test grammar, followed by the test cases with the resulting parse tree and combined string.

 

var heredoc = new HereDocTerminal("HereDoc", "<<", HereDocOptions.None);
heredoc.AddSubType("<<-", HereDocOptions.AllowIndentedEndToken);
var @string = new StringLiteral("string","\"");
var program = new NonTerminal("program");
program.Rule = heredoc + @"+" + @string + this.NewLine + @"+" + @string 
             | heredoc + @"+" + heredoc + @"+" + @string + this.NewLine 
             | heredoc + @"+" + @string + this.NewLine 
             | heredoc + @"+" + @string + @"+" + heredoc 
             | heredoc + @"+" + heredoc 
             | heredoc;
this.Root = program;
this.MarkPunctuation("+");

Test 1:
p.Parse(@"<<HELLO + ""<--- this is the middle --->\n""
This is the beginning.
It is two lines long.
HELLO 
+ ""And now it's over!""");

<ParseTree>
  <Node Term="program">
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
This is the beginning.
It is two lines long.
" />
    <Node Term="string" AstNodeType="LiteralValueNode" Terminal="StringLiteral" Value="<--- this is the middle --->
" />
    <Node Term="string" AstNodeType="LiteralValueNode" Terminal="StringLiteral" Value="And now it's over!" />
  </Node>
</ParseTree>

This is the beginning.
It is two lines long.
<--- this is the middle --->
And now it's over!

Test 2:
p.Parse(@"<<HELLO + ""<--- this is the middle --->\n"" + <<END
This is the beginning.
It is more than two lines long.
It is three lines long.
HELLO 
And now it's over!
But we have three lines left.
Now two more lines.
Oops, last line! :(
END");

<ParseTree>
  <Node Term="program">
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
This is the beginning.
It is more than two lines long.
It is three lines long.
" />
    <Node Term="string" AstNodeType="LiteralValueNode" Terminal="StringLiteral" Value="<--- this is the middle --->
" />
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
And now it's over!
But we have three lines left.
Now two more lines.
Oops, last line! :(" />
  </Node>
</ParseTree>

This is the beginning.
It is more than two lines long.
It is three lines long.
<--- this is the middle --->
And now it's over!
But we have three lines left.
Now two more lines.
Oops, last line! :(

Test 3:
p.Parse(@"<<HELLO + <<END
This is the beginning.
How are you doing?
HELLO 
I'm fine.
And now it's over!
END");

<ParseTree>
  <Node Term="program">
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
This is the beginning.
How are you doing?
" />
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
I'm fine.
And now it's over!" />
  </Node>
</ParseTree>

This is the beginning.
How are you doing?
I'm fine.
And now it's over!

Test 4:
p.Parse(@"<<HELLO
This is the beginning.
I hope you enjoyed these tests.
HELLO");

<ParseTree>
  <Node Term="program">
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
This is the beginning.
I hope you enjoyed these tests." />
  </Node>
</ParseTree>

This is the beginning.
I hope you enjoyed these tests.

Test 5:
p.Parse(@"<<HELLO + <<MIDDLE + ""<--- And now it's over --->""
This is the beginning.
HELLO
And this is the middle.
MIDDLE");

<ParseTree>
  <Node Term="program">
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
This is the beginning.
" />
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
And this is the middle.
" />
    <Node Term="string" AstNodeType="LiteralValueNode" Terminal="StringLiteral" Value="<--- And now it's over --->" />
  </Node>
</ParseTree>

This is the beginning.
And this is the middle.
<--- And now it's over --->

Test 6:
p.Parse(@"<<HELLO + ""<--- this is the end --->""
This is the beginning.
HELLO");

<ParseTree>
  <Node Term="program">
    <Node Term="HereDoc" Terminal="HereDocTerminal" Value="
This is the beginning.
" />
    <Node Term="string" AstNodeType="LiteralValueNode" Terminal="StringLiteral" Value="<--- this is the end --->" />
  </Node>
</ParseTree>

This is the beginning.
<--- this is the end --->
Coordinator
Aug 29, 2011 at 6:25 PM

Ok, thanks for the good news!

Coordinator
Sep 8, 2011 at 1:21 AM
Edited Sep 8, 2011 at 1:28 AM

Hi

I looked at the code, in fact keep looking at it and keep thinking. Some things I don't quite get - why you need to add this flag ForcePreviewPosition? I assume you can't get without it, but it's not so good to change scanner interface because of a single terminal. Overall, I start to think we should try different approach - what we have now is a bit tricky, and a bit too much code. I believe we can get it better. Here's the plan.

We implement the functionality using a combination of a terminal and token filter. Terminal is used to scan/produce header tokens in the expressions (those that start with "<<<"). At the same time the token filter listens to the token stream from scanner and notices HereDoc terminals. It accumulates them in array (but passes to the parser also) and waits for NewLine token. At this moment it jumps into action. It goes through the list of accumulated HereDoc tokens, takes the "end tag" from each, finds it in the input text after current position, grabs the text and puts it into token.Value field of hereDoc token. After running thru all hereDoc tokens, it pushes the scanner position beyond the last position of HereDoc text(s), clears the accumulated HereDoc token list and - that's it. It starts listening for new HereDoc tokens.

I think this should result in more clear and understandable code. And you won't need to interfere with scanner working so much. Good thing your unit tests don't need to be changed.

One interesting extra twist: to define ITokenFilter interface (instead of base class), and then combine HereDocTerminal and token filter (make terminal implement ITokenFilter) - making a single self-contained class instead of two.

Can you please try this? I understand, that's a bit frustrating, to change the course when code is already completed and seems to be working ok. Sorry and thanks in advance.

Roman