Recursive string literal

Mar 4, 2010 at 7:57 PM

Hello all,

  I'm investigating Irony for a simple templating language I'm developing.  This template language can have script embedded within it.  I don't want to parse the script, I only want to extract it from my template and hand it off to another parser (this is a 3rd party script and parser).  My syntax is going to look something like this:

 

-script {

  other script language goes in here.

}

 

I'd like to return everything within the curly braces as a string literal so that I can just pass it off to the other parser.  The catch, though, is that the other script language uses curly braces itself, so my string literal has to be smart enough to match all embedded curly braces before allowing the final curly brace to close the string literal.  Here is an example:

 

-script { otherScriptMethodDeclaration(params) {

  if(something) {

    ...

  }

}}

 

How would I do this in Irony?  Write a custom StringLiteral?  Any tips on doing that?

Thanks!

Coordinator
Mar 5, 2010 at 5:52 PM

Yeah, it looks like you need a custom string literal that allows nested start/end symbols. This is in fact a common case: some languages (Lua, as far as I remember) have nested comments and string literals with nested start/end symbols, so implementing this facility is on my to-do list. For now, you'll have to do it yourself.

Roman

Mar 5, 2010 at 8:43 PM
Edited Mar 5, 2010 at 8:47 PM

OK, I hacked around and got something working, but I should warn you that I did not spend much time learning Irony's architecture, so this may be really botched up. :)

Here is the "EmbeddedScriptLiteral" I created:

 

  public class EmbeddedScriptLiteral : Terminal
  {
    public Terminal[] QuoteAndCommentTerminals { get; set; }

    public string StartSymbol { get; set; }
    public string EndSymbol { get; set; }

    public EmbeddedScriptLiteral(string name, string startSymbol, string endSymbol,
                                 params Terminal[] quoteAndCommentTerminals)
      : base(name)
    {
      QuoteAndCommentTerminals = quoteAndCommentTerminals;
      StartSymbol = startSymbol;
      EndSymbol = endSymbol;
    }

    public override IList<string> GetFirsts()
    {
      return new StringList(StartSymbol);
    }

    public override Token TryMatch(ParsingContext context, ISourceStream source)
    {
      if(context.VsLineScanState.Value != 0)
      {
        context.VsLineScanState.Value = 0;
      }
      else
      {
        if(!BeginMatch(context, source)) return null;
      }

      Token result = CompleteMatch(context, source);
      if(result != null)
      {
        return result;
      }
      return source.CreateErrorToken("Unclosed Script Block");
    }

    protected virtual bool BeginMatch(ParsingContext context, ISourceStream source)
    {
      if(!source.MatchSymbol(StartSymbol, !Grammar.CaseSensitive))
      {
        return false;
      }
      source.PreviewPosition += StartSymbol.Length;
      return true;
    }

    protected virtual bool EndMatch(ParsingContext context, ISourceStream source)
    {
      if(!source.MatchSymbol(EndSymbol, !Grammar.CaseSensitive))
      {
        return false;
      }
      source.PreviewPosition += EndSymbol.Length;
      return true;
    }

    protected virtual Token CompleteMatch(ParsingContext context, ISourceStream source)
    {
      int startingPosition = source.PreviewPosition;
      SourceLocation initialLocation = source.Location;
      // The whole point here is to match Start and End Symbols... so, for each
      // StartSymbol, there has to be a corresponding EndSymbol.  We start at 1
      // because we've already matched the first StartSymbol.
      int braceLevel = 1;

      while(!source.EOF() && braceLevel > 0)
      {
        if(SkipQuoteOrComment(context, source))
        {
          continue;
        }

        if(BeginMatch(context, source))
        {
          braceLevel++;
        }
        else if(EndMatch(context, source))
        {
          braceLevel--;
        }
        else
        {
          source.PreviewPosition++;
        }
      }

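      // If we left the loop with braces still open, we hit EOF before the
      // block was closed; return null so TryMatch reports the error token.
      if(braceLevel > 0) return null;

      var previewPosition = source.PreviewPosition;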

      // Grab all text between opening and closing symbols.
      var endPos = source.PreviewPosition - EndSymbol.Length;
      if(endPos > source.Text.Length) endPos = source.Text.Length;

      source.Location = initialLocation;
      source.PreviewPosition = previewPosition;

      var ret = source.CreateToken(OutputTerminal, source.Text.Substring(startingPosition, endPos - startingPosition));
      source.PreviewPosition = previewPosition;
      return ret;
    }


    protected virtual bool SkipQuoteOrComment(ParsingContext context, ISourceStream source)
    {
      source.Location = new SourceLocation(source.PreviewPosition, source.Location.Line, source.Location.Column);
      foreach (var quoteOrCommentTerminal in QuoteAndCommentTerminals)
      {
        var token = quoteOrCommentTerminal.TryMatch(context, source);
        if(token != null)
          return true;
      }

      return false;
    }
  }

I set it up like this (note, I'm pretending C# is the embeddable "script" language):

 

...
      // C# Script Blocks
      var CSharpScriptLiteral = new EmbeddedScriptLiteral("CSharpScriptLiteral", "{", "}",
                                                        CSStringLiteral, CSCharLiteral, CSSingleLineComment,
                                                        CSDelimitedComment);

...

In my custom language you surround a script block with "{" and "}", and this EmbeddedScriptLiteral will pull all text until the final closing brace.  While it doesn't parse the script in the block, per se, it does do brace matching while also ignoring braces found in strings and comments so that it knows where the block ends.  Beyond that, it returns all enclosed text without any processing at all (escape sequences are kept intact so that the target scripting engine can handle them).
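To show the brace-matching idea on its own, outside of Irony, here is a minimal standalone sketch. It is not the terminal above: `BraceScanner` and `ExtractBlock` are hypothetical names made up for illustration, and instead of delegating to quote/comment terminals it assumes a C#-like script with double-quoted strings, single-quoted char literals, `//` comments and `/* */` comments (verbatim `@"..."` strings are not handled).

```csharp
using System;

public static class BraceScanner
{
    // Returns the raw text between the '{' at openIndex and its matching '}',
    // or null if openIndex is not a '{' or the block is never closed.
    public static string ExtractBlock(string text, int openIndex)
    {
        if (openIndex < 0 || openIndex >= text.Length || text[openIndex] != '{')
            return null;

        int depth = 1;              // the opening brace is already counted
        int i = openIndex + 1;
        while (i < text.Length)
        {
            char c = text[i];
            bool hasNext = i + 1 < text.Length;
            if (c == '"' || c == '\'')
            {
                // Braces inside string/char literals must not be counted.
                i = SkipQuoted(text, i, c);
            }
            else if (c == '/' && hasNext && text[i + 1] == '/')
            {
                // Single-line comment: skip to the end of the line.
                int eol = text.IndexOf('\n', i);
                i = eol < 0 ? text.Length : eol + 1;
            }
            else if (c == '/' && hasNext && text[i + 1] == '*')
            {
                // Delimited comment: skip past the closing "*/".
                int end = text.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = end < 0 ? text.Length : end + 2;
            }
            else
            {
                if (c == '{') depth++;
                else if (c == '}' && --depth == 0)
                    return text.Substring(openIndex + 1, i - openIndex - 1);
                i++;
            }
        }
        return null;                // reached end of text with braces still open
    }

    // Skips a quoted literal with backslash escapes; returns the index just
    // past the closing quote, or text.Length if the literal is unterminated.
    private static int SkipQuoted(string text, int start, char quote)
    {
        int i = start + 1;
        while (i < text.Length)
        {
            if (text[i] == '\\') { i += 2; continue; }
            if (text[i] == quote) return i + 1;
            i++;
        }
        return text.Length;
    }
}
```

For example, `BraceScanner.ExtractBlock("-script { a { b } c }", 8)` should hand back the text between the outer braces untouched, and an input whose outer brace is never closed should give null.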

It seems to be working in my limited test cases.  Let me know if this looks totally wrong to you. :)

 

 

Coordinator
Mar 10, 2010 at 7:36 PM

Looks OK to me, as long as it works properly. The only problem I see is efficiency: when you scan the content of the terminal, you try EVERY quoteOrComment terminal at EVERY position inside it. It would be better to apply some optimization: collect all the prefixes of the quote/comment terminals and search for the first characters of those prefixes using Text.IndexOfAny(char[]).
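A rough sketch of that optimization, kept independent of Irony so that it does not assume anything about the library's API. `FastScan`, `BuildTriggerChars` and `NextCandidate` are hypothetical helper names, and the prefix list is assumed to be supplied by the caller (start/end symbols plus the opening symbols of each quote and comment terminal). Only `string.IndexOfAny(char[], int)` is a real framework call here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FastScan
{
    // Collects the distinct first characters of all prefixes that can start
    // something interesting: a brace, a quote, or a comment.
    public static char[] BuildTriggerChars(IEnumerable<string> prefixes)
    {
        return prefixes.Where(p => !string.IsNullOrEmpty(p))
                       .Select(p => p[0])
                       .Distinct()
                       .ToArray();
    }

    // Returns the next position at or after 'from' where a trigger character
    // occurs, or -1 if there is none. Everything in between can be skipped
    // without trying any terminal.
    public static int NextCandidate(string text, int from, char[] triggers)
    {
        return from >= text.Length ? -1 : text.IndexOfAny(triggers, from);
    }
}
```

In the scanning loop, the idea would be to jump straight to `NextCandidate(...)` instead of advancing one character at a time, and only then try the start/end symbols and the quote/comment terminals at that position. If the grammar is case-insensitive and a symbol starts with a letter, both cases of that letter would need to be in the trigger set.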

Mar 11, 2010 at 12:02 AM

I had the same thought... my goal with the above code is simply a proof of concept, which seems to be working.  If it becomes a performance issue for us, that will be the first optimization we try. :)  Thanks for your comments!