custom escapes (not the \uXX kind)

Jul 7, 2009 at 12:27 PM
Edited Jul 7, 2009 at 12:57 PM

I'm using Irony and am absolutely thrilled with the results, except that I can't get it to accept custom escapes for characters that would otherwise match other terminals. For example, I have input that is a list of comma-separated values:

 

[ABC:1234, DEF:asdf343434, GHi:5kfk9 ]

 

Between the colon and the comma, I want to allow any number of \, or \] escapes without the , or ] being treated as an early end of the item or the list.

I suspect I can do this somehow with Token Filters... but I can't figure it out (since it's not brace matching or whitespace preservation, haha) .. I've been staring at this all night...  any ideas?

I realize this makes it look silly, but it's not my data, I'm just parsing it.

 

Here's an example of the 'mangled' escaped text I'd like for it to treat the same way as the above.

[ABC:1234, DEF:asdf3[43434\], GHisds:2\,5kfk9 ]

 listItem.Rule = plain_ident + Symbol(":") + tricky_ident_accepting_strange_chars_n_escapes;

 listItemList.Rule = MakeStarRule(listItemList, Symbol(","), listItem);

 bracked_list = Symbol("[") +  listItemList.Q() + Symbol("]");

 

Sorry for the incompleteness the first time round.

Jul 7, 2009 at 3:59 PM

Trying some more things. This didn't quite work, but I think I'm close here...

RegexBasedTerminal accex = new RegexBasedTerminal("acc", @":(([\S\w][^,])|(\\.))+", ":");

listItem.Rule = plain_ident + accex;

Still doesn't work. The regex itself matches when I call Regex.Match on its own, but the terminal never picks up the tokens.

--dave

Coordinator
Jul 7, 2009 at 4:19 PM

You need to create a custom terminal class that can read this sequence; I'm afraid none of the standard Irony terminals can do it. As a variation, you can use CustomTerminal and provide a custom Match function, but I think a custom terminal class would work better.

One tricky part is specifying prefixes for the terminal. The Irony scanner uses the prefix list to select candidate terminals for scanning, based on the current character in the input. Here's how it should work: you specify all a-z characters, plus digits, plus the backslash as prefixes. The scanner could then get confused, because this terminal is kind of an "I can parse anything" thing, so it would be called for almost any input character, including identifiers, numbers, etc.

However, there's a remedy in this case. If you use the latest version of Irony from the Sources page, it has a facility called "Scanner-Parser link". When the scanner has more than one candidate terminal, it can ask the parser "what terminals do you expect in the current state?" and filter out all terminals that are not expected. Look at the Basic sample: it uses this facility to distinguish between the "number", "file number" and "line number" terminals. They all look alike by content, but the scanner uses the parser's help to pick the correct one. So in your case, once the parser has read the ":" symbol, it expects only your fancy custom terminal, so it should be picked up correctly.

As for the implementation of the terminal's TryMatch method, I think you can easily write it: just scan chars one by one and watch for the escaping backslash, so that you don't treat the following comma or bracket as the end.
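
A minimal sketch of that char-by-char scan, written against a plain string rather than Irony's ISourceStream so it can run on its own. The names EscapeScan and FindTokenEnd are hypothetical and not part of Irony; the terminator set is passed in as an assumption about the input format.

```csharp
using System;

public static class EscapeScan {
    // Returns the index of the first unescaped terminator at or after 'start',
    // or text.Length if no unescaped terminator is found.
    public static int FindTokenEnd(string text, int start, string terminators) {
        int i = start;
        while (i < text.Length) {
            char c = text[i];
            if (c == '\\' && i + 1 < text.Length) {
                // Skip the backslash and the char it escapes, so an escaped
                // comma or bracket is never tested as a terminator.
                i += 2;
                continue;
            }
            if (terminators.IndexOf(c) >= 0) break; // unescaped terminator
            i++;
        }
        return i;
    }
}
```

Skipping the backslash and the escaped char as a pair also covers an escaped backslash: in a\\,b the second backslash is consumed together with the first, so the comma that follows still ends the token.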

Another note: in the last expression, if you define the list as a star-list, you don't need the Q() in listItemList.Q(); it is already an optional list. The extra Q() may in fact introduce parsing conflicts.

Let me know how it works for you, or if you need any  more info.

Roman

Jul 7, 2009 at 6:01 PM

alright I'll give it a shot.

I have to admit I'm a little confused by what I see out there, but I think I can get it. (It took me a while to figure out the type-dispatched constructor logic for the AST nodes etc, it was too clever for me to take in all at once!)  

I'm going to start the terminal at the ':'; that way I don't have to worry about all the crazy possibilities for prefixes. The left-hand side of the ':' doesn't need to support escaping, which is good.  (Good call on the .Q() too)

I'll just fire up a custom Terminal class and see how I can rock it.

I'm really anxious to get this to work, because the parser and tree generation are flawless except for this one bit, and only about one out of every 50,000 entries I'm parsing actually has one of these little buggers in it, so I don't want to get all drastic and filter them out of the input stream or something.  I do have to say that working with Irony has been really great. Indeed, this one issue would be tough in any parsing tool, not just Irony.

 

Coordinator
Jul 7, 2009 at 6:13 PM

well, good luck

This looks like a good case for a "standard" terminal that can flexibly do this kind of thing. It might have optional start and end symbols ("]" and "," in your case), but unlike a string literal, the end symbol is not part of the token. Plus an escape table that code can fill in: {"\]" -> "]" ; "\,"-> ","}, something like this...
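
A rough sketch of how such a table-driven reader might look, independent of Irony so it can run on its own. FreeTextSketch and Read are hypothetical names; the escape table and terminator list are supplied by the caller, and the terminator itself is left in the input rather than consumed.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FreeTextSketch {
    // Reads from 'pos' until an unescaped terminator or end of text.
    // Escape sequences are replaced by their table values; the terminator
    // is not part of the returned value and 'pos' is left pointing at it.
    public static string Read(string text, ref int pos,
                              IDictionary<string, string> escapes,
                              ICollection<string> terminators) {
        var value = new StringBuilder();
        while (pos < text.Length) {
            bool replaced = false;
            foreach (var esc in escapes) {
                // Empty keys are ignored so the loop always makes progress.
                if (esc.Key.Length > 0 &&
                    string.CompareOrdinal(text, pos, esc.Key, 0, esc.Key.Length) == 0) {
                    value.Append(esc.Value);
                    pos += esc.Key.Length;
                    replaced = true;
                    break;
                }
            }
            if (replaced) continue;

            bool atEnd = false;
            foreach (var t in terminators) {
                if (string.CompareOrdinal(text, pos, t, 0, t.Length) == 0) {
                    atEnd = true;
                    break;
                }
            }
            if (atEnd) break;

            value.Append(text[pos]); // ordinary character
            pos++;
        }
        return value.ToString();
    }
}
```

Escapes are checked before terminators at each position, so an entry like "\," is always consumed as a whole before the bare "," can end the token.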

I'll probably give it a try. Or, if you write your terminal in such a generic way, maybe you can share it? I'll include it in Irony with proper credits.

Roman

 

Jul 7, 2009 at 8:18 PM

 

Not exactly generic, but it's what I used that worked.  I will perhaps get a chance to pretty it up at some point here.  Thanks again for your help!
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Irony.Compiler {
    public class AccessTerminal : Terminal {

        public AccessTerminal(string nam)
            : this(nam, string.Empty) {
        }

        public AccessTerminal(string nam, string exTerms) :
            base (nam, TokenCategory.Content) {

            if (!string.IsNullOrEmpty(exTerms))
                extraTerminators = exTerms;
            else
                extraTerminators = defaultExtraTerminators;

        }

        public override IList<string> GetFirsts() {
            return new List<string>(new string[] { ":" });
        }
        const string defaultExtraTerminators = ", \t\r\n\v]";
        
            
        string extraTerminators =  null;

        bool CharIsTerminator(char c) {
            return extraTerminators.IndexOf(c) != -1;
        }

        public override Token TryMatch(CompilerContext context, ISourceStream source) {

            // The token must start with ':' followed by a non-whitespace char; makes it simpler.
            if (source.CurrentChar != ':') return null;
            if (char.IsWhiteSpace(source.NextChar)) return null;
            // consume the ':'
            source.Position++;

            bool isEscaped = false;
            while (!source.EOF()) {
                char c = source.CurrentChar;
                if (isEscaped) {
                    // this char was escaped by the preceding backslash: take it as-is
                    isEscaped = false;
                }
                else if (c == '\\') {
                    // starting an escape
                    isEscaped = true;
                }
                else if (CharIsTerminator(c)) {
                    // unescaped terminator: the token ends here
                    break;
                }

                // advance exactly one char per iteration, so the char
                // right after an escaped one still gets checked
                source.Position++;
            }
            return Token.Create(this, context, source.TokenStart, source.GetLexeme());
        }
    }
}

 

 

Coordinator
Jul 7, 2009 at 8:51 PM

Looks ok, except one possible issue - does your input format allow "escaped backslash"? or several escaped backslashes?

If it does, make sure it works correctly. We had this issue in the string literal (look there): we had to introduce an extra flipping flag. But the string literal is implemented differently, it uses search, so your implementation going char by char might be ok as it is.

Also, I guess the escaping backslash itself is not part of the value, you should skip it. Your code blindly includes it into the token when you call GetLexeme() - you grab all chars from token start to current position.

I think you should form the string as you go skipping the escape char.
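
Something along these lines, as a standalone sketch over a plain string rather than an Irony terminal. UnescapeSketch and ReadValue are hypothetical names; it assumes a backslash escapes exactly one following character, including another backslash.

```csharp
using System;
using System.Text;

public static class UnescapeSketch {
    // Builds the token value while scanning: the escaping backslash is
    // dropped and the char after it is kept literally. 'end' receives the
    // index of the unescaped terminator (or text.Length if there is none).
    public static string ReadValue(string text, int start, string terminators, out int end) {
        var value = new StringBuilder();
        bool isEscaped = false;
        int i = start;
        for (; i < text.Length; i++) {
            char c = text[i];
            if (isEscaped) {
                value.Append(c);      // escaped char: keep it, even ',' ']' or '\'
                isEscaped = false;
            } else if (c == '\\') {
                isEscaped = true;     // drop the backslash itself
            } else if (terminators.IndexOf(c) >= 0) {
                break;                // unescaped terminator ends the value
            } else {
                value.Append(c);
            }
        }
        end = i;
        return value.ToString();
    }
}
```

With terminators ",]" the input asdf3[43434\], GHi gives the value asdf3[43434] and stops at the comma. Because the flag is cleared after every escaped char, a run of escaped backslashes is handled without extra bookkeeping.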

congrats and good luck

Roman

Coordinator
Aug 3, 2009 at 2:35 AM

The latest source drop contains a new FreeTextLiteral terminal that implements this functionality in a generic way.

Thanks again for suggesting this

have fun!

Aug 3, 2009 at 5:13 PM

you're the man, roman.