Re: Project #8 - Validated RegExes

Oct 13, 2012 at 3:50 AM

For my music parser I sub-classed RegexBasedTerminal to accept only the enumeration constants of an Enum type, as follows:

   /// <summary>
   /// RegexBasedTerminal sub-class that parses the constants of enumeration <i>TValue</i> 
   /// in determining the Value of the Terminal.
   /// </summary>
   /// <typeparam name="TValue">Must be an <i>Enum</i> type, but only enforced at run-time
   /// as a constraint like <i>TValue : System.Enum</i> is forbidden.</typeparam>
   public class MyRegexTerm<TValue> : RegexBasedTerminal where TValue : struct {
      /// <summary>
      /// Creates a terminal matching <i>pattern</i>; <i>fromString</i>, if supplied, replaces
      /// <i>Enum.Parse</i> when converting the matched text to a <i>TValue</i>.
      /// </summary>
      public MyRegexTerm(string pattern, Func<string,TValue> fromString = null) 
         : base(pattern) {
         // typeof(TValue) also works when a non-generic class derives from this one.
         EnumType               = typeof(TValue);
         if (!EnumType.IsEnum) throw new ArgumentOutOfRangeException("TValue",EnumType.Name,
                                                "Generic argument must be Enum type.");
         Name                  = EnumType.Name;
         FromString            = fromString;
         AstConfig.NodeType    = typeof(LiteralValueNode<TValue>);
      }

      protected Func<string,TValue> FromString   { get; set; }
      protected Type                EnumType     { get; set; }
      protected virtual TValue ConvertValue(ParsingContext context, string textValue) {
         return (FromString == null) ? (TValue)System.Enum.Parse(EnumType,textValue.ToUpperInvariant())
                                     : FromString(textValue);
      }

      public override Token TryMatch(ParsingContext context, ISourceStream source) {
         Token token = base.TryMatch(context, source);
         if (token != null)    token.Value = ConvertValue(context,token.ValueString);
         return token;
      }
   }

Oct 13, 2012 at 3:57 AM
Edited Oct 13, 2012 at 3:58 AM

Here are some instances of its use in the music parser:

   var modeStyle  = new MyRegexTerm<Style>(@"[NLS]");
   var shift      = new MyRegexTerm<OctaveShift>(@"O?[<>]", 
                    s => OctaveShift.Up.FromString(s));
   var noteLetter = new MyRegexTerm<NoteLetter>(@"[CDEFGAB]");
   var sharpFlat  = new MyRegexTerm<SharpFlat>(@"[-#+]", 
                    s => SharpFlat.Natural.FromString(s));

Where the enumeration's constant names match the tokens, System.Enum.Parse(Type enumType, string value) is used;
otherwise a custom validator can be provided, as for the Terminals shift and sharpFlat
above.
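
The FromString extension methods used for shift and sharpFlat are not shown in this thread. A minimal sketch of what one might look like, assuming SharpFlat has members named Flat, Natural and Sharp (only Natural appears above; the other names, the numeric values and the symbol mapping are guesses):

```csharp
using System;

// Hypothetical enum: only SharpFlat.Natural is named in the post above.
public enum SharpFlat { Flat = -1, Natural = 0, Sharp = 1 }

public static class SharpFlatExtensions {
   // Extension method so that SharpFlat.Natural.FromString(s) compiles;
   // the receiver only selects the overload and is otherwise unused.
   public static SharpFlat FromString(this SharpFlat unused, string s) {
      switch (s) {
         case "-":  return SharpFlat.Flat;
         case "#":
         case "+":  return SharpFlat.Sharp;
         default:   throw new ArgumentOutOfRangeException("s", s,
                                 "Expected one of '-', '#' or '+'.");
      }
   }
}
```

Symbols such as # cannot be enum constant names, which is why Enum.Parse alone cannot serve a terminal like this one.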

It also has the beneficial side-effect of naming the Terminal after the Enum type ;-)

Pieter


Coordinator
Oct 16, 2012 at 6:14 PM

Nice idea, but keep in mind that RegexBasedTerminal is very slow compared to other terminals like KeyTerm. You are using regexes to recognize a fixed set of names and convert them to enum values. If you're OK with trading performance for a slick definition, that's fine. But that's not the case for everyone.

Oct 16, 2012 at 9:51 PM

Do you have any measurements of how bad the efficiency trade-off is? Every well-designed Regex for an application like this will be non-backtracking, and should replace multiple instances of other terminals.

I have always designed primarily for readability, maintainability and verifiability, and worried about performance only where a problem has been identified; but that has limits.

Coordinator
Oct 17, 2012 at 6:05 AM

Well, no exact numbers, just common sense. Regex is a whole engine, with its own complex pseudo-code. By contrast, recognizing a few string constants by simple sub-string matching (as in the case of KeyTerms) is a much simpler and more straightforward operation. Again, if your music parser parses relatively small files and interprets them, it does not matter whether it takes 1 ms or 10 ms to recognize a term. But if you are processing huge data files, or code files, it does matter.

So RegexBasedTerminal works fine for you, but should definitely be avoided in perf-critical cases.
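
For anyone wanting rough numbers rather than intuition, here is a minimal timing sketch. It does not use Irony at all: it only compares a compiled Regex against a plain character lookup for the single-letter pattern [CDEFGAB] from the posts above, so it says nothing about the scanner overhead around either terminal type. The class name, input string and iteration count are arbitrary choices for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class MatchTiming {
   const string Letters = "CDEFGAB";
   // \G anchors the match at the start position, as a scanner would require.
   static readonly Regex NoteRegex = new Regex(@"\G[CDEFGAB]", RegexOptions.Compiled);

   // Regex-based check: is there a note letter exactly at position pos?
   public static bool RegexMatch(string text, int pos) {
      return NoteRegex.Match(text, pos).Success;
   }

   // Plain lookup, roughly the work a fixed-string terminal has to do.
   public static bool DirectMatch(string text, int pos) {
      return pos < text.Length && Letters.IndexOf(text[pos]) >= 0;
   }

   public static void Main() {
      const string input  = "CDEFGABXCDEFGABX";
      const int    rounds = 1000000;
      int hits = 0;

      var sw = Stopwatch.StartNew();
      for (int i = 0; i < rounds; i++)
         if (RegexMatch(input, i % input.Length)) hits++;
      Console.WriteLine("Regex : {0} ms", sw.ElapsedMilliseconds);

      sw.Restart();
      for (int i = 0; i < rounds; i++)
         if (DirectMatch(input, i % input.Length)) hits++;
      Console.WriteLine("Direct: {0} ms", sw.ElapsedMilliseconds);
      Console.WriteLine("hits  : {0}", hits);
   }
}
```

Results will vary with runtime version and machine, and a micro-benchmark like this is only a starting point; profiling the real parser on real input is the more reliable measurement.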

Oct 17, 2012 at 6:30 AM

OK; but you might be surprised. The first version of the Music parser was built entirely with Regex, with the grammar twisted around so that every non-terminal comprised exactly 3 tokens to keep the groups synchronized. When that implementation hit the limit I went searching for another tool and found Irony. But Irony required twice as long to parse the same (simple yet twisted) grammar as Regex did. Granted, that is not an apples-to-apples comparison, and it was only 50 ms (Regex) compared to 100 ms (Irony) for a music piece several minutes long. In 25 years, the only time Regex has misbehaved for me, it was my fault for being careless.

Oct 17, 2012 at 6:31 AM

I see you found my profile; too bad it's not up to date. Did you see any of my bridge results?

Coordinator
Oct 17, 2012 at 5:31 PM

About Irony's slowness compared with a pure Regex parser: as far as I understand, your Irony-based parser used Regex terminals extensively? Then what you're comparing is one regex arrangement against another. That's not to say that Irony is perfect on the perf side; I'm thinking about giving the scanner and terminals a good refactoring, with improving performance as one of the goals.

Yes, I looked you up on Google; I was curious whether you are a musician turned compiler hacker or the other way around. Looks like neither :)

Oct 17, 2012 at 7:37 PM

All true, and I was probably abusing Irony as much as using it at that stage. Before you tackle your refactoring, check out my MyIrony wrapper; it may provide occasional inspiration. Since I am actually building parsers for two distinct music grammars, I have abstracted as much (Irony.dll based) common code as possible into that.

I am a physicist who took up software development 30 years ago. When my most recent project completed I searched for a project to sink my teeth into while learning C#. I started with porting the old Q-Basic Gorilla game (which I first ported to VB-6 about 15 years ago) over to C#, but got sidetracked into enabling the sound strings, and then scope-creep took over.

I have built a handful of DSLs over the years with lex/yacc derivatives, and couldn't resist the allure of Irony when I stumbled across it here:
http://www.codeproject.com/Articles/26975/Writing-Your-First-Domain-Specific-Language-Part-1
I have been fascinated by compilers since reading Gödel, Escher, Bach 30 years ago (and also aced Bob Tennent's Denotational Semantics course way-back-when, if that name means anything to you).