How to handle empty fields?

Jun 21, 2010 at 5:01 PM

Hello, I'm working on a parser for a particular subset of MediaWiki markup. Typically this looks like

{{a|b|c|d}}

Now, this is recursive, so you can write something like

{{a|b|{{e|f}}|d}}

I've defined

Template.Rule = "{{" + DelimData + "}}";
Link.Rule = "[[" + DelimData + "]]";
ExternalLink.Rule = "[" + ExternalLinkDelim + "]";
RichText.Rule = Text | Link | Template | ExternalLink | RichText + RichText;
DelimData.Rule = RichText
                    | RichText + "|" + DelimData;

 

This sort of works, so long as none of the field values are empty. However, the parser fails when people leave positional parameters empty, for instance:

{{|b|c|d}}
{{a|b||d}}
{{a|b|c|}}
{{a|||d}}
By adding more branches to DelimData.Rule I've been able to parse some of the inputs above. For instance,
DelimData.Rule = RichText
                    | RichText + "|" + DelimData
                    | "|" + DelimData
                    | DelimData + "|"
                    | "|";
accepts everything above except for the last one ({{a|||d}}). I'm sure there's a simple and general way to get the behavior I want, but at the moment it eludes me.

Coordinator
Jun 21, 2010 at 5:09 PM
My first guess would be to make RichText optional:

RichText.Rule = Text | Link | Template | ExternalLink | RichText + RichText | Empty;

Notice Empty at the end. Watch for grammar errors in this case: I suspect the grammar may become ambiguous because of the "RichText + RichText" clause, in which case you will need to tweak the rules.
Jun 21, 2010 at 7:12 PM

What is "Empty" an instance of? 

Jun 21, 2010 at 7:13 PM

...oops, I get it: Empty is a member of "Grammar".

Jun 21, 2010 at 7:45 PM

Here's what I tried

OptionalRichText.Rule = RichText | Empty;
DelimData.Rule = OptionalRichText
                | OptionalRichText + "|" + DelimData;

Note that by "wrapping" RichText in OptionalRichText, I avoid the obvious problem with putting Empty directly into RichText: because of the "RichText + RichText" branch, the parser could then always find an Empty at the beginning or end of any RichText.
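To check a grammar like this against the four problem inputs, it can help to have an independent statement of which fields each input should yield. The following is a minimal sketch of such a reference splitter, written by hand and not tied to any parser library. The class and method names (TemplateFields, Split) are made up for illustration, and it assumes that only "{{ }}" and "[[ ]]" nest.

```csharp
using System;
using System.Collections.Generic;

public static class TemplateFields
{
    // Splits one template of the form {{...}} into its top-level fields.
    // Nested {{...}} and [[...]] stay intact inside a field, and empty
    // fields are kept as empty strings rather than dropped.
    public static List<string> Split(string template)
    {
        if (!template.StartsWith("{{", StringComparison.Ordinal) ||
            !template.EndsWith("}}", StringComparison.Ordinal) ||
            template.Length < 4)
            throw new ArgumentException("expected text of the form {{...}}");

        string body = template.Substring(2, template.Length - 4);
        var fields = new List<string>();
        int depth = 0;  // nesting depth of {{ }} / [[ ]] inside the body
        int start = 0;  // index where the current field begins

        for (int i = 0; i < body.Length; i++)
        {
            string two = i + 1 < body.Length ? body.Substring(i, 2) : "";
            if (two == "{{" || two == "[[")
            {
                depth++;
                i++;    // skip the second character of the opener
            }
            else if (two == "}}" || two == "]]")
            {
                depth--;
                i++;    // skip the second character of the closer
            }
            else if (body[i] == '|' && depth == 0)
            {
                // A top-level separator ends the current field, even if empty.
                fields.Add(body.Substring(start, i - start));
                start = i + 1;
            }
        }
        fields.Add(body.Substring(start)); // last field, possibly empty
        return fields;
    }

    public static void Main()
    {
        var inputs = new[] { "{{|b|c|d}}", "{{a|b||d}}", "{{a|b|c|}}", "{{a|||d}}" };
        foreach (var input in inputs)
        {
            var quoted = Split(input).ConvertAll(f => "\"" + f + "\"");
            Console.WriteLine(input + " -> [" + string.Join(", ", quoted) + "]");
        }
    }
}
```

Each of the four inputs should come back as exactly four fields, with "" in the empty positions. Comparing that list with the children of DelimData in the parse tree shows quickly whether an empty field was lost.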

Anyway, this seems to work as far as the parse rules go, but I find the behavior of Empty a bit obnoxious. I marked OptionalRichText as a transient node so it wouldn't show up in the parse tree (otherwise I would have to rewrite everything). "Empty" itself seems to be transient, because I'm not actually seeing it in the parse tree. This makes my code that picks templates apart a little uglier, because there are more node structures I need to look at.

So far I've also been avoiding transient nodes, because an important thing my code needs to do is re-serialize parts of the node graph. It would be really convenient if I could hide nodes that I don't care about semantically but that have a visible "sign"
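One possible way around the re-serialization problem is to slice the original input by source position instead of rebuilding text from the tree, so that punctuation hidden from the tree is still reproduced. The sketch below assumes the parser here is Irony (the Grammar.Empty and transient-node terminology suggests so) and that the build in use populates ParseTreeNode.Span for the nodes of interest. The helper class NodeText is hypothetical.

```csharp
using Irony.Parsing;

public static class NodeText
{
    // Returns the exact source text a node was parsed from, including any
    // delimiters that do not appear as nodes in the tree.
    // Assumption: node.Span (start position and length) is filled in for
    // this node; verify this for non-terminal and empty nodes before
    // relying on it.
    public static string Of(ParseTreeNode node, string source)
    {
        int start = node.Span.Location.Position;
        return source.Substring(start, node.Span.Length);
    }
}
```

With something like this, NodeText.Of(templateNode, originalText) would give back the template as written, whatever nodes were marked transient. It only helps for nodes that are re-emitted unchanged; edited subtrees would still need a real serializer.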