How to parse HTML text content using irony?

Sep 20, 2013 at 2:50 PM
Hi,

I am trying to create an Irony grammar to parse HTML text. The following is the code I am using inside the grammar class.
public class HTMLGrammar : Irony.Parsing.Grammar
{
        public HTMLGrammar()
        {
            KeyTerm leftAnguralBracket = new KeyTerm("<","LeftAngularBarakcet");
            KeyTerm rightAnguralBracket = new KeyTerm(">", "RightAngularBarakcet");
            KeyTerm forwardSlash = new KeyTerm("/", "ForwardSlash");
            NonTerminal element = new NonTerminal("Element");
            NonTerminal emptyElementTag = new NonTerminal("EmptyElementTag");
            NonTerminal startTag = new NonTerminal("StartTag");
            NonTerminal content = new NonTerminal("Content");
            NonTerminal endTag = new NonTerminal("EndTag");
            RegexBasedTerminal name = new RegexBasedTerminal("Name", "\\w+");

            element.Rule = emptyElementTag | startTag + content + endTag;
            emptyElementTag.Rule = leftAnguralBracket + name + forwardSlash + rightAnguralBracket;
            startTag.Rule = leftAnguralBracket + name + rightAnguralBracket;
            endTag.Rule = leftAnguralBracket + forwardSlash + name + rightAnguralBracket;
            content.Rule = MakeListRule(content, element, element);
            this.Root = element;
        }
}
When I use the above grammar class to parse the sample HTML text, it fails to identify the first end tag.

Sample text : <html><html></html></html>

Irony successfully identifies the first 8 tokens, up to the forward slash of the first end tag, and then fails to identify the name token inside the end tag. It gives the error "Syntax Error, Expected : Name", even though the name is actually there in the given HTML text.

I am not sure what changes are required in the grammar to parse the text successfully.


Any help is appreciated.

Thanks
Arun Malik
Coordinator
Sep 20, 2013 at 4:42 PM
Are you trying it in Grammar Explorer? If yes, are there any grammar errors/conflicts? If not, do that first.
Sep 20, 2013 at 5:52 PM
Hi,

I have tried it in Grammar Explorer as well. The following is the parser trace from Grammar Explorer. I cannot work out the required grammar changes from the parser trace.

[Image: Grammar Explorer parser trace]


Thanks
Coordinator
Sep 20, 2013 at 5:58 PM
  1. Does Grammar Explorer show any errors on the Grammar Errors page?
  2. It seems the '</' combination should be declared as one terminal; this will likely solve your problem. It would also be more consistent with the HTML standard, so that '</' cannot contain any spaces inside.
Sep 20, 2013 at 6:37 PM
Edited Sep 20, 2013 at 7:04 PM
  1. There are no grammar errors.
  2. Even after combining '</' into one terminal, parsing still fails. The updated grammar class is given below.
Grammar Explorer Parse Trace :

[Image: Grammar Explorer parser trace]

Updated Grammar Class :
public class HTMLGrammar : Irony.Parsing.Grammar
    {
        public HTMLGrammar()
        {
            KeyTerm leftAnguralBracket = new KeyTerm("<","LeftAngularBarakcet");
            KeyTerm rightAnguralBracket = new KeyTerm(">", "RightAngularBarakcet");
            KeyTerm leftAngularBracketEndTag = new KeyTerm("</", "LeftAngularBracketEndTag");
            KeyTerm rightAngularBracketEndTag = new KeyTerm("/>", "RightAngularBracketEndTag");


            NonTerminal element = new NonTerminal("Element");
            NonTerminal emptyElementTag = new NonTerminal("EmptyElementTag");
            NonTerminal startTag = new NonTerminal("StartTag");
            NonTerminal content = new NonTerminal("Content");
            NonTerminal endTag = new NonTerminal("EndTag");
            RegexBasedTerminal name = new RegexBasedTerminal("Name", "\\w+");

            element.Rule = emptyElementTag | startTag + content + endTag;
            emptyElementTag.Rule = leftAnguralBracket + name  + rightAngularBracketEndTag;
            startTag.Rule = leftAnguralBracket + name + rightAnguralBracket;
            endTag.Rule = leftAngularBracketEndTag + name + rightAnguralBracket;
            content.Rule = MakeListRule(content, element, element);

            this.Root = element;
        }
    }
  3. Grammar Explorer generates the following parser states from the grammar.
State S0
Shift items:
Element' -> ·Element EOF 
Element -> ·EmptyElementTag 
EmptyElementTag -> ·LeftAngularBarakcet Name RightAngularBracketEndTag 
Element -> ·StartTag Content EndTag 
StartTag -> ·LeftAngularBarakcet Name RightAngularBarakcet 
Transitions: Element->S1, EmptyElementTag->S2, LeftAngularBarakcet->S3, StartTag->S4

State S1
Shift items:
Element' -> Element ·EOF 
Transitions:

State S2
Reduce items:
Element -> EmptyElementTag ·
Transitions:

State S3
Shift items:
EmptyElementTag -> LeftAngularBarakcet ·Name RightAngularBracketEndTag 
StartTag -> LeftAngularBarakcet ·Name RightAngularBarakcet 
Transitions: Name->S6

State S4
Shift items:
Element -> StartTag ·Content EndTag 
Content -> ·Content Element Element 
Content -> ·Element 
Element -> ·EmptyElementTag 
EmptyElementTag -> ·LeftAngularBarakcet Name RightAngularBracketEndTag 
Element -> ·StartTag Content EndTag 
StartTag -> ·LeftAngularBarakcet Name RightAngularBarakcet 
Transitions: Content->S7, Element->S8, EmptyElementTag->S2, LeftAngularBarakcet->S3, StartTag->S4

State S5
Reduce items:
Element' -> Element EOF ·
Transitions:

State S6
Shift items:
EmptyElementTag -> LeftAngularBarakcet Name ·RightAngularBracketEndTag 
StartTag -> LeftAngularBarakcet Name ·RightAngularBarakcet 
Transitions: RightAngularBracketEndTag->S9, RightAngularBarakcet->S10

State S7
Shift items:
Element -> StartTag Content ·EndTag 
EndTag -> ·LeftAngularBracketEndTag Name RightAngularBarakcet 
Content -> Content ·Element Element 
Element -> ·EmptyElementTag 
EmptyElementTag -> ·LeftAngularBarakcet Name RightAngularBracketEndTag 
Element -> ·StartTag Content EndTag 
StartTag -> ·LeftAngularBarakcet Name RightAngularBarakcet 
Transitions: EndTag->S11, LeftAngularBracketEndTag->S12, Element->S13, EmptyElementTag->S2, LeftAngularBarakcet->S3, StartTag->S4

State S8
Reduce items:
Content -> Element ·
Transitions:

State S9
Reduce items:
EmptyElementTag -> LeftAngularBarakcet Name RightAngularBracketEndTag ·
Transitions:

State S10
Reduce items:
StartTag -> LeftAngularBarakcet Name RightAngularBarakcet ·
Transitions:

State S11
Reduce items:
Element -> StartTag Content EndTag ·
Transitions:

State S12
Shift items:
EndTag -> LeftAngularBracketEndTag ·Name RightAngularBarakcet 
Transitions: Name->S14

State S13
Shift items:
Content -> Content Element ·Element 
Element -> ·EmptyElementTag 
EmptyElementTag -> ·LeftAngularBarakcet Name RightAngularBracketEndTag 
Element -> ·StartTag Content EndTag 
StartTag -> ·LeftAngularBarakcet Name RightAngularBarakcet 
Transitions: Element->S15, EmptyElementTag->S2, LeftAngularBarakcet->S3, StartTag->S4

State S14
Shift items:
EndTag -> LeftAngularBracketEndTag Name ·RightAngularBarakcet 
Transitions: RightAngularBarakcet->S16

State S15
Reduce items:
Content -> Content Element Element ·
Transitions:

State S16
Reduce items:
EndTag -> LeftAngularBracketEndTag Name RightAngularBarakcet ·
Transitions:
Thanks
Coordinator
Sep 20, 2013 at 7:59 PM
Change the content rule to:

content.Rule = MakeStarRule(content, element);
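
For context, here is a minimal sketch of how that change might fit into the grammar from this thread. In the parser states listed above, MakeListRule(content, element, element) expanded to "Content -> Content Element Element" and "Content -> Element", so Content could never be empty and the inner <html></html> pair had nothing to match between its tags. MakeStarRule(content, element) instead allows zero or more elements. The class name HtmlGrammarSketch, the shortened variable and terminal names, and the Demo driver are assumptions made for illustration and are not part of the original posts; it also assumes the Irony assembly is referenced.

```csharp
using System;
using Irony.Parsing;

// Hypothetical name; mirrors the updated grammar posted earlier in the thread.
public class HtmlGrammarSketch : Grammar
{
    public HtmlGrammarSketch()
    {
        // Terminals: "</" and "/>" are single terminals, as suggested above.
        KeyTerm lt = new KeyTerm("<", "LeftAngularBracket");
        KeyTerm gt = new KeyTerm(">", "RightAngularBracket");
        KeyTerm ltSlash = new KeyTerm("</", "LeftAngularBracketEndTag");
        KeyTerm slashGt = new KeyTerm("/>", "RightAngularBracketEndTag");
        RegexBasedTerminal name = new RegexBasedTerminal("Name", "\\w+");

        NonTerminal element = new NonTerminal("Element");
        NonTerminal emptyElementTag = new NonTerminal("EmptyElementTag");
        NonTerminal startTag = new NonTerminal("StartTag");
        NonTerminal content = new NonTerminal("Content");
        NonTerminal endTag = new NonTerminal("EndTag");

        element.Rule = emptyElementTag | startTag + content + endTag;
        emptyElementTag.Rule = lt + name + slashGt;
        startTag.Rule = lt + name + gt;
        endTag.Rule = ltSlash + name + gt;

        // The fix: zero or more child elements, with no delimiter between them.
        content.Rule = MakeStarRule(content, element);

        this.Root = element;
    }
}

// Hypothetical driver: parses the sample text from the question
// and prints any parser messages.
public static class Demo
{
    public static void Main()
    {
        Parser parser = new Parser(new HtmlGrammarSketch());
        ParseTree tree = parser.Parse("<html><html></html></html>");

        Console.WriteLine("Has errors: " + tree.HasErrors());
        foreach (var message in tree.ParserMessages)
        {
            Console.WriteLine(message.Location + ": " + message.Message);
        }
    }
}
```

If elements should be required inside a tag pair, MakePlusRule(content, element) would be the one-or-more counterpart, but that would again reject an empty pair such as <html></html>.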