Performance issues when parsing large delimited files

Jan 15, 2010 at 9:45 PM

I've been using Irony for a while, and I just started using it on a new project.  The project is basically to parse 2 MB to 12 MB text files that are pipe-delimited. Each line begins with an identifier term and is followed by that identifier type's data.  The next line may be a sub-identifier (i.e. a child node) or another identifier, depending on what its first term is.

Example:

A |   ID1 | data1 | 1 | 12-31-2010

A1 | ID1 | subAdata1 | 23.00

B | ID1 | subBdata1 | text

A2 | ID1 | subAdata2 | 50.00

C | ID2 | 5 |  text

...

Note that A and C are the Identifiers and A1, A2, and B are sub-identifiers of A.
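
To make the layout concrete, here is a minimal Python sketch (not Irony, and not the grammar from this thread) that groups such lines into parent records with their children. The `PARENT_OF` mapping and the `group_records` name are hypothetical; the mapping is inferred only from the example above, and real files presumably have more identifier types.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical mapping inferred from the example: sub-identifier -> parent identifier.
PARENT_OF: Dict[str, str] = {"A1": "A", "A2": "A", "B": "A"}


def group_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per top-level identifier line, with its child lines attached."""
    current = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields: List[str] = [f.strip() for f in line.split("|")]
        tag = fields[0]
        if tag in PARENT_OF:
            # A sub-identifier must follow a parent of the expected type.
            if current is None or current["tag"] != PARENT_OF[tag]:
                raise ValueError(f"sub-identifier {tag!r} without parent {PARENT_OF[tag]!r}")
            current["children"].append(fields)
        else:
            # A new top-level identifier closes the previous record.
            if current is not None:
                yield current
            current = {"tag": tag, "fields": fields[1:], "children": []}
    if current is not None:
        yield current


sample = [
    "A |   ID1 | data1 | 1 | 12-31-2010",
    "A1 | ID1 | subAdata1 | 23.00",
    "B | ID1 | subBdata1 | text",
    "A2 | ID1 | subAdata2 | 50.00",
    "C | ID2 | 5 |  text",
]
records = list(group_records(sample))
```

Because it is a generator that works line by line, a sketch like this never holds the whole file in memory, which can be a handy cross-check against the parse tree for multi-megabyte inputs.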

 

My problem is that I have created a grammar that will parse this; however, on an 11 MB file the Grammar Explorer hangs and never seems to produce a resulting ParseTree.  My grammar has around 170 states. Any ideas on what may be causing my system to hang?  Are there performance issues with Irony if the input is a very large text file?

 

Thanks,

MindCore

Coordinator
Jan 15, 2010 at 11:52 PM

I'm pretty sure that this has nothing to do with Irony's parsing performance itself; it is due to the excessive time it takes Grammar Explorer to fill in the visual control (the parse tree), which probably has thousands of records. Just trace the "Run" button click inside Grammar Explorer and you'll see that it dies when it goes to update the info in the form.

Roman

Coordinator
Jan 15, 2010 at 11:58 PM

So to conclude: don't use Grammar Explorer for big files, only for limited test fragments. Once you have tested the grammar in Grammar Explorer, parse large files from your app, or create a small console app to do this.

Jan 18, 2010 at 3:47 AM

Hey Roman,

Thanks for the quick response. 

It appears to be an issue with my grammar and not Irony.  I believe I have some rules set that are causing some recursion.  I am using the fairly new DsvLiteral terminal, which is a pretty nice, powerful feature.  The trouble seems to stem from my implementation for distinguishing line breaks inside my data from actual line breaks.  In a few of my DsvLiteral terminal definitions, I have a very special set of characters, \n\u0003, that represents a data line break.  So I have set the terminators for these to "\r|" and included the characters \n and \u0003 in the grammar's white-space. I believed that this should work properly; however, I researched further into the construction of the NewLine terminal and see that the character set \r \n is hard-wired.  So my current approach won't work.
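
To illustrate the distinction I am after, here is a rough Python sketch, independent of Irony. It assumes that real record breaks are "\r\n" and that "\n\u0003" only ever appears inside a field; the `split_records` name is just for illustration.

```python
from typing import List

RECORD_BREAK = "\r\n"    # assumed physical end of a record
DATA_BREAK = "\n\u0003"  # line break embedded inside a field's data


def split_records(text: str) -> List[List[str]]:
    """Split text into records on "\r\n", then into pipe-delimited fields."""
    records = []
    for raw in text.split(RECORD_BREAK):
        if not raw.strip():
            continue  # ignore empty trailing chunks
        # Trim only padding spaces, then turn the data-break marker into a plain newline.
        fields = [f.strip(" ").replace(DATA_BREAK, "\n") for f in raw.split("|")]
        records.append(fields)
    return records


sample = "A | ID1 | first\n\u0003second | 1\r\nB | ID1 | text\r\n"
parsed = split_records(sample)
```

Since an embedded "\n\u0003" is never preceded by "\r", splitting on "\r\n" leaves it inside the field, where it can then be converted to an ordinary line break.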

Do you have any suggestions for a recommended approach?  I believe I may need to create a modified version of the NewLine terminal that only looks for the \r character, or that checks whether \n or \r are defined as white-space.

Thanks,

MindCore

Jan 18, 2010 at 3:39 PM

Roman,

After taking another look at the NewLineTerminal, I see that I am overcomplicating this. The NewLineTerminal only triggers when the current character is \r and the preview character is \n.  This means that it will not trigger on my special character set.  Originally, I thought that it triggered when either one or both of the characters occurred.

Sorry for any confusion. I'll plug away at what I have and isolate my original issue: the recursion.

Thanks,

MindCore