Parser performance -- any tips?

Feb 16, 2010 at 11:56 PM

Hi, I've created a grammar for a very basic language. Basically you can write: {someguid1} [, dist({someguid2}[ , : {someguid3}] )].

The problem is that it's taking roughly 15ms to generate the parse tree which is unacceptable as there are a large number (up to ~60000) strings to parse. Typically the number of terminals within the string won't be very large (generally < 5, possibly up to 50).

I'll be running a profiler over this in the coming days but thought I'd ask if there are any obvious tips before I begin?

Thanks in advance.


Imports Irony.Parsing

Namespace Grammars
    <Language("Qubit LocSpec GUID Format", "1.0", "Grammar for the definition of a persisted location specification within a dimension.")> _
    Public Class clsLocSpecGuidGrammar
        Inherits clsQubitGrammar

#Region "Constuctors"

        ' Matches a GUID (e.g. '{00000000-0000-0000-0000-000000000000}')
        Private Const GuidRegex As String = "\{[0-9a-f]{8}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{4}\-[0-9a-f]{12}\}"

        Private Const StatementsIdentifier As String = "stmts"
        Private Const StatementIdentifer As String = "stmt"
        Private Const GuidsIdentifier As String = "guids"

        Public Const GuidIdentifier As String = "guid"
        Public Const ValueDistributionIdentifer As String = "valueDist"
        Public Const VariableDistributionIdentifer As String = "variableDist"
        Public Const VariableIdentifier As String = "variable"

        Public Sub New()
            MyBase.New(False) ' False means case insensitive

            GrammarComments = "Expressions for persisted dimension location specifications."

            Dim stmts = New NonTerminal(StatementsIdentifier)
            Dim stmt = New NonTerminal(StatementIdentifer)

            Dim valueDist = New NonTerminal(ValueDistributionIdentifer)
            Dim variableDist = New NonTerminal(VariableDistributionIdentifer)

            Dim guids = New NonTerminal(GuidsIdentifier)

            Dim guid = New RegexBasedTerminal(GuidIdentifier, GuidRegex)

            Dim variable = New RegexBasedTerminal(VariableIdentifier, GuidRegex)

            Dim dist As KeyTerm = ToTerm("dist")
            Dim leftBracket As KeyTerm = ToTerm("(")
            Dim rightBracket As KeyTerm = ToTerm(")")
            Dim comma As KeyTerm = ToTerm(",")
            Dim colon As KeyTerm = ToTerm(":")

            ' Specify the non-terminal which is the root of the AST.
            Root = stmts

            stmts.Rule = MakePlusRule(stmts, comma, stmt)
            stmt.Rule = valueDist Or variableDist Or guid

            guids.Rule = MakePlusRule(guids, comma, guid)

            variableDist.Rule = dist + leftBracket + guids + colon + variable + rightBracket
            valueDist.Rule = dist + leftBracket + guids + rightBracket

            RegisterPunctuation(dist, leftBracket, rightBracket, comma, colon)

            LanguageFlags = LanguageFlags.CreateAst
        End Sub

#End Region

    End Class

End Namespace

Feb 17, 2010 at 1:27 AM

I think it's obvious - use of Regex terminal. It eats all your processing time, regex's are inherently slow. Write custom Guid terminal.

Feb 17, 2010 at 4:31 PM

As a second thought - you might create your custom GuidTerminal by sub-classing it from DataTerminalBase and override method ConvertValue.

Feb 18, 2010 at 1:11 AM

I only just noticed your second suggestion -- and I haven't tried it -- although I don't imagine it would speed things up? Also, did you mean DataLiteralBase or is there a new class that released recently that I don't have in my build?

I've tested the RegexBasedTerminal against the (VERY simple - assume that { is disallowed unless it's a valid GUID) GuidTerminal below and the difference is basically identical (35626ms using RegexBasedTerminal, 35895 using the GuidTerminal). I'm still planning on pulling out a profiler when I get a chance (this is actually a completed piece of functionality that could probably be improved.)

        Public Class GuidTerminal
            Inherits Terminal

            Public Sub New(ByVal name As String)
            End Sub

            Public Overloads Overrides Function TryMatch(ByVal context As ParsingContext, ByVal source As ISourceStream) As Token
                If (source.PreviewChar <> "{"c) Then
                    Return Nothing
                End If

                source.PreviewPosition += 38 ' Length of a GUID.

                Return source.CreateToken(OutputTerminal)
            End Function

        End Class

Feb 18, 2010 at 5:07 PM

How are you measuring performance? what are these numbers you provide? Are you measuring the time in Grammar Explorer or some similar tool with UI? Keep in mind that you may be measuring time it takes to fill the tree/listbox/grid - whatever you use for display. 35 seconds seem like very long time, on average Irony parses at 10k lines per second for complex grammar like c#, unless you are parsing 500k lines file. Try it in grammar explorer - it shows you pure parsing time, without time spent on updating UI.


Feb 28, 2010 at 11:48 PM

Hi. Firstly, apologies for the slow reply.

I was performing the timing using a Stopwatch which was started directly before the .Parse() and stopped directly afterwards. What was interesting to note was that the times seemed to fluctuate massively, even though the input was relativey the same (the input is a actually a large number of small fragments (up to ~60, 000), rather than one large fragment).

After some more research it appeared that the problem only happened in debug mode while referencing a release DLL, which is very strange behaviour. For the moment I've added the Irony project to our solution which is an immediate fix - it works fast in debug and is compiled as a release DLL.

When I get some more time I'll certainly be looking into this more as it seems like very, very strange behaviour.

Thanks for all your assistance. The parsing times are now basically negligible with respect to the other processing that takes place.

Mar 2, 2010 at 5:18 PM


I think I know what the problem is. The key here is your phrase that the input is large number of small fragments rather than one big file. If you look at SourceStream class, you'll see that there are two different implementations of CurrentChar method - for  debug and release. In release mode I don't check for boundaries but allow IndexOutOfRangeException to be thrown. The idea was that it happens once at the end of (supposedly) long file, so it would be better for performance to let it happen but in return to gain on avoiding the boundaries check. Now I see that in situation like yours it can really hurt. I will probably get rid of release version and switch to always use "Debug" variant, with boundary checks.

thanks for the hint


Mar 3, 2010 at 1:51 AM

Hi Roman,

The funny thing is I read your discussion with Kirill Osenkov (him asking about live geometry/Silverlight) and did look at the code and saw the #DEBUG/#RELEASE stuff. What I'm confused about is the fact that it was only slow when referencing a release DLL in a debug environment, as opposed to a release DLL in a release environment (which was fine). AFAIK the conditionals should be evaluated when the DLL is compiled -- thus there should be no difference whatsoever if the DLL is referenced from within a debug environment or a release environment?

When I get some time I'll look into it further.



Jun 21, 2010 at 6:05 PM

Hi Roman,

I stumbled upon this #DEBUG/#RELEASE discussion and decided to add my '2 cents'.

Setting try/catch handler in release version adds more processor cycles then simple IF with boundary check in debug one, not to mention up to 100K+ burned cycles it takes .NET to throw/catch even a single exception.

Using simpler code that looks like .. return PreviewPosition < Text.Length ? Text[PreviewPosition] : '\0'; will provide best results in this situation, (use < instead of >= for more frequent outcome, as it is better aligned with CPU branch predictor policies).

Cheers, Victor.