Heads up - working on new stuff

Coordinator
Feb 20, 2013 at 6:07 AM
Edited Feb 20, 2013 at 6:16 AM
Just to give some info on what's happening. I'm still here, life continues, and Irony development continues - now more actively.
I'm back to regular work on Irony, and currently looking at a substantial redesign in almost all areas. My focus is a GLR parser (so that we can finally ask the parser to resolve conflicts automatically by exploring all possible paths). I'm also looking at rebuilding the Scanner - moving away from the old, traditional char-by-char model and doing something more modern (scanning/chopping the entire text), making it possible to 'replace' whole fragments of source text in things like macros or source rewriting in LINQ. Another focus is parsing not-quite-complete/correct sources (for intellisense), hopefully using some theoretical foundation (suffix parsing). If you're interested in knowing more, I use "Parsing Techniques" by Grune and Jacobs as a source of theoretical algorithms, so read chapters 11 and 12 about GLR parsing and suffix parsing.
One of the challenges is to make the changes as backward-compatible as possible, so that old grammars will continue to work with no or only minimal changes. I hope this can be done.
Also, a move to VS 2012 is coming, hopefully in the next source push.

And if you're in database apps development - keep an eye on my other project VITA (vita.codeplex.com) - I will soon push new code, with a BIG set of COOL and UNIQUE features (like an authorization framework). I'm trying to split my time between these two open-source babies - and of course my offline, real-life children (twins, 6 yo, lots of fun and trouble:).

Thank you and stay in touch
Roman
Coordinator
Feb 20, 2013 at 6:58 AM
Edited Feb 20, 2013 at 8:15 AM
Forgot to mention the timeline. Hard to predict; I hope it will take no more than two to three months.

And I need some help from you guys. If there's somebody out there with some expertise in .NET internals, performance tuning and a passion for assembly code - here's the task.
What is the most efficient way in C# to find, in a given string, the first location of any prefix from a given set of strings?
(One straightforward solution is the String.IndexOfAny() method with the first chars of the strings to search for. But is it the fastest one?)

Here's a more detailed description.
There's a string in memory (ex: a 500-line C# source file). Starting from some position P, find the first occurrence of any word from a set of words (ex: all special symbols in C# like { } ; + += ++ etc., or /* - the comment start symbol). What is the fastest way of doing this in C#/.NET 4.5?
Note that this is not a question about a theoretical algorithm, but about concrete code, with particular classes and methods (hash sets vs lists vs arrays and stuff like that).

This will be used in the new Scanner for the initial pre-scanning/chopping of the source file into pieces, using 'prefixed' terminals like string literals, comments or any special symbols - which are fragments by themselves.
Currently I'm using string.IndexOfAny, grouping prefixes by first char, then searching for first chars, then trying to match all prefixes starting with this char. Each of these steps should be counted as well, so the best algorithm should end up with the matching prefix (or even the Terminal having this prefix?)

thanks
Roman
Developer
Feb 20, 2013 at 7:02 PM
Edited Feb 21, 2013 at 9:19 PM
Hi Roman!

That's absolutely great!
Very glad to hear that Irony is back into active development :)
GLR parsing is quite an interesting area to explore.
 
Currently I'm using string.IndexOfAny, grouping prefixes by first char, then
searching for first chars, then trying to match all prefixes starting with this char.
 
Can you post your current code as a starting point for further optimization?
My straightforward implementation based on your description is as follows:
using System;
using System.Linq;

// Usage example:
// var text = "Quick Brown Fox Jumps Over The Lazy Dog";
// var index = text.IndexOfAny(3, "Over", "Fox", "The"); // returns 12

public static class StringExtensions
{
  public static int IndexOfAny(this string sourceCode,
    int position, params string[] prefixes)
  {
    // group the prefixes by their first character, and collect the distinct first chars
    var strings = prefixes.Where(s => s != null && s.Length > 0).ToLookup(s => s[0]);
    var chars = strings.Select(g => g.Key).ToArray();

    while (position < sourceCode.Length)
    {
      var index = sourceCode.IndexOfAny(chars, position);
      if (index < 0)
      {
        return index;
      }

      // a candidate first char was found; try each prefix starting with it
      var @char = sourceCode[index];
      foreach (var s in strings[@char])
      {
        if (string.CompareOrdinal(sourceCode, index + 1, s, 1, s.Length - 1) == 0)
        {
          return index;
        }
      }

      position = index + 1;
    }

    return -1;
  } 
}
Coordinator
Feb 21, 2013 at 8:03 AM
yes, you are on the right track. Just a few things to emphasize. All these preparations with strings/Where, chars/Select - these should be factored out, and assumed done before we start the process, at construction stage. List of prefixes/terminals is derived from grammar, and remains the same thru the whole process. So you need to figure out what 'optimal' structures to prepare. Another thing - prefixes may be substrings of each other. Like in your example you may have 'The' and 'Them' prefix - so the prefixes should be sorted in ascending length order, so that longest are checked first.
Another thing - I think you know it, but just to make sure.. In the search process itself, you should use fastest versions of operations - so for example, use 'for' loops instead of 'foreach' - as you do in 'foreach(var s in strings[@char])...'.
Now, as an example of a place to look at. You do string.CompareOrdinal to compare entire prefix once you matched first char. One trick would be to try to compare the second char of prefix (if it has one), before trying to compare entire prefix. I used a structure like this:


class TermLookupInfo {
  public Char FirstChar;
  public Char SecondChar; // 0 for 1-len Prefix
  public string Prefix;
  public Terminal Terminal;
}
Dictionary<char, List<TermLookupInfo>> TermLookupTable = new Dictionary<char, List<TermLookupInfo>>();

I look up the TermLookupInfo by the found char, then check the second-char match, then finally check the Prefix; if all passes, I call Terminal.TryMatch. The point is that reading a char at a position might be faster than comparing substrings when rejecting a non-match. Notice I put List<TermLookupInfo> as the target value in the dictionary - it should be an array in the final version, which is faster than a list for iterating and lookups.
So the goal is to design a structure like TermLookupInfo and a lookup table, and to figure out how to use them efficiently.
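A minimal sketch of that lookup step, using the TermLookupInfo/TermLookupTable declared above (the table construction and the outer scan loop are assumed to happen elsewhere; the real scanner would then call Terminal.TryMatch):

TermLookupInfo FindPrefixAt(string source, int position) {
  List<TermLookupInfo> infos;
  if (!TermLookupTable.TryGetValue(source[position], out infos))
    return null; // no terminal starts with this char
  for (int i = 0; i < infos.Count; i++) {
    var info = infos[i];
    // cheap second-char check before comparing the whole prefix
    if (info.SecondChar != 0 &&
        (position + 1 >= source.Length || source[position + 1] != info.SecondChar))
      continue;
    // full prefix check; guard against running past the end of the source
    if (position + info.Prefix.Length <= source.Length &&
        string.CompareOrdinal(source, position, info.Prefix, 0, info.Prefix.Length) == 0)
      return info; // the caller would invoke info.Terminal.TryMatch here
  }
  return null;
}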

Here are items to think about:
  1. What about creating a char array (sourceCode.ToCharArray()), and then working only with char array? Perf gain/loss depends on what's going on inside string.IndexOfAny, but if it is a plain loop over characters (not some special CPU 'scan' command) - then maybe it's worth doing it directly. Again, many repeated searches will be progressively made over the same source string, so we may do some preparation with it as well.
  2. lookup table - should it be a dictionary? Dictionary lookup is complex operation; hashing the char (how expensive it is?), then deriving the index and doing 'bucket' lookup in internal array(s). What about a plain array of 256 objects, indexed by lower byte of the character? So we just join in one bucket all infos for chars that have the same minor byte. Then table lookup might be much faster.
  3. String case matching must be configurable, depending on the language's case sensitivity - how does this impact the solution?
Coordinator
Feb 21, 2013 at 8:12 AM
I think a good way to go is to use C# parsing as a test case.
You have a C# source code file. You have a set of Terminals from the grammar that have explicit prefixes (they return a non-empty list in GetFirsts()):
  1. Comment terminals
  2. Literal strings and chars
  3. All special symbols (operators, braces, brackets etc) - Key terms that are not keywords (do not start with letter)
  4. Compiler directives (start with #)
  5. NewLine symbol - need to detect this to have line counter (for SourceLocation)
Build a solution for a progressive scan of the entire source file. Preparation time does not matter. The goal is to locate all Terminals from the above groups, based on their declared prefixes. The scan is progressive: we move forward, and at each step we need to find the position, starting from currentStartPos, that has a prefix, and return a Terminal for this prefix.
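A rough sketch of the driver loop this describes (FindNextPrefix and PrefixMatch below are assumed helpers for illustration, not existing Irony types):

class PrefixMatch {
  public int Position;
  public string Prefix;
  public object Terminal; // stands in for the Irony Terminal reference
}

void PreScan(string source) {
  int currentStartPos = 0;
  while (currentStartPos < source.Length) {
    PrefixMatch match = FindNextPrefix(source, currentStartPos);
    if (match == null)
      break; // the rest is raw, unscanned text (identifiers, numbers, whitespace)
    // source[currentStartPos .. match.Position) is an unscanned raw fragment;
    // match.Terminal produces the token for the prefixed fragment at match.Position
    currentStartPos = match.Position + match.Prefix.Length;
    // (in reality the matched terminal would consume further, e.g. to the closing
    // quote of a string literal or to the end of a comment)
  }
}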
Coordinator
Feb 21, 2013 at 8:20 AM
This is the first-step preScan, where we want to chop the entire file into fragments containing Tokens for discovered prefixed terminals, with pieces of unscanned raw text (containing other terminals like identifiers) between them.
I consider this task of finding the fastest scanner to be kind of important. There have been some comments out there that Irony's parsing performance is not quite at the top, and that some other parsers outperform it. True or not, I would like to have the fastest possible one now. Mostly, performance is determined by the Scanner - that's where the parser spends most of its time.
So let's build a scanner that is the fastest, and kicks ass of all competing toolsets!
Developer
Feb 21, 2013 at 9:31 PM
All these preparations with strings/Where, chars/Select - these should be factored out, and assumed done before we start the process, at construction stage. List of prefixes/terminals is derived from grammar, and remains the same thru the whole process. So you need to figure out what 'optimal' structures to prepare.
 
Ah, that's for sure! Of course I shouldn't have implemented it as a general-purpose method accepting an array of strings.
 
Another thing - prefixes may be substrings of each other. Like in your example you may have 'The' and 'Them' prefix - so the prefixes should be sorted in ascending length order, so that longest are checked first.
 
Yes, I overlooked that! But the order should be descending :)
 
Another thing - I think you know it, but just to make sure.. In the search process itself, you should use fastest versions of operations - so for example, use 'for' loops instead of 'foreach' - as you do in 'foreach(var s in strings[@char])...'.
 
About loops — that's certainly true for many cases (like iterating over an array of chars or bytes), but it's not always so. In most cases, the performance is almost identical. I think somebody blogged about that (Jon Skeet perhaps?).

When I posted my solution, I had already tried a few simple optimizations like replacing LINQ methods with custom loops, rewriting 'foreach' loops as 'for' and so on, but the differences were negligible (i.e., within measurement error).
 
I used a structure like this:
 
Can you post the complete code so we can benchmark our versions against yours? I tried to implement something similar to what you suggest, but to my surprise it doesn't outperform my first naive approach.

Here is the result:
Time elapsed: 00:00:05.9503524, single iteration: 0,0059503524 ms // naive version
Time elapsed: 00:00:05.8370052, single iteration: 0,0058370052 ms // optimized version

And here is the code for my benchmark:
https://gist.github.com/yallie/5008959
Coordinator
Feb 21, 2013 at 9:37 PM
hmmm... look here:

// pre-setup for #2
var index = RootIndex.FromStrings(prefixes);
Benchmark("#2", iterations, () => sourceCode.IndexOfAny(3, chars, strings));

should the last parameter be 'index'?!
no wonder results are identical :)
Developer
Feb 21, 2013 at 9:43 PM
Damn, what a typo! :)

New results, compiled with /debug:

Time elapsed: 00:00:06.5206798, single iteration: 0,0065206798 ms
Time elapsed: 00:00:05.5773253, single iteration: 0,0055773253 ms

Compiled with /o:

Time elapsed: 00:00:05.7961817, single iteration: 0,0057961817 ms
Time elapsed: 00:00:04.7258132, single iteration: 0,0047258132 ms

That's about 15%...
Developer
Feb 21, 2013 at 9:46 PM
Again, many repeated searches will be progressively made over the same source string,
so we may do some preparation with it as well.
 
Why do many searches?
To me, it looks like prefixed terminals can be extracted in a single pass.
Many calls to IndexOfAny, but each call processes its own range of text.
 
What about a plain array of 256 objects
 
Looks promising, but we really must experiment with that. Dictionary<T> is an optimized beast.
Coordinator
Feb 21, 2013 at 9:50 PM
About my version - here's the line:

// TODO: see if this can be optimized.

it's not much different from your original version, so nothing new to benchmark against

I'll look at your code later. one thing I see is that you repeat 10K times the same search from position 3, so you hit the same char/prefix each time - not very representative IMHO, it is not progressive scan. In your optimized version, you have a hashset and a dict for lookups - I don't think you really win much, 'cause dict lookups are quite expensive.
Coordinator
Feb 21, 2013 at 9:58 PM
'many repeated searches' - I meant repeated calls to this 'findNext' method from different, increasing position values.
Dictionary is an optimized beast - it may be the best dict in its class, but it does have non-reducible overhead, so the question is - can we do something better? For example, it is known that for small sizes direct search in List is faster than dict lookup - so there's HybridDictionary in .NET - it is list for small counts, and then dynamically switches to real dict. (http://msdn.microsoft.com/en-us/library/system.collections.specialized.hybriddictionary.aspx). Something to think about.
Or, for example, you have Hashset<char> SingleChars, to check if char 'belongs'. What about joining all chars in a string and doing string.Contains(ch)?
Many things like that... like working with source text as array of chars...
Developer
Feb 22, 2013 at 5:44 AM
Edited Feb 22, 2013 at 5:45 AM
one thing I see is that you repeat 10K times the same search from position 3,
so you hit the same char/prefix each time - not very representative IMHO, it is not progressive scan.
 
Yes, I agree.

I've modified my methods to return a Tuple<int, int> (position + length) and rewritten the benchmark to run progressive scans iteratively. Now my 'optimized' version performs even slower than the simple version. What a crap! :)
 
For example, it is known that for small sizes direct search in List is faster than dict lookup -
so there's HybridDictionary in .NET - it is list for small counts, and then dynamically switches to real dict.
 
I used to use HybridDictionary back in the .NET 1.0 days, but it still doesn't have a generic counterpart. The non-generic version with char keys would cause boxing overhead, which is why I didn't consider using it.
 
Or, for example, you have Hashset<char> SingleChars, to check if char 'belongs'.
What about joining all chars in a string and doing string.Contains(ch)?
 
No, I tried that. HashSet<char>.Contains(c) performs better than string.Contains(c), especially on large sets:

String, small. Time elapsed: 00:00:01.7959951, single iteration: 0,00017959951 ms
String, large. Time elapsed: 00:00:07.0341469, single iteration: 0,00070341469 ms
HashSet<char>, small. Time elapsed: 00:00:01.0829801, single iteration: 0,00010829801 ms
HashSet<char>, large. Time elapsed: 00:00:01.0054292, single iteration: 0,00010054292 ms

Here is a benchmark for HashSet<char>:
https://gist.github.com/yallie/5011260
Coordinator
Feb 22, 2013 at 7:15 AM
Edited Feb 22, 2013 at 7:20 AM
man, this hash/string test is quite f-d up, to the point of being meaningless. You use lambda to pass 'expression under test', and then call it in iteration loop. It is not just lambda, it's a whole closure, so you have an extra call/return included into the test. Plus, inside lambda you do a few more operations (devise index and make char lookup) - all this sh.t is comparable in terms of execution time to the function under test. So hell knows what you actually test and what results mean. You are trying to measure tiny piece, maybe a few instructions long - not a remote call to database. So every IL instruction matters.
In benchmarking, unlike in regular coding, repeating yourself and copy/pasting is better, and sometimes the only way to go. You should directly repeat 10 or even 20 simple calls inside the loop:

for (int i = 0; i < iterCount / 20; i++) {
  v1 = str.Contains(c1);
  v2 = str.Contains(c2);
  // . . . .
  v20 = str.Contains(c20);
}
that might give you some realistic numbers. Fanciness and elegance are of no use in benchmarking, sorry.

About HybridDictionary - I know it's old and ugly, and has no generic version. I did not suggest using it, but tried to illustrate the point - for small sets, direct search in a list is faster than dict lookup by hash code. I expect a lot of the sets in our case to be small, so this should be considered.
thanks for the effort, let's move on, we'll crack it, eventually
Roman

EDITED: fixed the sample code to use string.Contains.
Developer
Feb 22, 2013 at 7:43 AM
Edited Feb 22, 2013 at 7:44 AM
man, this hash/string test is quite f-d up, to the point of being meaningless.
 
Haha, bullshit :)
I know it has a little bit of overhead but it's not practically noticeable.
 
You use lambda to pass 'expression under test', and then call it in iteration loop. It is not just lambda, it's a whole closure,
 
Right, lambda is always a whole closure. The closure is created only once, when my Benchmark method is called. In the loop it's just a non-virtual call, not a disaster.
 
Plus, inside lambda you do a few more operations (devise index and make char lookup) - all this sh.t is comparable in terms of execution time to the function under test.
 
String.Contains(c) compared to indexed array access plus a % is a monster. Hundreds of CPU instructions compared to a few ones.
 
So hell knows what you actually test and what results mean. You are trying to measure tiny piece, maybe a few instructions long - not a remote call to database. So every IL instruction matters. In benchmarking, unlike in regular coding, repeating yourself and copy/pasting is better, and sometimes the only way to go. You should directly repeat 10 or even 20 simple calls inside the loop:
 
Well, here is the updated version, perhaps it might convince you:
https://gist.github.com/yallie/5011774

(I guess you might nitpick about calling IEnumerable<char>.Contains for both the string and the HashSet<char>, but I'm almost certain that both the string and HashSet<T> classes don't implement this method explicitly, so it's exactly the same thing as a native method call).

The results are almost the same (HashSet runs from 2 to 10 times faster than string):

String, small Time elapsed: 00:00:01.2767976, single iteration: 0,012767976 ms
String, large Time elapsed: 00:00:05.2250165, single iteration: 0,052250165 ms
HashSet<char>, small Time elapsed: 00:00:00.5833501, single iteration: 0,005833501 ms
HashSet<char>, large Time elapsed: 00:00:00.5777248, single iteration: 0,005777248 ms
Coordinator
Feb 22, 2013 at 8:36 AM
Edited Feb 22, 2013 at 8:39 AM
i think this is even bigger non-sense, sorry. You cast string and hashset to IEnumerable<char> ???!!! And you still think that with this conversion, Hashset still uses lookup by Hash code?!! 'Contains' now comes from LINQ extensions - check it! In both cases, objects are converted to sequence, and converted means new enumerator created, which runs through list SEQUENTIALLY, essentially serializing the set. Don't forget extra multi-step call to enumerator (.MoveNext, current) which adds overhead in access - in addition to conversion to a 'sequence' per se. You are measuring efficiency of derived enumerators, not core objects functionality.

Man, just copy the method, and make one accept string, another hashset.

And let's try the third test - try sending charset as char[] array, with explicit loop (in a separate function for now).
Developer
Feb 22, 2013 at 8:49 AM
i think this is even bigger non-sense, sorry. You cast string and hashset to IEnumerable<char> ???!!!
 
Nope. I don't cast anything, I just call method via interface. It's a BIG difference.
 
And you still think that with this conversion, Hashset still uses lookup by Hash code?!!
 
No doubts. Try decompiling the code yourself.
 
In both cases, objects are converted to sequence, and converted means new enumerator created, which runs through list SEQUENTIALLY, essentially serializing the set. Don't forget extra multi-step call to enumerator (.MoveNext, current) which adds overhead in access - in addition to conversion to a 'sequence' per se. You are measuring efficiency of derived enumerators, not core objects functionality.
 
No, you're simply wrong. Nothing gets converted. Period. Both string and HashSet<char> classes ARE already sequences, are already IEnumerable<char>. When I call ((IEnumerable<char>)string).Contains, I call the same string.Contains method. When I call ((IEnumerable<char>)HashSet<char>).Contains, I still call the same HashSet<char>.Contains method.

About the enumerator — it's not called. These sequences are not being enumerated in my benchmark. I don't use a foreach loop.
 
Man, just copy the method, and make one accept string, another hashset.
And let's try the third test - try sending charset as char[] array, with explicit loop (in a separate function for now).
 
Here you are:
https://gist.github.com/yallie/5012163

And here are the results, still the same:

String, small. Time elapsed: 00:00:01.1362458, single iteration: 0,011362458 ms
String, large. Time elapsed: 00:00:05.1326987, single iteration: 0,051326987 ms
HashSet<char>, small. Time elapsed: 00:00:00.3934139, single iteration: 0,003934139 ms
HashSet<char>, large. Time elapsed: 00:00:00.3017178, single iteration: 0,003017178 ms
char[], small. Time elapsed: 00:00:01.2390625, single iteration: 0,012390625 ms
char[], large. Time elapsed: 00:00:01.8747489, single iteration: 0,018747489 ms

Array.Contains is faster than string.Contains, but much slower than HashSet<char>.Contains.
Coordinator
Feb 22, 2013 at 8:58 AM
in the old version, with IEnumerable<char>, inside Benchmark method, right-click over set.Contains method - it carries you to System.Linq.Enumerable class, extension method Contains. This contains knows nothing about efficient lookup inside Hashset (which is behind IEnumerable) - all it can do is go through the sequence, one-by-one. You probably looked at IL of the last version, with separate method versions.
Developer
Feb 22, 2013 at 8:58 AM
Oops, Roman, sorry! The magic behind all that is very different.
My arguments are all wrong because IEnumerable doesn't have Contains method :)
Coordinator
Feb 22, 2013 at 9:05 AM
char[] version is wrong - again, you're using Contains from Enumerable.cs. Please remove System.Linq from used namespaces. When you do, you'll see also that string.Contains is not a native method - it is LINQ extension. So we need to use IndexOf method instead - sorry, my bad.
For the char[] version, pls use the following function:

static bool ArrayContains(char ch, char[] inArray) {
  for (int i = 0; i < inArray.Length; i++)
    if (ch == inArray[i]) return true;
  return false;
}
Developer
Feb 22, 2013 at 9:06 AM
Edited Feb 22, 2013 at 9:07 AM
in the old version, with IEnumerable<char>, inside Benchmark method, right-click over set.Contains method - it carries you to System.Linq.Enumerable class, extension method Contains. This contains knows nothing about efficient lookup inside Hashset (which is behind IEnumerable) - all it can do is go through the sequence, one-by-one. You probably looked at IL of the last version, with separate method versions.
 
Well, not exactly! LINQ has a special treatment for that exact case (for careless guys like me):
 
// in static class Enumerable
public static bool Contains<TSource>(this IEnumerable<TSource> source, TSource value)
{
    ICollection<TSource> is2 = source as ICollection<TSource>;
    if (is2 != null)
    {
        return is2.Contains(value); // this guy uses native HashSet<char>.Contains
    }
    return source.Contains<TSource>(value, null);
}
Coordinator
Feb 22, 2013 at 9:06 AM
going to bed, it's too late already here. Thanks, and let's continue tomorrow.
Roman
Developer
Feb 22, 2013 at 9:16 AM
Edited Feb 22, 2013 at 9:18 AM
char[] version is wrong - again, you're using Contains from Enumerable.cs. Please remove System.Linq from used namespaces. When you do, you'll see also that string.Contains is not a native method - it is LINQ extension. So we need to use IndexOf method instead - sorry, my bad.
 
Ah shit! I should have used Visual Studio to write these benchmarks... Notepad shows its ugly face :)
I was assuming that string.Contains is a native method.

The updated results are very different:
https://gist.github.com/yallie/5012299

String, small. Time elapsed: 00:00:00.7501351, single iteration: 0,0015002702 ms
String, large. Time elapsed: 00:00:03.0149584, single iteration: 0,0060299168 ms
HashSet<char>, small. Time elapsed: 00:00:02.0382007, single iteration: 0,0040764014 ms
HashSet<char>, large. Time elapsed: 00:00:01.6572213, single iteration: 0,0033144426 ms
char[], small. Time elapsed: 00:00:00.6324895, single iteration: 0,001264979 ms
char[], large. Time elapsed: 00:00:04.1838394, single iteration: 0,0083676788 ms
Coordinator
Feb 22, 2013 at 4:41 PM
Edited Feb 22, 2013 at 5:05 PM
So we have a winner: for small sets it is char[] with a direct loop. For big sets HashSet is the way to go - it has the same lookup time regardless of size.
Now, an important fact to find out is the break-even point, the count at which to switch to HashSet.
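For illustration, a hybrid along these lines could look like the sketch below; the threshold of 16 is just a placeholder until the break-even point is actually measured:

using System.Collections.Generic;

class CharSet {
  const int BreakEven = 16; // placeholder; to be determined by measurement
  readonly char[] _small;
  readonly HashSet<char> _large;

  public CharSet(char[] chars) {
    if (chars.Length <= BreakEven) _small = chars;
    else _large = new HashSet<char>(chars);
  }

  public bool Contains(char ch) {
    if (_large != null) return _large.Contains(ch);
    for (int i = 0; i < _small.Length; i++)
      if (_small[i] == ch) return true;
    return false;
  }
}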

Going back to the original search in a string, let's see which is faster:
  1. Using source.IndexOfAny(chars) + a hash/array lookup of the char's info (full prefix + terminal).
  2. Using a plain loop over the string's chars, with a nested loop/lookup in first chars. Here we get the final (prefix + terminal) info directly inside the loop. One variation to try is to convert the source string to a char array at the beginning and work with this array, instead of doing a source[i] lookup in the string.
About Enumerable.Contains - didn't know that it is smart and actually checks for ICollection, good to know. It always helps to look in the source.
thanks
Roman

PS pls be careful with swearing, it is public forum after all, 400+ people watching. At least don't fully spell sh.. words
Developer
Feb 22, 2013 at 7:05 PM
Edited Feb 22, 2013 at 7:06 PM
PS pls be careful with swearing, it is public forum after all, 400+ people watching. At least don't fully spell sh.. words
 
Sure, no problem.
 
lookup table - should it be a dictionary? Dictionary lookup is complex operation; hashing the char (how expensive it is?), then deriving the index and doing 'bucket' lookup in internal array(s). What about a plain array of 256 objects,
 
Just tried to write CharDictionary<T> to replace Dictionary<char, T>:
https://gist.github.com/yallie/5016153

It's more memory-hungry, but performs ~4 times faster. Here are the results:
 
Dictionary. Time elapsed: 3.8990184, single iteration: 0,00038990184 ms
CharDictionary. Time elapsed: 0.9705137, single iteration: 9,705137E-05 ms
Coordinator
Feb 22, 2013 at 8:44 PM
Edited Feb 22, 2013 at 8:47 PM
Looks good, but not exactly what I meant. You do not need UnicodeTable that eats all this memory. In 99.9% of cases the chars we lookup will be special characters (like { + - / ) so they'll always be in Ascii range. Instead, make AsciiTable contain an element (array in fact):

class CharLookupData {
  public Tuple<char, T>[] Data;
}
public CharLookupData[] AsciiTable;

In most cases the Data array will have a single element. When looking up, get the tuple array from AsciiTable, then compare the char and return the T value if it matches (repeat if the array has more than one element).
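The lookup over AsciiTable would then be roughly the following (a sketch; the table fill-up at construction time and the generic T are assumed to live in the containing dictionary class):

public bool TryGetValue(char ch, out T value) {
  var slot = AsciiTable[ch & 0xFF]; // index by the lower byte of the char
  if (slot != null) {
    var data = slot.Data;
    for (int i = 0; i < data.Length; i++) { // usually a single element
      if (data[i].Item1 == ch) {
        value = data[i].Item2;
        return true;
      }
    }
  }
  value = default(T);
  return false;
}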

Edited: fixed the declaration code, introduced CharLookupData;
Developer
Feb 24, 2013 at 1:00 AM
Edited Feb 24, 2013 at 1:07 AM
You do not need UnicodeTable that eats all this memory. In 99.9% of cases the chars we lookup will be special characters (like { + - / ) so they'll always be in Ascii range.
 
It's not too greedy: UnicodeTable is a jagged array, not a two-dimensional array. For 99.9% of cases, UnicodeTable has a size of 256*sizeof(T): if T is a reference type on a 32-bit machine, that's an extra kilobyte. Every new character range with a fresh value of the high-order byte (for example, Cyrillic chars) adds an extra kilobyte to my dictionary. A language with English and localized keywords will use a CharDictionary of approx. 3 KB in size (or 6 KB on a 64-bit machine).
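For reference, here is a sketch of what that layout amounts to (this is only my reading of the description, not the actual code from the gist):

class CharDictionary<T> where T : class {
  // first level indexed by the high byte of the char; 256-entry pages allocated lazily
  readonly T[][] UnicodeTable = new T[256][];

  public T this[char ch] {
    get {
      var page = UnicodeTable[ch >> 8];
      return page == null ? null : page[ch & 0xFF];
    }
    set {
      var page = UnicodeTable[ch >> 8] ?? (UnicodeTable[ch >> 8] = new T[256]);
      page[ch & 0xFF] = value;
    }
  }
}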
 
Instead, make AsciiTable contain an element (array in fact):
 
Ok, I added CompactCharDictionary<T> that behaves according to your proposal.
https://gist.github.com/yallie/5022202

When compiled with /debug switch, it performs slower than Dictionary<char, T>:
 
Dictionary. Time elapsed: 4.3209235, single iteration: 0,00043209235 ms
CharDictionary. Time elapsed: 2.8315795, single iteration: 0,00028315795 ms
CompactCharDictionary. Time elapsed: 6.4523438, single iteration: 0,00064523438 ms
 
When compiled with /o switch, it performs ~3 times faster than Dictionary<char, T>, but ~1.5 times slower than CharDictionary<T>:
 
Dictionary. Time elapsed: 4.1239435, single iteration: 0,00041239435 ms
CharDictionary. Time elapsed: 1.0770744, single iteration: 0,00010770744 ms
CompactCharDictionary. Time elapsed: 1.4382816, single iteration: 0,00014382816 ms
Coordinator
Feb 24, 2013 at 8:05 AM
well, your test is a bit biased, as usual :) (I'm ready to agree with your suggestion, but just to be fair and make sure we really compared correctly)
First, you use System.Tuple for Compact dictionary. Look inside and you'll see it is a class with PROPERTIES. Replace it with the following:
public class Tuplex<T1, T2> {
  public T1 Item1;
  public T2 Item2;
  public Tuplex(T1 item1, T2 item2) {
    Item1 = item1;
    Item2 = item2;
  }
}
and you'll see execution time drops by around 30% (struct is slower by around 10%). So property accessors do have a price!
One other problem: in the Benchmark method you use only chars that are in the dict (you were aware of this - I see your comment that a missing entry may cause an IndexOutOfRange exception). In the Irony scanner, in a typical run the positives would be only around 10% of total dict lookups.
So I changed the input string to include different characters and left only 1 character that is used in Benchmark, so in the tests only 1 in 9 lookups would be positive. That immediately blew up in a few places. I changed the Dictionary benchmark to use TryGetValue. That made its results much worse, and it is out of the competition.
Also, I switched to VS 2012/.NET 4.5 - this sped things up a bit, and narrowed the gap between charDict and compactCharDict.
But.. surprise! This all does not matter much. When I started running .exe file directly (with release build), things speeded up by a factor of 10. I even had to increase iteration count 10x to 100 million, to get to execution time around 1 second each test, similar to your times. I guess you were running from under VS? (I don't think you have old 486 from last century :). If you did, then your numbers don't mean much. I have seen this before, VS slows down execution by factor of 5 to 10.
But comparative results are the following: Char dictionary is 5 times faster than compact dict (and standard Dict is about 20 times slower). It appears two simple array lookups with one conditional are hard to beat. It looks like we'll be using this method.
(I'm still suspicious that compactDict is at artificial disadvantage here - there's an extra wrapping into this CharLookupData, and maybe we can get rid of it - but it does not matter much)
I'm a bit puzzled by perf numbers. It looks like a single iteration (10 accesses to charDict) takes about 5 nanoseconds, which on my 2.6GHz machine is what - 10-20 instructions? that seems not enough for 10 calls. I suspected optimizer simply removes the calls (as we don't use the result), so I put the access count in dictionaries, and then printed it - no, access count is correct, things slowed down a bit, but not much. Anybody? how to interpret the numbers - is it realistic?
thanks
Roman

code is here:
https://gist.github.com/rivantsov/5022903
Developer
Feb 24, 2013 at 7:03 PM
Edited Feb 24, 2013 at 7:40 PM
But.. surprise! This all does not matter much. When I started running .exe file directly (with release build), things speeded up by a factor of 10.
 
Yes, I'm aware of that. Perhaps VS is doing some kind of instrumentation for profiling/debugging purposes.
 
I even had to increase iteration count 10x to 100 million, to get to execution time around 1 second each test, similar to your times. I guess you were running from under VS?
 
No, I'm too lazy to boot up VS to write such small snippets :)
I use Far Manager to edit these benchmarks and compile them with csc /o.
 
(I don't think you have old 486 from last century :).
 
No, but I developed my benchmarks with smaller iteration counts. For the final measurements I increased these constants. Looks like I've uploaded some of my snippets with smaller iteration counts.
 
Char dictionary is 5 times faster than compact dict (and standard Dict is about 20 times slower).
 
Ah, forgot to mention!
I've got a 64-bit machine, and the results are very different between the x64 and x32 versions of the .NET runtime.

Here are the results of your version in 64-bit mode (compiled with /o):
 
Dictionary. Time elapsed: 36.2369814, single iteration: 0,000362369814 ms.
CharDictionary. Time elapsed: 12.3337299, single iteration: 0,000123337299 ms // 2.93x
CompactCharDictionary. Time elapsed: 11.4105652, single iteration: 0,000114105652 ms // 3.17x
 
And in 32-bit mode (compiled with /o /platform:anycpu32bitpreferred):
 
Dictionary. Time elapsed: 34.2510101, single iteration: 0,000342510101 ms.
CharDictionary. Time elapsed: 1.5900425, single iteration: 1,5900425E-05 ms // 21.54x
CompactCharDictionary. Time elapsed: 8.6484248, single iteration: 8,6484248E-05 ms // 3.96x
 
The difference is dramatic. Looks like the 64-bit JIT is far less smart than its 32-bit counterpart. It also raises a new question: can we develop a version specifically tuned for the 64-bit .NET runtime? We could then select the proper dictionary class based on IntPtr.Size. UPD: looks like CompactCharDictionary is a bit faster on a 64-bit runtime, so we can try to fine-tune it.
 
I'm a bit puzzled by perf numbers. It looks like a single iteration (10 accesses to charDict) takes about 5 nanoseconds, which on my 2.6GHz machine is what - 10-20 instructions? that seems not enough for 10 calls. I suspected optimizer simply removes the calls (as we don't use the result), so I put the access count in dictionaries, and then printed it - no, access count is correct, things slowed down a bit, but not much. Anybody? how to interpret the numbers - is it realistic?
 
I think we're seeing the results of optimization by the JIT compiler and a modern multicore CPU. According to Reflector, the C# compiler doesn't do anything tricky here to optimize the generated code.

The JIT compiler probably inlines the get_Item method calls, and the CPU might run multiple independent operations in parallel on several instruction pipelines (also, it can reorder execution to optimize load). So it's hard to say what exactly is happening here, I'm afraid...
Developer
Feb 24, 2013 at 7:30 PM
Edited Feb 24, 2013 at 7:30 PM
I'm still suspicious that compactDict is at artificial disadvantage here - there's an extra wrapping into this CharLookupData, and maybe we can get rid of it - but it does not matter much
 
Yes, looks like it doesn't matter much.
https://gist.github.com/yallie/5025461
Coordinator
Feb 26, 2013 at 6:02 PM
Ok, cool. I'm still unconvinced by the explanations for the 'too good' benchmark numbers. It appears that if the processor heavily parallelizes our get_value['x'] calls inside the benchmark loops, then what we are getting is a distorted picture. We have 10 calls inside the benchmark loop, and what we're getting is the execution time of 2 or 3 consecutive calls (10 calls executed on 4 pipelines, let's assume). This won't happen in a real app - there would be no 10 calls to the same get[] method that don't use the return value - so no chance for such an efficient optimization. On the other hand, the 10 calls in a row were introduced to reduce the impact of the containing loop. So the benchmarks are somewhat misleading - can we do something to suppress parallelization of these calls? But one thing is for sure: an array-based dictionary is a better choice for the very sparse sets that we expect.
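One thing that might help (just an idea to try, not something measured here) is to chain the calls, so that each key depends on the previous lookup's result, and to consume the final value so the JIT can't drop the loop:

using System;
using System.Collections.Generic;
using System.Diagnostics;

static class ChainedBenchmark {
  // 'dict' stands for whichever dictionary variant is under test, 'keys' for the
  // mix of hit/miss characters; copy the method per variant, as before.
  public static TimeSpan Run(Dictionary<char, int> dict, char[] keys, int iterCount) {
    int sink = 1;
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < iterCount; i++) {
      // the next key depends on the previous result, so the lookups cannot overlap
      char key = keys[(sink & 0x7FFFFFFF) % keys.Length];
      int value;
      sink = sink * 31 + (dict.TryGetValue(key, out value) ? value : i);
    }
    sw.Stop();
    Console.WriteLine(sink); // consume the result so the JIT cannot remove the loop
    return sw.Elapsed;
  }
}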

Going back to the original search in a string, let's see which is faster:
  1. Using source.IndexOfAny(chars) + a lookup of the char's info (full prefix + terminal), then checking the full prefix at the current position (if the prefix is longer than 1 char).
  2. Using a plain loop over the string, getting the current char with string[index] plus a lookup in the first-chars dict. Here we get the final (prefix + terminal) info directly inside the loop, then check the full prefix at the current position (see the sketch below).
  3. Same as #2, but converting the source string to a char array at the beginning and using chars[index], then checking the full prefix at the current position.
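As a sketch of option #2 (PrefixInfo and the prebuilt firstCharTable are assumptions for illustration; case sensitivity is ignored here):

// firstCharTable maps a first char to the candidate prefixes/terminals starting with
// it, longest prefix first; it is built once at grammar construction time.
PrefixInfo FindNextPrefix(string source, int startPos, out int foundPos) {
  for (int pos = startPos; pos < source.Length; pos++) {
    PrefixInfo[] candidates;
    if (!firstCharTable.TryGetValue(source[pos], out candidates))
      continue;
    for (int i = 0; i < candidates.Length; i++) {
      var p = candidates[i];
      if (pos + p.Prefix.Length <= source.Length &&
          string.CompareOrdinal(source, pos, p.Prefix, 0, p.Prefix.Length) == 0) {
        foundPos = pos;
        return p; // p.Terminal is the terminal to try at this position
      }
    }
  }
  foundPos = -1;
  return null;
}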
Feb 27, 2013 at 10:31 PM
Guys,

I find your discussion fascinating and slightly intimidating (because my level of C# sophistication is pretty much at your toes if not below).

So, shut me up quickly if my thought is completely stupid.

Here it is.
The idea could work if it is possible to look at the input stream as a bit pattern (much like the zip utilities do).

When the grammar is initially defined, take the most frequently used tokens (those that Roman outlines) and convert them to a bit pattern that can hold the largest number of bits for some numeric type (Double, Long, Longer - whatever - you guys should know better) - the idea here is to be able to have the simplest comparison with minimal to no iteration (depending on the length of the token).
Store these guys in some array (or the quickest lookup structure) - this gets done once upfront at grammar definition time.

Then, when parsing, use the next chunk of bits that should match some token for comparison between bit patterns (those in the array of tokens to look up vs. the next token in the bit stream).

Again, there is insufficient detail for a quick implementation, and it may be totally bogus...

Bey
Coordinator
Feb 28, 2013 at 6:22 AM
Edited Feb 28, 2013 at 6:23 AM
Well, thanks for the suggestion, but really things are not that straightforward in text handling. We can easily get the text as a stream of bits, but... UNICODE! It is not one byte per char, and not even always 2 bytes per char... sorry, a char is always 2 bytes, but a UNICODE symbol might take two 2-byte chars!
(see here: http://csharpindepth.com/Articles/General/Unicode.aspx , scroll down to 'Surrogate pairs' section)
So when running through a bit stream which is UNICODE text, you have to know, byte by byte, what you're looking at - where a symbol begins and where it ends. Unlike a zipper, which does not care about the meaning of the bits and just searches for repeated patterns.
Therefore, you have to use an intelligent piece of code that knows how to interpret a UNICODE stream - and this is the .NET String class. No way around it. Note that although the keywords we search for are most likely regular, standard English-based chars (2 bytes each), the source stream may include anything - a comment or string literal in Chinese, for example.
Now the problem is how to do a particular search over Unicode text efficiently, using the .NET string and char classes.
Hope this clarifies the issue.

Roman
Feb 28, 2013 at 2:15 PM
Edited Feb 28, 2013 at 5:32 PM
Thanks for taking the time to respond.
Silly me, I thought (until reading the link) that 16 bits would suffice for humanity's character repertoire.
Hurray for diversity (even if it will never be used as the article suggests.)

Another question / thought.
Why wouldn't you arrange the tokens in descending 'potential prevalence'?
In other words, suppose the space character comprises 60% of the tokens and is the most prevalent one - I would place it at the top of the data structure for tokens (as it is both most likely to occur and very cheap to detect).

I understand one cannot tell exactly how prevalent some tokens are in a given input but, certainly, one can make some good guesses.
Also, there's no one better than you to introduce 'learning mode' to the scanner, feed it with some sample inputs and come up with a reasonable order.

Or is it crazier than my previous thought?
Coordinator
Mar 2, 2013 at 5:46 AM
No, it's not crazy, and it's quite reasonable to think about optimizations like these. But if you use a hashset to look up characters, there's no order at all, and regardless of frequency the lookup time is roughly the same. If we are using arrays, then yes, the order of chars in the array does matter, but... these kinds of improvements should be carefully measured, with all gains and losses taken into account:
  1. We are discussing operating at a very 'low' level, optimizing primitive loops and measuring primitive operations against each other. So introducing even a single extra operator into the loop can considerably impact performance. This is the case for 'auto-learning' on the fly. By the way, space should surely be considered a very special character and be checked first anyway.
  2. Even if there's a clear perf gain in certain cases, you should weigh it against the cost of a more complicated parser/parser construction routine. If the gain is just 1%, it's not worth introducing an extra optimizer class and .cs file. Any extra element in the overall framework is extra cognitive load, a source of 'do I need it?' questions and potential misuse, which might bring a lot of trouble.
    Irony, in my opinion, has become a bit too complicated (and messy) lately - time for some refactoring - so one of my priorities now is making it more understandable and easier to get into the internals. As with any engineering work, building a framework is about finding the best compromises, not single 'perfect' solutions.
    So if, in the end, there's a uniform and simple enough way to presort chars in arrays, or anything else like that - surely we'll go for it.
Roman
Nov 8, 2013 at 6:49 AM
Hi!

May I ask for an update on this entire rework effort (using GLR, the new scanner, better support for incomplete syntax etc.)? I really think it's a great idea, as your framework absolutely rules, but you stopped the discussions here abruptly in March, and I wonder about the status :-)

regards,
Max
Coordinator
Nov 9, 2013 at 4:27 PM
hi
I made some progress in designing, outlining and even prototyping some stuff, but got sidetracked for the last 3-4 months, really busy with other projects; the VITA project is taking a lot of time with really urgent issues; I just made another release - got a custom LINQ engine hacked out, support for MySql and a lot of other stuff. This was and is more urgent on my list of priorities - that's what we use at work, for production apps. Ah, and I was in a home renovation project for 2+ months, not much open source, just painting walls and that kinda stuff; now done, enjoying my like-new home.
I hope to get back to Irony in the evenings soon, when other pressures ease. Also, somewhere in VITA's future there's an integration with Irony, for parsing custom search queries; at that point I will definitely start hacking on Irony regularly. Sorry for no good news; thanks for caring!
Roman