Google/FTS search by Michael Coles in .NET 3.5

Aug 28, 2009 at 1:16 PM

Hi,

I am having a problem getting a .dll to work properly with the newest Irony dll.

I am using Michael Coles's SearchGrammar and ConvertQuery functions and the newest Irony dll. I am using C# in Visual Studio Pro 2008 SP1. When done, this will run on .NET 3.5. There is one function, which accepts a Google-style search string and returns the FTS equivalent as a string.

I have a test app in my C# project and it is returning: "Error in string literal [Phrase]: No start/end symbols specified.".

I would love an idea of what to look for.

Here is the small library, followed by a console app for testing:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Irony.Compiler;

namespace Search
{
    public class SearchGrammar : Grammar
    {
        SearchGrammar _grammar;
        LanguageCompiler _compiler;

        public string FTS(string GQry)
        {
            _grammar = new SearchGrammar();
            _compiler = new LanguageCompiler(_grammar);
            Irony.StringSet errors = _compiler.Grammar.Errors;
            if (errors.Count > 0)
            {
                string errs = "Startup Error for " + GQry + ": ";
                foreach (string err in _compiler.Grammar.Errors)
                {
                    errs += err + "\r\n";
                }
                return errs;
            }

            try
            {
                string strFTS = null;
                AstNode root = _compiler.Parse(GQry.ToLower());
                if (!CheckParseErrors()) return "Parse Error for " + GQry;
                strFTS = SearchGrammar.ConvertQuery(root, SearchGrammar.TermType.Inflectional);

                return strFTS;
            }
            catch (Exception ex)
            {
                System.Diagnostics.Debug.WriteLine(ex.ToString());
                return "Parse Exception for " + GQry;
            }

        }

        public enum TermType
        {
            Inflectional = 1,
            Thesaurus = 2,
            Exact = 3
        }


        public SearchGrammar()
        {
            // Terminals
            var Term = new IdentifierTerminal("Term", "!@#$%^*_'.?", "-!@#$%^*_'.?0123456789");
            // The following is not very important, but makes the scanner recognize "or" and "and" as operators, not Terms
            // The "or" and "and" operator symbols found in grammar get higher priority in scanning and are checked
            // first, before the Term terminal, so Scanner produces operator token, not Term. For our purposes it does
            // not matter, we get around without it. 
//            Term.Priority = Terminal.LowestPriority;
            var Phrase = new StringLiteral("Phrase");

            // NonTerminals
            var OrExpression = new NonTerminal("OrExpression");
            var OrOperator = new NonTerminal("OrOperator");
            var AndExpression = new NonTerminal("AndExpression");
            var AndOperator = new NonTerminal("AndOperator");
            var ExcludeOperator = new NonTerminal("ExcludeOperator");
            var PrimaryExpression = new NonTerminal("PrimaryExpression");
            var ThesaurusExpression = new NonTerminal("ThesaurusExpression");
            var ThesaurusOperator = new NonTerminal("ThesaurusOperator");
            var ExactOperator = new NonTerminal("ExactOperator");
            var ExactExpression = new NonTerminal("ExactExpression");
            var ParenthesizedExpression = new NonTerminal("ParenthesizedExpression");
            var ProximityExpression = new NonTerminal("ProximityExpression");
            var ProximityList = new NonTerminal("ProximityList");

            this.Root = OrExpression;
            OrExpression.Rule = AndExpression
                              | OrExpression + OrOperator + AndExpression;
            OrOperator.Rule = Symbol("or") | "|";
            AndExpression.Rule = PrimaryExpression
                               | AndExpression + AndOperator + PrimaryExpression;
            AndOperator.Rule = Empty
                             | "and"
                             | "&"
                             | ExcludeOperator;
            ExcludeOperator.Rule = Symbol("-");
            PrimaryExpression.Rule = Term
                                   | ThesaurusExpression
                                   | ExactExpression
                                   | ParenthesizedExpression
                                   | Phrase
                                   | ProximityExpression;
            ThesaurusExpression.Rule = ThesaurusOperator + Term;
            ThesaurusOperator.Rule = Symbol("~");
            ExactExpression.Rule = ExactOperator + Term
                                 | ExactOperator + Phrase;
            ExactOperator.Rule = Symbol("+");
            ParenthesizedExpression.Rule = "(" + OrExpression + ")";
            ProximityExpression.Rule = "<" + ProximityList + ">";

            MakePlusRule(ProximityList, Term);

            RegisterPunctuation("<", ">", "(", ")");

        }


        public static string ConvertQuery(AstNode node, TermType type)
        {
            string result = "";
            // Note that some NonTerminals don't actually get into the AST tree, 
            // because of some Irony's optimizations - punctuation stripping and 
            // node bubbling. For example, ParenthesizedExpression - parentheses 
            // symbols get stripped off as punctuation, and child expression node 
            // (parenthesized content) replaces the parent ParExpr node (the 
            // child is "bubbled up").
            switch (node.Term.Name)
            {
                case "OrExpression":
                    result = "(" + ConvertQuery(node.ChildNodes[0], type) + " OR " +
                        ConvertQuery(node.ChildNodes[2], type) + ")";
                    break;

                case "AndExpression":
                    AstNode tmp2 = node.ChildNodes[1];
                    string opName = tmp2.Term.Name;
                    string andop = "";

                    if (opName == "-")
                    {
                        andop += " AND NOT ";
                    }
                    else
                    {
                        andop = " AND ";
                        type = TermType.Inflectional;
                    }
                    result = "(" + ConvertQuery(node.ChildNodes[0], type) + andop +
                        ConvertQuery(node.ChildNodes[2], type) + ")";
                    type = TermType.Inflectional;
                    break;

                case "PrimaryExpression":
                    result = "(" + ConvertQuery(node.ChildNodes[0], type) + ")";
                    break;

                case "ProximityList":
                    string[] tmp = new string[node.ChildNodes.Count];
                    type = TermType.Exact;
                    for (int i = 0; i < node.ChildNodes.Count; i++)
                    {
                        tmp[i] = ConvertQuery(node.ChildNodes[i], type);
                    }
                    result = "(" + string.Join(" NEAR ", tmp) + ")";
                    type = TermType.Inflectional;
                    break;

                case "Phrase":
                    result = '"' + ((Token)node).ValueString + '"';
                    break;

                case "ThesaurusExpression":
                    result = " FORMSOF (THESAURUS, " +
                        ((Token)node.ChildNodes[1]).ValueString + ") ";
                    break;

                case "ExactExpression":
                    result = " \"" + ((Token)node.ChildNodes[1]).ValueString + "\" ";
                    break;

                case "Term":
                    switch (type)
                    {
                        case TermType.Inflectional:
                            result = ((Token)node).ValueString;
                            if (result.EndsWith("*"))
                                result = "\"" + result + "\"";
                            else
                                result = " FORMSOF (INFLECTIONAL, " + result + ") ";
                            break;
                        case TermType.Exact:
                            result = ((Token)node).ValueString;

                            break;
                    }
                    break;

                // This should never happen, even if input string is garbage
                default:
                    throw new ApplicationException("Converter failed: unexpected term: " +
                        node.Term.Name + ". Please investigate.");

            }
            return result;
        }

        private bool CheckParseErrors()
        {
            if (_compiler.Grammar.Errors.Count == 0)
                return true;

            return false;
        }
    }
}

Console App:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Search;

namespace ConsoleApplication1
{
    class Program
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main()
        {
            SearchGrammar cls = new SearchGrammar();
            string strRes = cls.FTS("toast -butter");
            Console.WriteLine(strRes.ToString());
        }
    }
}

Coordinator
Aug 28, 2009 at 3:33 PM

These two are definitely out of sync. The latest source version of Irony contains SearchGrammar in the Samples assembly. It is an updated version of Mike's grammar, and you should use it.

Let me know how it works; I have not had a chance to test it with real database queries.

Aug 28, 2009 at 4:18 PM
Hi rivantsov,

Yes, I saw that in the discussion on the article site.

I have irony-20817 source building successfully.

But when I search the 020.Irony.Samples project, or even the whole solution, I don't see SearchGrammar.

Sorry to be thick, I'm guessing that there is something there I am missing...

Bill

Coordinator
Aug 28, 2009 at 4:37 PM

Ah, you're talking about the source attached to the article? You should go to the "Source code" page on this site and download the latest source zip. I just checked, and the search grammar is there. You can also see it without downloading by browsing the sources in the changeset.

Coordinator
Aug 28, 2009 at 4:59 PM

Correction: I now understand the source of the trouble. You were using not the source from the article but the alpha version from the Downloads page on this site. That doesn't work either; use the latest from the Source code page.

Aug 28, 2009 at 6:23 PM

Yes, you are correct. I am now working with irony-32357 instead of irony-20817.

I am having a small problem modifying my code to work with your newest library.

This is what I have so far, but it does not work. Can you offer insight? Also, can you tell me how best to return a simple error message from the compiler class?

When this is done, I will post it so that others will have a working version. I will also add a comment on Michael Coles's article discussion, pointing people back here.

        SearchGrammar _grammar;
        Compiler _compiler;
        CompilerContext _compilerContext;
        ParseTree _parseTree;

        public string FTS(string GQry)
        {
            try
            {
                _grammar = new SearchGrammar();
                _compiler = new Compiler(_grammar);
                _compilerContext = new CompilerContext(_compiler);
                _parseTree = null;
                string strFTS = null;
                _compiler.Parse(_compilerContext, GQry.ToLower(), "<source>");
                _parseTree = _compilerContext.CurrentParseTree;
                strFTS = SearchGrammar.ConvertQuery(_parseTree.Root, SearchGrammar.TermType.Inflectional);

                return strFTS;
            }
            catch (Exception ex)
            {
                System.Diagnostics.Debug.WriteLine(ex.ToString());
                return "Parse Exception for " + GQry;
            }

        }

Coordinator
Aug 28, 2009 at 6:38 PM

what's the error you are getting?

 

Aug 28, 2009 at 6:48 PM
Object reference not set to an instance of an object.

Aug 28, 2009 at 7:27 PM

Error: Object reference not set to an instance of an object.

Here is my current code with error checking. Thanks for any suggestions.

 

        SearchGrammar _grammar;
        Compiler _compiler;
        CompilerContext _compilerContext;
        ParseTree _parseTree;


        public string FTS(string GQry)
        {
            try {
                _grammar = new SearchGrammar();
                _compiler = new Compiler(_grammar);
                _compilerContext = new CompilerContext(_compiler);
                _parseTree = null;
                string strFTS = null;

                try {
                    _compiler.Parse(_compilerContext, GQry.ToLower(), "<source>");
                }
                catch (Exception ex) {
                    throw new ApplicationException("Parse error: " + ex.Message.ToString());
                }
                finally {
                    _parseTree = _compilerContext.CurrentParseTree;
                    ShowCompilerErrors();
                }

                strFTS = SearchGrammar.ConvertQuery(_parseTree.Root, SearchGrammar.TermType.Inflectional);

                return strFTS;
            }
            catch (Exception ex) {
                System.Diagnostics.Debug.WriteLine(ex.Message.ToString());
                return "Parse Exception for " + GQry + " Error: " + ex.Message.ToString();
            }

        }

        private void ShowCompilerErrors()
        {
            if (_parseTree == null || _parseTree.Errors.Count == 0) return;
            string errs = "";
            foreach (var err in _parseTree.Errors)
            {
                // Append each error instead of overwriting the previous one.
                errs += err.Location.ToString() +
                        " Error: " + err.ToString() +
                        " Parser State: " + err.ParserState.ToString() + "\r\n";
            }
            throw new ApplicationException(errs);
        }

Coordinator
Aug 28, 2009 at 8:20 PM

For now, without running the code, I see one problem: you don't check for errors after calling the Parse method, and this might be your problem. The parser does not throw exceptions when it sees a syntax error; it tries to recover and parse further to uncover all errors, which is the behavior you see in the C# compiler, for example. So even if the Parse method finishes without an exception, you should first check for errors. If there were errors, the root node might not be created at all, and that is why it blows up later in ConvertQuery.

See if that helps; if it doesn't, I will try to run it later tonight. Right now I can't.

Coordinator
Aug 28, 2009 at 8:47 PM

Correction: you do check for errors, as I see now, but you should also check whether the Root node was actually created; it might not be if there are errors.
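
A minimal sketch of the check described above: look at the error list first, then confirm a root node exists, and only then convert. It also shows one way to hand back a single plain error message. ParseOutcome and QueryGuard are hypothetical stand-ins so the example runs without Irony; in the poster's code the same information comes from _parseTree.Errors and _parseTree.Root.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the parser's result (not an Irony type).
class ParseOutcome
{
    public List<string> Errors = new List<string>();
    public object Root;   // stays null when the parser could not build a tree
}

static class QueryGuard
{
    // Returns true when the tree is safe to convert; otherwise returns false
    // and puts one human-readable message in 'message'.
    public static bool TryValidate(ParseOutcome outcome, out string message)
    {
        if (outcome == null)
        {
            message = "No parse result.";
            return false;
        }
        if (outcome.Errors.Count > 0)
        {
            message = "Parse error: " + string.Join("; ", outcome.Errors.ToArray());
            return false;
        }
        if (outcome.Root == null)
        {
            message = "Parse produced no root node.";
            return false;
        }
        message = null;
        return true;
    }
}

class Program
{
    static void Main()
    {
        var failed = new ParseOutcome();
        failed.Errors.Add("Invalid character: ','");

        string message;
        if (!QueryGuard.TryValidate(failed, out message))
            Console.WriteLine(message); // prints "Parse error: Invalid character: ','"
    }
}
```

With a guard like this in front of ConvertQuery, a failed parse comes back as a message rather than a null reference exception.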

Sep 1, 2009 at 5:43 PM

I am also having some trouble with the updated SearchGrammar that is included in the samples of the latest source code release (32357).

The things that are not working are:

  • Proximity
  • Exclusion
  • Wildcard

The source download on Michael Coles's article doesn't have these problems. It uses an older version of Irony, also.

Here are some screenshots of the exceptions that I receive.

Can you help me determine why I am having these problems?

Coordinator
Sep 1, 2009 at 5:52 PM

Ok, thank you very much, that's the feedback I needed. I will investigate asap and fix it.

Sep 1, 2009 at 6:04 PM

Hi there,

I also tracked the exclusion symbol (-) problem to the same place, where the token is null.

I notice that proximity does not work either. It DOES work if you use the NEAR keyword, but not if you use the * (asterisk) symbol.

I had not yet noticed that wildcard didn't work. Thanks ronnieoverby! Also, very nice jpgs.

best regards,
Bill Clark

Sep 1, 2009 at 8:53 PM

I also notice that any punctuation in a word causes a parse error, such as a comma, period, hyphen or single quote.

Coordinator
Sep 2, 2009 at 5:24 PM

Just uploaded the fixed solution. There is a small SearchGrammarTest console app now for testing this grammar. Try it, let me know if something is still broken

Roman

Sep 2, 2009 at 6:25 PM

Just a couple things:

A dash in a word acts as an exclusion, e.g. bread-basket returns ( FORMSOF (INFLECTIONAL, bread)  AND NOT  FORMSOF (INFLECTIONAL, basket) )

This may just be a bug in the console app, but bread,butter returns Error: Invalid character: ',' at 1, 6

thanks!

Bill

Coordinator
Sep 2, 2009 at 6:34 PM

What's the correct interpretation of the comma? I don't see it handled in Mike's original grammar.

For dash - will be looking into this...

Coordinator
Sep 2, 2009 at 6:40 PM

For a dash, just add a dash to the second parameter of the term constructor in the CreateTerm method:

IdentifierTerminal term = new IdentifierTerminal(name, "!@#$%^*_'.?-", "!@#$%^*_'.?0123456789");

I will fix it here and it will be in the next upload

Sep 2, 2009 at 6:47 PM

I think punctuation inside a word should be treated as part of the word.

This mostly works now:

bread.butter returns FORMSOF (INFLECTIONAL, bread.butter)

bread'butter returns FORMSOF (INFLECTIONAL, bread'butter)

but bread,butter returns the error. Also bread"butter returns a similar error.

 

Coordinator
Sep 2, 2009 at 6:51 PM

Well, I'm not sure about this, that is, about allowing any punctuation inside a word.

For a dash, I can see cases where a dashed word should be treated as a single word, like hyphenated last names; the same goes for an apostrophe (Irish last names like O'Hara).

But for a comma?

In any case, just add the extra chars you want to the second parameter of IdentifierTerminal.

 

Sep 2, 2009 at 7:50 PM

OK on punctuation inside a word.

When I add a dash to 2nd param of IdentifierTerminal, I get this:

Enter query>smith-jones
Result:
( FORMSOF (INFLECTIONAL, smith)  AND  FORMSOF (INFLECTIONAL, -jones) )

Should it return (FORMSOF(INFLECTIONAL, smith-jones)) ?

BTW: Thanks for all your help and the excellent code!

 

Coordinator
Sep 2, 2009 at 8:05 PM

Are you sure you added the dash to the second parameter and not the third (which would be wrong)?

it all works here...

Sep 2, 2009 at 8:10 PM

You called it. It works when the dash is in the 2nd param (not the 3rd).

Do you want to add a message to Michael Coles's article discussion describing the latest updates, or should I?

Thanks again for everything!
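
A rough model of why the position of the dash matters. It assumes (the thread does not spell this out) that the second constructor argument lists extra characters allowed after the first character of a term, and the third lists extra characters allowed as the first character. The scanner below is a hand-written stand-in, not Irony's implementation; TermScanModel and SplitTerms are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class TermScanModel
{
    // Simplified model (an assumption, not Irony's code): a term starts with a
    // letter or a character from extraFirstChars, and continues with letters,
    // digits, or characters from extraChars. Anything else is skipped.
    public static List<string> SplitTerms(string input, string extraChars, string extraFirstChars)
    {
        var terms = new List<string>();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            bool canStart = char.IsLetter(c) || extraFirstChars.IndexOf(c) >= 0;
            if (!canStart)
            {
                i++;            // not a valid first character: skip it
                continue;
            }
            int start = i++;
            while (i < input.Length &&
                   (char.IsLetterOrDigit(input[i]) || extraChars.IndexOf(input[i]) >= 0))
            {
                i++;
            }
            terms.Add(input.Substring(start, i - start));
        }
        return terms;
    }
}

class Program
{
    static void Main()
    {
        // Dash in the second argument: the dash continues the term.
        var second = TermScanModel.SplitTerms("smith-jones", "!@#$%^*_'.?-", "!@#$%^*_'.?0123456789");
        Console.WriteLine(string.Join(" | ", second.ToArray())); // prints "smith-jones"

        // Dash in the third argument: the dash can only start a new term.
        var third = TermScanModel.SplitTerms("smith-jones", "!@#$%^*_'.?", "-!@#$%^*_'.?0123456789");
        Console.WriteLine(string.Join(" | ", third.ToArray())); // prints "smith | -jones"
    }
}
```

Under this model the second case reproduces the split reported above (smith and -jones as two terms), while the first keeps smith-jones whole.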

 

Coordinator
Sep 2, 2009 at 8:18 PM

Please add it to the article's log. Did you try it with a database? Or do you think you can be confident it will work based on testing with my console app?

thank you - for helping me fix it

Roman

Coordinator
Sep 2, 2009 at 8:20 PM

BTW, it would be nice to post the update of the entire solution, including Michael's test UI app.

Sep 2, 2009 at 8:45 PM

Yes, I have tried it with a small dataset with 5 tables and 20 fields set up with fts indexes. It works.

My C# solution compiles to a Search.dll that calls the Irony dll. I am calling it from an aspx page. If I can, I will post it here as a zip.

First, I will have to confirm with my boss that it is ok for me to do that.

Bill

Coordinator
Sep 2, 2009 at 11:59 PM

One piece of advice, just to make sure you're aware of this.

To avoid rebuilding the parser data (LanguageData) on each request in a web app, build it once and save it in a server-wide cache, such as the Application object or whatever is available for shared objects. Initial data construction takes much longer (tens of milliseconds) than actually parsing and converting a query (microseconds). Language data is immutable (well, almost: there are some mutable but thread-safe parts), so it is safe to share between multiple parsers running on different threads.

 

 

Sep 3, 2009 at 11:30 AM

Hi Roman,

Thanks for the tip on caching. I haven't done this lately. I am working now in .net 3.5.

I suppose, I would split my function into 2. One function would load the grammar and parser objects.

Then another function would call the ConvertQuery function each time a translation is required.

I think I need to also create a signed dll so I have a Public Key token to use in web.config.

Finally, I think I need to register the loader function in the GAC using gacutil.

Does this seem correct?

Bill

Coordinator
Sep 3, 2009 at 7:10 PM
Edited Sep 4, 2009 at 3:36 PM

No, you misunderstood. The first part is correct - you split into two methods.

You don't need a signed dll or to mess with the GAC. I'm not talking about the global assembly cache, which is for caching assemblies. I meant caching the LanguageData object instance inside the web server cache.

I think we have an even easier solution. You can add the following static singleton property to the SearchGrammar class:

   private static LanguageData _languageInstance;

   public static LanguageData LanguageInstance {
      get {
         if (_languageInstance == null) {
            lock (typeof(SearchGrammar)) {
               if (_languageInstance == null)
                  _languageInstance = new LanguageData(new SearchGrammar());
            }//lock
         }//if
         return _languageInstance;
      }//get
   }//property

(This double-checking for null and lock in between is a standard "thread-safe singleton pattern")

Then you can create a Parser on each request:

    var parser = new Parser(SearchGrammar.LanguageInstance);

In this case you will have one shared instance of the language data per app domain. ASP.NET may create several domains on one server (possibly, I'm not sure), but this small duplication is OK, I think. The main point is that ASP.NET keeps a domain alive across multiple requests, so static data remains intact and LanguageData can be reused when a request is processed in the same domain.
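
A self-contained sketch of the same double-checked pattern, with two changes that are my own suggestions rather than anything from the post: the field is marked volatile, and the lock is taken on a private object instead of the type. ExpensiveData is a hypothetical stand-in for LanguageData, so the example runs without Irony, and it avoids Lazy<T>, which is not available in .NET 3.5.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for an object that is slow to build, like LanguageData.
class ExpensiveData
{
    public static int BuildCount;

    public ExpensiveData()
    {
        Interlocked.Increment(ref BuildCount);
        Thread.Sleep(20); // simulate the expensive construction
    }
}

static class SharedData
{
    // Private lock object: no outside code can take the same lock.
    private static readonly object _sync = new object();

    // volatile keeps the unlocked first check from seeing a stale value.
    private static volatile ExpensiveData _instance;

    public static ExpensiveData Instance
    {
        get
        {
            if (_instance == null)           // fast path, no lock
            {
                lock (_sync)
                {
                    if (_instance == null)   // re-check inside the lock
                        _instance = new ExpensiveData();
                }
            }
            return _instance;
        }
    }
}

class Program
{
    static void Main()
    {
        var threads = new Thread[8];
        for (int i = 0; i < threads.Length; i++)
        {
            threads[i] = new Thread(() => { var unused = SharedData.Instance; });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();

        Console.WriteLine(ExpensiveData.BuildCount); // prints 1
    }
}
```

A plain static readonly field with an initializer would also give one instance per app domain; the explicit lock is kept here only to mirror the property in the post.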

 

Sep 3, 2009 at 9:28 PM

Hi Roman,

That sounds good. I will try this. But not right away. I have vacation and my kids start school next week, so I will not be at the keyboard for a couple weeks.

best regards,

Bill

Sep 12, 2009 at 4:29 PM

I have been wrestling with signed dlls and global assemblies, and it seems there is an easier method. Could you point me in the right direction as to how to actually get this into the web server cache? I have been up and down many a path and am now clear that I don't know how to get this into the web server cache.

 

-Michael

 

Coordinator
Sep 14, 2009 at 2:47 AM

I don't understand what the problem is here. The web server cache is a performance optimization, no more. It will still work without it, just with tiny extra delays (a few milliseconds). Can you run it as is on a web page? First make it run without optimizations, then move to the static field for LanguageData that I described.

Am i missing something?

Roman