Friday, March 11, 2016

Sitecore Lucene highlighting the search term(s) in the search results

A popular feature of any good search engine is the ability to highlight the search terms in the content shown for each search result. The main benefit is that the user can see the context in which the search term appears for each result returned, which helps them choose the result that is most relevant to them.

This feature is not provided out of the box with Sitecore, but you can implement it if you update the Lucene DLL in your Sitecore web site.
A good example of this feature in action is available at Sitecore.Context.Item, which also gives a good explanation of the process. However, I came across an issue with that code: if the content you are trying to highlight does not contain the search term(s), null is returned. This happens when the match is in one of the other fields you are searching, so when you pass the content through for highlighting, Lucene's GetBestFragment method finds nothing to highlight and returns null.

So here is my updated code which handles this:
// Namespaces used below (as found in Lucene.Net 3.0.3 and its contrib Highlighter assembly):
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;

/// <summary>
/// Highlight search term
/// </summary>
/// <param name="searchTerm">Search term</param>
/// <param name="searchContent">Search content</param>
/// <returns>Search content with highlighted search term</returns>
private static string HighlightSearchTerm(string searchTerm, string searchContent)
{
    // create analyzer
    var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

    // build a BooleanQuery containing one FuzzyQuery per word in the search term
    var booleanQuery = new BooleanQuery();
    var segments = searchTerm.ToLowerInvariant().Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
    foreach (var segment in segments)
    {
        var fuzzyQuery = new FuzzyQuery(new Term("", segment), 0.7f, 3);
        booleanQuery.Add(new BooleanClause(fuzzyQuery, Occur.SHOULD));
    }

    // create highlighter - using strong tag to highlight in this case (change as needed)
    IFormatter formatter = new SimpleHTMLFormatter("<strong>", "</strong>");

    // excerpt set to 200 characters in length
    var fragmenter = new SimpleFragmenter(200);
    var scorer = new QueryScorer(booleanQuery);
    var highlighter = new Highlighter(formatter, scorer) { TextFragmenter = fragmenter };

    // optional step to remove html tags from content
    string rawPageContent = Sitecore.StringUtil.RemoveTags(searchContent);

    // get highlighted fragment
    Lucene.Net.Analysis.TokenStream stream = analyzer.TokenStream("", new StringReader(rawPageContent));
    string highlightedFragment = highlighter.GetBestFragment(stream, rawPageContent);

    if (highlightedFragment == null)
    {
        // null is returned if no matching text is found, so fall back to the original content
        return searchContent;
    }

    return highlightedFragment;
}

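Basically, it builds a Lucene query (with fuzzy logic) against a piece of content, then uses the GetBestFragment method to place strong tags around the search terms if they are found in the content. Because the query is fuzzy, it can also pick up close variants such as plurals (for example, dogs when the search term is dog), although how reliably this works for very short terms depends on the minimum similarity (0.7) and prefix length (3) passed to FuzzyQuery.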

You can also change the HTML tags used for the highlighting if you have some custom CSS and, if relevant, update the Lucene query logic so that it matches your actual search code.
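
As a rough illustration of swapping the tags, here is a minimal sketch of a variant that takes the opening and closing markup as arguments. The class name HighlightSketch and the preTag and postTag parameters are hypothetical names introduced for this example, and it assumes Lucene.Net 3.0.3 with its contrib Highlighter assembly is referenced. It also assumes the content passed in is already plain text, so the Sitecore tag-stripping step is left out.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;

// Hypothetical helper class; not part of Sitecore or Lucene.Net.
public static class HighlightSketch
{
    // preTag and postTag are assumed parameters that wrap each matched term,
    // for example "<span class=\"hit\">" and "</span>".
    public static string Highlight(string searchTerm, string plainText, string preTag, string postTag)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        // One optional fuzzy clause per word, mirroring the query built in the post.
        var booleanQuery = new BooleanQuery();
        var segments = searchTerm.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var segment in segments)
        {
            var fuzzyQuery = new FuzzyQuery(new Term("", segment), 0.7f, 3);
            booleanQuery.Add(new BooleanClause(fuzzyQuery, Occur.SHOULD));
        }

        // The formatter decides which markup surrounds each highlighted term.
        var highlighter = new Highlighter(new SimpleHTMLFormatter(preTag, postTag), new QueryScorer(booleanQuery))
        {
            TextFragmenter = new SimpleFragmenter(200)
        };

        var stream = analyzer.TokenStream("", new StringReader(plainText));

        // GetBestFragment returns null when nothing matches; fall back to the input text.
        return highlighter.GetBestFragment(stream, plainText) ?? plainText;
    }
}
```

A call such as HighlightSketch.Highlight(term, content, "<span class=\"hit\">", "</span>") would then let a CSS class control the styling instead of the strong tag.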
