Tuesday, March 1, 2016

Sitecore 8 Lucene indexing PDF content

By default, Sitecore will not index the content inside document types such as PDF or DOC. Doing so requires an iFilter and a custom Sitecore index field. Simply put, the custom index field reads the content of the document (using the iFilter) and adds it to the Lucene index, where it becomes available for searching.

Several iFilters are available for use with PDFs. In this example I will be using the PDFBox.NET (version 1.8.9) iFilter, because it is free and installs more cleanly (nothing is placed in Program Files, etc.).

Installing PDFBox.NET

Download the latest version and then open the package to find a folder of DLLs. Your Visual Studio solution will require the following references:
  • IKVM.OpenJDK.Core.dll
  • IKVM.OpenJDK.SwingAWT.dll
  • pdfbox-1.8.9.dll
The bin folder of your Sitecore web site will also require the following DLLs:
  • commons-logging.dll
  • fontbox-1.8.9.dll
  • IKVM.OpenJDK.Text.dll
  • IKVM.OpenJDK.Util.dll
  • IKVM.Runtime.dll

Creating the custom search index field

The computed index field has two parts: a class that implements the actual logic, and an entry in the index definition XML document that registers it.
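using System;
using System.IO;
using System.Text.RegularExpressions;
using ikvm.io;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;

public class IndexPDF : IComputedIndexField
{
    // PDF template ID
    private static readonly ID PdfTemplateId = new ID("{0603F166-35B8-469F-8123-E8D87BEDC171}");

    /// <inheritdoc />
    public string FieldName { get; set; }
    /// <inheritdoc />
    public string ReturnType { get; set; }

    /// <inheritdoc />
    public object ComputeFieldValue(IIndexable indexable)
    {
        Item item = indexable as SitecoreIndexableItem;

        // Only PDF media items are parsed; return null if there is nothing to index
        if (item == null || !item.Paths.IsMediaItem || item.TemplateID != PdfTemplateId)
        {
            return null;
        }

        return ParsePDF(item);
    }

    private string ParsePDF(MediaItem mediaItem)
    {
        PDDocument doc = null;
        string content = string.Empty;
        InputStreamWrapper wrapper = null;

        if (mediaItem != null)
        {
            try
            {
                Stream stream = mediaItem.GetMediaStream();
                if (stream == null)
                {
                    return null; // No file attached to the media item
                }

                wrapper = new InputStreamWrapper(stream);
                doc = PDDocument.load(wrapper);
                content = new PDFTextStripper().getText(doc);
            }
            catch (Exception ex)
            {
                // Log the failure instead of silently swallowing it
                Log.Error("IndexPDF: could not parse PDF content", ex, this);
                return null;
            }
            finally
            {
                // Close each resource independently, so a failed load
                // does not leave the wrapper open
                if (doc != null)
                {
                    doc.close();
                }
                if (wrapper != null)
                {
                    wrapper.close();
                }
            }
        }

        if (!string.IsNullOrEmpty(content))
        {
            // Replace all whitespace with single space
            content = Regex.Replace(content, @"\s+", " ");
        }

        return content;
    }
}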

The field is then registered in the index definition:

<fields hint="raw:AddComputedIndexField">
  <field fieldName="_content" storageType="no" indexType="tokenized"
          patch:after="field[last()]">MyProject.IndexPDF, MyProject</field>
</fields>

This computed field uses the iFilter to read in the PDF content and then appends it to the main _content field in Sitecore. The storage type is set to no by default; however, I have also tested it with the content stored, which allows the context of the search term to be shown in the results.

Querying the PDF content

As with any indexed field in Lucene, we simply map the indexed field (_content or a custom one) to our model and can then use predicate or LINQ logic to query it.
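
The mapping and query described above can be sketched roughly as follows. This is a minimal illustration rather than the exact code used on the project: the PdfSearchResultItem model, the PdfSearch helper class, and the choice of the sitecore_web_index index are all assumptions made for the example, so adjust them to your own solution.

```csharp
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.Linq.Utilities;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical model: maps the _content index field to a typed property.
public class PdfSearchResultItem : SearchResultItem
{
    [IndexField("_content")]
    public string PdfContent { get; set; }
}

// Hypothetical helper class showing a LINQ query and a predicate query.
public static class PdfSearch
{
    // Assumed index name; use whichever index holds the computed field.
    private const string IndexName = "sitecore_web_index";

    // Simple LINQ query: items whose indexed content contains the term.
    public static List<PdfSearchResultItem> Search(string term)
    {
        ISearchIndex index = ContentSearchManager.GetIndex(IndexName);
        using (IProviderSearchContext context = index.CreateSearchContext())
        {
            return context.GetQueryable<PdfSearchResultItem>()
                .Where(result => result.PdfContent.Contains(term))
                .Take(20)
                .ToList();
        }
    }

    // Predicate query: items whose indexed content contains every term.
    // With an empty term list the predicate stays "true" and matches everything.
    public static List<PdfSearchResultItem> SearchAll(IEnumerable<string> terms)
    {
        var predicate = PredicateBuilder.True<PdfSearchResultItem>();
        foreach (string term in terms)
        {
            string current = term; // local copy for the lambda
            predicate = predicate.And(result => result.PdfContent.Contains(current));
        }

        ISearchIndex index = ContentSearchManager.GetIndex(IndexName);
        using (IProviderSearchContext context = index.CreateSearchContext())
        {
            return context.GetQueryable<PdfSearchResultItem>()
                .Where(predicate)
                .Take(20)
                .ToList();
        }
    }
}
```

Note that with the storage type set to no, the field can be searched but its text is not expected to come back on the result objects, so PdfContent will most likely be empty on the returned items unless the content is stored.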
