Tuesday, March 1, 2016

Sitecore 8 Lucene indexing PDF content

By default, Sitecore will not index the content inside document types such as PDF or DOC. Doing so requires an iFilter and a custom Sitecore index field. Simply put, the custom index field reads the content of the document (using the iFilter) and adds it to the Lucene index, where it becomes available for searching.

A number of iFilters are available for use with PDFs. In this example I will be using the PDFBox.NET (version 1.8.9) iFilter, because it is free and installs more cleanly (nothing goes into Program Files, for example).

Installing PDFBox.NET

Download the latest version and open the package to find a folder of DLLs. Your Visual Studio solution will require the following references:
  • IKVM.OpenJDK.Core.dll
  • IKVM.OpenJDK.SwingAWT.dll
  • pdfbox-1.8.9.dll
The bin folder of your Sitecore website will also need the following DLLs:
  • commons-logging.dll
  • fontbox-1.8.9.dll
  • IKVM.OpenJDK.Text.dll
  • IKVM.OpenJDK.Util.dll
  • IKVM.Runtime.dll

Creating the custom search index field

The computed index field needs to be defined in the index definition XML document, and its logic implemented in code.
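As a rough sketch, the definition could be added with a patch file along these lines. The type and assembly names (MySite.Search.PdfContentField, MySite) are hypothetical placeholders, and the element path assumes the default Sitecore 8 Lucene configuration, so check it against your own index configuration before using it.

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <indexConfigurations>
        <defaultLuceneIndexConfiguration>
          <documentOptions>
            <fields hint="raw:AddComputedIndexField">
              <!-- Hypothetical class and assembly; replace with your own -->
              <field fieldName="_content" storageType="yes" indexType="tokenized">MySite.Search.PdfContentField, MySite</field>
            </fields>
          </documentOptions>
        </defaultLuceneIndexConfiguration>
      </indexConfigurations>
    </contentSearch>
  </sitecore>
</configuration>
```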

This computed field uses the iFilter to read in the PDF content and then appends it to the main _content field in Sitecore. By default the storage type for this field is set to "no"; however, I have tested it with the content stored, which allows the context around the search term to be shown in the results.
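A minimal sketch of such a computed field is shown here. It is not necessarily the original implementation: the namespace and class name are hypothetical, and it assumes the PDFBox 1.8.x API (PDDocument and PDFTextStripper) together with IKVM's ikvm.io.InputStreamWrapper to hand the .NET media stream to the Java-based library.

```csharp
using System;
using System.IO;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;

namespace MySite.Search // hypothetical namespace
{
    public class PdfContentField : IComputedIndexField
    {
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
            // Only media library items with a .pdf extension are handled.
            Item item = indexable as SitecoreIndexableItem;
            if (item == null || !item.Paths.IsMediaItem)
            {
                return null;
            }

            var media = new MediaItem(item);
            if (!string.Equals(media.Extension, "pdf", StringComparison.OrdinalIgnoreCase))
            {
                return null;
            }

            using (Stream stream = media.GetMediaStream())
            {
                if (stream == null)
                {
                    return null;
                }

                PDDocument document = null;
                try
                {
                    // Wrap the .NET stream so PDFBox can read it as a Java InputStream.
                    document = PDDocument.load(new ikvm.io.InputStreamWrapper(stream));
                    return new PDFTextStripper().getText(document);
                }
                catch (Exception ex)
                {
                    // A broken PDF should not stop the rest of the index build.
                    Sitecore.Diagnostics.Log.Warn("Could not extract PDF text for " + item.Paths.FullPath, ex, this);
                    return null;
                }
                finally
                {
                    if (document != null)
                    {
                        document.close();
                    }
                }
            }
        }
    }
}
```

Returning null for anything that is not a PDF means the field simply adds nothing for those items.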

Querying the PDF content

As with any indexed field in Lucene, we simply map the indexed field (_content or custom) to our model and can then use predicate or LINQ logic to query it.
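For illustration, a query against the _content field might look like the sketch below. The PdfSearchResult and PdfSearch names, the index name and the result limit are assumptions made for this example. The base SearchResultItem class already exposes a Content property mapped to _content, so the extra mapped property is only needed for a custom field.

```csharp
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

namespace MySite.Search // hypothetical namespace
{
    // Hypothetical model; Content (mapped to _content) is inherited from SearchResultItem.
    public class PdfSearchResult : SearchResultItem
    {
        // Only needed if the PDF text is written to a custom field instead of _content.
        [IndexField("pdf_content")]
        public string PdfContent { get; set; }
    }

    public static class PdfSearch
    {
        public static List<PdfSearchResult> Find(string term)
        {
            // Assumed index name; use whichever index holds your media items.
            ISearchIndex index = ContentSearchManager.GetIndex("sitecore_web_index");

            using (IProviderSearchContext context = index.CreateSearchContext())
            {
                return context.GetQueryable<PdfSearchResult>()
                    .Where(result => result.Content.Contains(term))
                    .Take(20)
                    .ToList();
            }
        }
    }
}
```

If the field is stored, the Content property on each result can also be used to show the text surrounding the matched term.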
