Several iFilters are available for use with PDFs:
- Foxit PDF iFilter (paid)
- Adobe PDF iFilter (free)
- PDFBox.NET (free)
In this example I will be using the PDFBox.NET (version 1.8.9) iFilter, because it is free and installs more cleanly (nothing is placed in Program Files, for example).
Installing PDFBox.NET
Download the latest version and open the package to find a folder of DLLs. Your Visual Studio solution will require references to the following:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.8.9.dll
The bin folder of your Sitecore website will also need the following DLLs:
- commons-logging.dll
- fontbox-1.8.9.dll
- IKVM.OpenJDK.Text.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
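Before wiring anything into Sitecore, it can help to confirm that the DLLs resolve by extracting text from a PDF on disk. The following is only a sketch, assuming the PDFBox 1.8.x API as exposed through IKVM; the `PdfSmokeTest` class name and the file path are placeholders of mine, not part of PDFBox.NET.

```csharp
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public static class PdfSmokeTest
{
    public static void Main(string[] args)
    {
        // Hypothetical path; point this at any PDF you have locally
        string path = args.Length > 0 ? args[0] : @"C:\temp\sample.pdf";

        PDDocument doc = null;
        try
        {
            // Load the document and pull out its plain text
            doc = PDDocument.load(path);
            string text = new PDFTextStripper().getText(doc);
            Console.WriteLine(text);
        }
        finally
        {
            // Always release the document, even if extraction failed
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

If one of the DLLs listed above is missing from the output folder, I would expect this to fail with an assembly or type load error rather than print any text.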
Creating the custom search index field
The computed index field needs to be defined in the index definition XML document, and the logic behind it implemented in a class.
using System;
using System.Text.RegularExpressions;
using ikvm.io;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data;
using Sitecore.Data.Items;

public class IndexPDF : IComputedIndexField
{
    /// <inheritdoc />
    public string FieldName { get; set; }

    /// <inheritdoc />
    public string ReturnType { get; set; }

    /// <inheritdoc />
    public object ComputeFieldValue(IIndexable indexable)
    {
        Item item = indexable as SitecoreIndexableItem;
        if (item != null && item.Paths.IsMediaItem) // Only for media items
        {
            try
            {
                if (item.TemplateID == new ID("{0603F166-35B8-469F-8123-E8D87BEDC171}")) // PDF template ID
                {
                    return ParsePDF(item);
                }
            }
            catch (Exception)
            {
                return null;
            }
        }
        return null; // Return null if nothing to index
    }

    private string ParsePDF(MediaItem mediaItem)
    {
        PDDocument doc = null;
        string content = string.Empty;
        InputStreamWrapper wrapper = null;
        if (mediaItem != null)
        {
            try
            {
                // Wrap the .NET media stream so PDFBox can read it as a Java InputStream
                wrapper = new InputStreamWrapper(mediaItem.GetMediaStream());
                doc = PDDocument.load(wrapper);
                content = new PDFTextStripper().getText(doc);
            }
            catch (Exception)
            {
                return null;
            }
            finally
            {
                // Close each resource independently so neither leaks if the other was never created
                if (doc != null)
                {
                    doc.close();
                }
                if (wrapper != null)
                {
                    wrapper.close();
                }
            }
        }
        if (!string.IsNullOrEmpty(content))
        {
            // Replace all whitespace with single space
            content = Regex.Replace(content, @"\s+", " ");
        }
        return content;
    }
}
<fields hint="raw:AddComputedIndexField">
  <field fieldName="_content" storageType="no" indexType="tokenized" patch:after="field[last()]">Myproject.IndexPDF, MyProject</field>
</fields>
This computed field uses the iFilter to read in the PDF content and then appends it to the main _content field in Sitecore. The storage type is set to no in this example, so the text is indexed but not stored. However, I have also tested it with the content stored, which allows the context of the search term to be shown in the results.
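To illustrate what stored content makes possible, here is a minimal sketch of pulling a snippet of text around the first occurrence of a search term. The `SearchSnippet` class and its `Extract` method are hypothetical helpers of mine, not part of Sitecore or PDFBox.NET, and the sketch assumes the stored field value has already been read back into a string.

```csharp
using System;

public static class SearchSnippet
{
    // Returns up to `radius` characters either side of the first
    // case-insensitive match of `term`, or null if there is no match.
    public static string Extract(string content, string term, int radius)
    {
        if (string.IsNullOrEmpty(content) || string.IsNullOrEmpty(term))
        {
            return null;
        }

        int hit = content.IndexOf(term, StringComparison.OrdinalIgnoreCase);
        if (hit < 0)
        {
            return null;
        }

        // Clamp the window to the bounds of the content
        int start = Math.Max(0, hit - radius);
        int end = Math.Min(content.Length, hit + term.Length + radius);
        return content.Substring(start, end - start);
    }
}
```

For example, `SearchSnippet.Extract("The quick brown fox jumps", "brown", 6)` returns `"quick brown fox j"`. Because the computed field already collapses whitespace to single spaces, the snippet should not need further cleanup before display.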
Querying the PDF content
As with any indexed field in Lucene, we simply map the indexed field (_content or a custom one) to our model and can then use predicate or LINQ logic to query it.
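A minimal sketch of that mapping and query, using the Sitecore ContentSearch API, might look like the following. The `PdfSearchResult` and `PdfSearch` names, the `sitecore_web_index` index name, and the limit of 20 results are all assumptions of mine for illustration; adjust them to your own solution.

```csharp
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.Linq.Utilities;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical model: maps the indexed _content field onto a property
public class PdfSearchResult : SearchResultItem
{
    [IndexField("_content")]
    public string PdfContent { get; set; }
}

public static class PdfSearch
{
    // indexName is an assumption; use whichever index holds your media items
    public static List<PdfSearchResult> Find(string term, string indexName = "sitecore_web_index")
    {
        ISearchIndex index = ContentSearchManager.GetIndex(indexName);
        using (IProviderSearchContext context = index.CreateSearchContext())
        {
            // Build the predicate separately so further conditions can be added later
            var predicate = PredicateBuilder.True<PdfSearchResult>();
            predicate = predicate.And(r => r.PdfContent.Contains(term));

            return context.GetQueryable<PdfSearchResult>()
                .Where(predicate)
                .Take(20)
                .ToList();
        }
    }
}
```

Note that with the storage type set to no, the field can be queried but I would not expect `PdfContent` to be populated on the returned items; store the content if you need to read it back.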