Tuesday, April 12, 2016

Using Sitecore search to resolve PDFs not found

When you link to a PDF or other media library item in Sitecore and later move that file, the link remains intact. Likewise, if you attempt to delete an item that is linked internally, Sitecore warns the user. The problem I faced was with PDF files that are linked externally (via email and third-party websites, for example), combined with a migration from another CMS to Sitecore involving a large number of PDFs.

The solution was to index all PDFs in the media library by name, using a Lucene search index. Then, after the default Sitecore media handler, I added a custom handler for PDFs that searches that index for a file whose name matches the requested document name. If exactly one match is found, that file is served. If not, the request ultimately returns a 404.

Lucene Index
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <configuration type="Sitecore.ContentSearch.ContentSearchConfiguration, Sitecore.ContentSearch">
        <indexes hint="list:AddIndex">
          <!-- Change this to Sitecore.ContentSearch.LuceneProvider.SwitchOnRebuildLuceneIndex, Sitecore.ContentSearch.LuceneProvider if you would like indexes to be
               built in a temporary directory i.e. while rebuilding is happening, your old indexes work like normal until the rebuild is finished. -->
          <index id="PdfIndex" type="Sitecore.ContentSearch.LuceneProvider.LuceneIndex, Sitecore.ContentSearch.LuceneProvider">
            <param desc="name">$(id)</param>
            <param desc="folder">$(id)</param>
            <!-- This initializes index property store. Id has to be set to the index id -->
            <param desc="propertyStore" ref="contentSearch/indexConfigurations/databasePropertyStore" param1="$(id)" />
            <configuration ref="contentSearch/indexConfigurations/defaultLuceneIndexConfiguration">
              <include hint="list:IncludeTemplate">
                <template>{0603F166-35B8-469F-8123-E8D87BEDC171}</template> <!-- Unversioned PDF -->
              </include>
              <IndexAllFields>true</IndexAllFields>
              <fields hint="raw:AddComputedIndexField">
                <field fieldName="pdfname" storageType="yes" indexType="untokenized"
                       patch:after="field[last()]">MyProject.ComputedSearchFields.PdfCleanName, Sitecore.Common.Website</field>
              </fields>
            </configuration>
            <strategies hint="list:AddStrategy">
              <!-- NOTE: the order of these controls the execution order -->
              <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/onPublishEndAsync" />
            </strategies>
            <locations hint="list:AddCrawler">
              <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
                <Database>web</Database>
                <Root>/sitecore/media library</Root>
              </crawler>
            </locations>
          </index>
        </indexes>
      </configuration>
    </contentSearch>
  </sitecore>
</configuration>

Computed Search Field
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;

namespace MyProject.ComputedSearchFields
{
    public class PdfCleanName : IComputedIndexField
    {
        /// <inheritdoc />
        public string FieldName { get; set; }

        /// <inheritdoc />
        public string ReturnType { get; set; }

        /// <inheritdoc />
        public object ComputeFieldValue(IIndexable indexable)
        {
            // SitecoreIndexableItem converts implicitly to Item
            Item item = indexable as SitecoreIndexableItem;

            if (item == null)
            {
                return null;
            }

            // Normalise separators so that "my-file", "my_file" and "my%20file" all index as "my file"
            return item.Name.Replace("-", " ").Replace("%20", " ").Replace("_", " ");
        }
    }
}

The Handler
using System;
using System.Linq;
using System.Net;
using System.Web;
using Sitecore.Configuration;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.Linq;
using Sitecore.ContentSearch.SearchTypes;
using Sitecore.Data.Items;
using Sitecore.Diagnostics;

namespace MyProject.Handlers
{
    public class PdfRewriteHandler : IHttpHandler
    {
        public bool IsReusable
        {
            get { return false; }
        }

        public void ProcessRequest(HttpContext context)
        {
            try
            {
                var rawUrl = context.Request.RawUrl;

                // Only handle PDF requests (case-insensitive, so ".PDF" also matches)
                if (rawUrl.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                {
                    // Take the last URL segment, strip the extension, and normalise
                    // separators the same way the pdfname computed field does
                    var requestedFile = rawUrl.Substring(rawUrl.LastIndexOf('/') + 1);
                    var itemName = requestedFile.Substring(0, requestedFile.Length - ".pdf".Length)
                        .Replace("-", " ").Replace("%20", " ").Replace("_", " ");

                    var index = ContentSearchManager.GetIndex("PdfIndex");

                    using (var searchContext = index.CreateSearchContext())
                    {
                        var results = searchContext.GetQueryable<MyResultItem>()
                            .Where(resultItem => resultItem.PdfName == itemName)
                            .GetResults();

                        if (results.Hits.Count() == 1)
                        {
                            // Exactly one match found
                            MediaItem item = Factory.GetDatabase("web").GetItem(results.Hits.First().Document.ItemId);

                            if (item != null)
                            {
                                // File name offered to the browser when the PDF is saved
                                var fileName = string.Format("{0}.pdf", item.Name);

                                context.Response.Clear();
                                context.Response.ContentType = item.MimeType;
                                context.Response.AppendHeader("Content-Disposition", string.Format("inline;filename=\"{0}\"", fileName));
                                context.Response.StatusCode = (int)HttpStatusCode.OK;
                                context.Response.BufferOutput = true;
                                using (var stream = item.GetMediaStream())
                                {
                                    stream.CopyTo(context.Response.OutputStream);
                                }
                                context.Response.Flush();

                                // Return instead of calling Response.End(): End() throws a
                                // ThreadAbortException, which the catch below would log as an error
                                return;
                            }
                        }
                    }
                }
            }
            catch (Exception e)
            {
                Log.Error("PdfRewriteHandler failed to serve " + context.Request.RawUrl, e, this);
            }

            // No single match (or an error occurred): fall through to a 404
            context.Response.StatusCode = (int)HttpStatusCode.NotFound;
            context.Response.End();
        }

        public class MyResultItem : SearchResultItem
        {
            // Named PdfName so it does not hide the Name property inherited from SearchResultItem
            [IndexField("pdfname")]
            public string PdfName { get; set; }
        }
    }
}
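
The computed field and the handler each apply the same chain of replacements, and a match is only possible if both produce the same string. One way to keep them in step is a small shared helper that both call. The sketch below is an assumption on my part rather than code from the original project: the class name PdfNameNormalizer and its two methods are hypothetical, and it also strips a query string, which the handler above does not do.

```csharp
using System;

// Hypothetical shared helper (not part of the original project)
public static class PdfNameNormalizer
{
    // Turns a requested URL such as "/docs/My-File_v2.PDF?dl=1" into "My File v2"
    public static string FromUrl(string rawUrl)
    {
        // Drop any query string, then keep only the last path segment
        var path = rawUrl.Split('?')[0];
        var fileName = path.Substring(path.LastIndexOf('/') + 1);

        // Strip the extension regardless of its casing
        if (fileName.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
        {
            fileName = fileName.Substring(0, fileName.Length - ".pdf".Length);
        }

        return FromItemName(fileName);
    }

    // Same separator clean-up that the pdfname computed field applies at index time
    public static string FromItemName(string name)
    {
        return name.Replace("-", " ").Replace("%20", " ").Replace("_", " ");
    }
}
```

With something like this in place, the computed field would return PdfNameNormalizer.FromItemName(item.Name) and the handler would query for PdfNameNormalizer.FromUrl(context.Request.RawUrl).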

Web.config
<add verb="*" path="sitecore_media.ashx" type="Sitecore.Resources.Media.MediaRequestHandler, Sitecore.Kernel" name="Sitecore.MediaRequestHandler" />
<add name="PdfRewriteHandler" path="*.pdf" verb="*" type="MyProject.Handlers.PdfRewriteHandler" resourceType="Unspecified" preCondition="integratedMode" />

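We place the custom handler after the Sitecore media handler because Sitecore should get the first chance to serve the file. The custom handler only matters for requests that Sitecore cannot resolve, which would otherwise end in a 404.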

This concept could be extended by using Lucene to find files that are not exact matches (for example, files whose names are similar to the requested name or contain its keywords). In my case the requirement was an exact match; however, if more than one match were found, they could be presented on a page where the user selects the relevant one.
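
To make the keyword idea concrete, here is a minimal sketch of the matching rule on its own, run against an in-memory list rather than the Lucene index. The PdfCandidateFinder class is hypothetical and only illustrates the logic; in a real Sitecore solution the equivalent filter would need to be expressed as a query against the index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a keyword search over indexed PDF names
public static class PdfCandidateFinder
{
    // Returns every indexed name that contains all words of the requested name (case-insensitive)
    public static List<string> Find(IEnumerable<string> indexedNames, string requestedName)
    {
        var words = requestedName.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // An empty request should match nothing rather than everything
        if (words.Length == 0)
        {
            return new List<string>();
        }

        return indexedNames
            .Where(name => words.All(w => name.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }
}
```

A caller could then branch on the number of candidates: serve the file when there is exactly one, show a selection page when there are several, and return a 404 when there are none.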

Another option is to place all legacy PDF files in a single folder and set that as the crawler root in the Lucene search index configuration.
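
As a rough illustration, only the crawler root in the index configuration would need to change. The folder name below is made up for the example; substitute the path where the legacy PDFs actually live.

```xml
<locations hint="list:AddCrawler">
  <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
    <Database>web</Database>
    <!-- Hypothetical folder name: point this at the legacy PDF folder -->
    <Root>/sitecore/media library/Legacy PDFs</Root>
  </crawler>
</locations>
```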
