Monday, March 28, 2011

How to remove Renderable Text from .PDF files to allow OCR

For all those people out there (students, academics, archivists, and eBook readers) who have been stymied by Adobe® Acrobat®'s stubborn refusal to perform optical character recognition (OCR) on a document, claiming "Acrobat could not perform recognition (OCR) on this page because: This page contains renderable text.", I believe I have found a workable solution. Notice that I am not saying it is "The" solution; that would be for Adobe® to fix their software. I just think this is a workable solution that is much better than the "save to TIFF and rebuild from there" approach offered by Adobe®. Using this technique, it is possible to obtain a searchable document with selectable text while preserving the original image of the scanned document, if desired.

Basics:

  1. Print the "malfunctioning" .PDF file to the "Microsoft XPS Document Writer" printer driver (which you will need to install if it is not already present on your system).
  2. Convert the resulting .XPS file to an Acrobat® .PDF file.
  3. Perform OCR in Acrobat® using one of the three available output styles depending on the type of document you have and the results you want.
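
Before going through the steps above, it can help to check whether a file is likely to trigger the "renderable text" message at all. The following is only a rough sketch in Python, not how Acrobat® itself decides: it assumes the page content streams are either uncompressed or Flate-compressed, and it simply looks for a text-showing operator (Tj or TJ) inside a BT ... ET text object. The function name has_renderable_text is my own invention for illustration, and the heuristic can give wrong answers on encrypted files, on files that use other stream filters or object streams, or when binary image data happens to contain matching bytes.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of a PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A text object (BT ... ET) that contains a text-showing operator (Tj or TJ).
TEXT_OBJECT_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def has_renderable_text(pdf_bytes: bytes) -> bool:
    """Heuristic: return True if any stream appears to draw text."""
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj() tolerates the trailing newline before "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            # Not Flate-compressed (or another filter is in use): scan as-is.
            data = raw
        if TEXT_OBJECT_RE.search(data):
            return True
    return False
```

To try it on a file, read the file in binary mode and pass the bytes in, for example has_renderable_text(open("scan.pdf", "rb").read()), where "scan.pdf" is a placeholder for your own document. A result of True suggests the file may need the print-to-XPS treatment before Acrobat® will OCR it.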