
Monday, March 28, 2011

How to remove Renderable Text from .PDF files to allow OCR

For all those people out there - students, academics, archivists, and eBooks readers - who have been stymied by Adobe® Acrobat's® stubborn refusal to perform optical character recognition (OCR) on a document, claiming: "Acrobat could not perform recognition (OCR) on this page because: This page contains renderable text." - I believe I have found a workable solution. Notice, I am not saying it is "The" solution. That would be for Adobe® to fix their software. I just think this is a workable solution which is much better than the "save to TIFF and rebuild from there" solution offered by Adobe®. Using this technique, it is possible to obtain a searchable and text-select-able document while preserving the original image of the scanned document, if desired.

Basics:

  1. Print the "malfunctioning" .PDF file to the "Microsoft XPS Document Writer" printer driver (which you will need to install).
  2. Convert the resulting .XPS file to an Acrobat® .PDF file.
  3. Perform OCR in Acrobat® using one of the three available output styles depending on the type of document you have and the results you want.

Preliminary Notes:

  • You need the full (or "Pro") version of Adobe® Acrobat® to complete this procedure. However, since that same program is required to perform OCR from within Acrobat®, and anyone reading this would normally be able to run OCR but cannot for some specific document(s), I assume the reader has access to the "Pro" version of Adobe® Acrobat®, henceforth referred to simply as "Acrobat®." I use Acrobat 9 Pro®, but these procedures will likely work on any relatively recent version of the product.
  • This trick can only be done on Windows® computers, but the resulting file can then be used anywhere.
  • Although this trick does not require a lot of tedious manual labor, it does take up a lot of computer time and processing power.
  • I recommend testing these procedures out on individual - extracted - pages of your document, both to ensure you understand the process and to allow you to quickly try different variations so you can decide which result you like best.
    • To extract a single page in Acrobat®:
      1. Open the thumbnail pane.
      2. Select a sample page.
      3. Right-click and choose "Extract Pages" and follow the prompts.
        • (Name the files appropriately so you can better judge the results of your experiments.)
    • (You may want to choose three different pages - text only, line drawing or graphics heavy, and photographic image heavy - to experiment around with.)
  • This process generates some really large transitional files. Your final files are likely to be somewhat larger than the original file, depending on the original document and which OCR output style you choose. However, they will also be a lot more useful.

Full Procedure:

Install the XPS printer driver if you don't already have it on your computer:

XPS is Microsoft's® answer to the Adobe® Acrobat® file format. It stands for "XML Paper Specification," following Microsoft's® habit of using generic naming for their products, as if they were the only product of their type in existence. From what I have read, it is supposedly similar to Acrobat® except that everything is in XML and can therefore be read by humans. It also makes for some excessively large files. Fortunately, we don't have to leave our files in this format. We use it only as a transitional format, because converting to it strips out the bothersome "renderable text."

  1. Download the XPS printer driver here: http://www.microsoft.com/downloads/en/details.aspx?FamilyID=b8dcffdd-e3a5-44cc-8021-7649fd37ffee&displaylang=en.
  2. Save the file where you can find it then double-click it to start the install. Follow the prompts to complete the install.
    • This will create a new printer in your "Printers and Faxes" folder. To print to it, you simply choose that printer instead of your regular printer when you print a document.

Print the .PDF file to the .XPS "printer":

  1. Open the file in question using the latest version of Acrobat Reader and follow these GCGUINS instructions to "print" it to the "Microsoft XPS Document Writer" printer driver, just as you would when "printing" to an Acrobat® .PDF file: { File / Print ; Printer, Name = Microsoft XPS Document Writer[v] ; [Properties] ; <Layout> ; [Advanced] ; Microsoft XPS .../Document Options/ Interleaving: = Off...[v] ; Images: = PNG-Lossless compression[v] ; [OK] ; [OK] ; Page Handling, Page Scaling: = None[v] ; [ ] Auto-Rotate... ; [OK] } The printer driver will then open a "File Save" dialog asking where to save the .XPS file.
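The one-line notation in the step above is compact but easy to lose your place in. As a rough reading aid, here is a minimal Python sketch that breaks it into a numbered checklist. It assumes only that ";" separates successive actions; it does not interpret the other symbols, and the function name `gcguins_steps` is made up for this illustration.

```python
def gcguins_steps(notation):
    """Split a GCGUINS string into individual actions.

    Assumption: ';' separates successive actions. Nothing else is interpreted.
    """
    body = notation.strip()
    # Drop the surrounding braces if they are present.
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    return [step.strip() for step in body.split(";") if step.strip()]


# The print-to-XPS instructions from the step above.
PRINT_TO_XPS = (
    "{ File / Print ; Printer, Name = Microsoft XPS Document Writer[v] ; "
    "[Properties] ; <Layout> ; [Advanced] ; "
    "Microsoft XPS .../Document Options/ Interleaving: = Off...[v] ; "
    "Images: = PNG-Lossless compression[v] ; [OK] ; [OK] ; "
    "Page Handling, Page Scaling: = None[v] ; [ ] Auto-Rotate... ; [OK] }"
)

# Print the actions as a numbered checklist.
for number, step in enumerate(gcguins_steps(PRINT_TO_XPS), start=1):
    print(f"{number:2d}. {step}")
```

Ticking the actions off one at a time makes it harder to skip the image compression or page scaling settings, which matter for the quality of the result.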

This could take quite some time depending on how much "rendered text" (i.e. selectable text) is in the document. Text that is actually only an image should convert rather quickly, because this process seems to simply move the image portions of the document straight over without any conversion or alteration whatsoever. Though I am not positive, the little bit of poking around I did in the document leads me to speculate that the .XPS printer driver converts each and every character in the document into a vector graphic, similar to an Adobe PostScript file. As you can imagine, this makes for an incredibly large file (see the table below) and takes a really long time. I suggest you start this process and then go off to a long lunch or meeting. If you have a separate computer on which you can run these processes, so much the better.

Convert the .XPS file back into a .PDF file:

Now this step is really going to take a long time, perhaps hours. If you have a large document with lots of "rendered text," I recommend that you start the process before going to bed or before leaving the office for the night. In addition, once you have started this process, it will look as if your computer isn't doing anything at all for almost the entire time. This is because Acrobat® does not display any user interface until it has completed the conversion and has a .PDF document to show.

  1. Right click on the file and choose the appropriate context menu option:
    • Some installations of Acrobat® place an item in the Windows® file explorer context menu (pops up when you right-click on a file) that says "Combine supported files in Acrobat..." when you right-click on any file that Acrobat® knows how to convert to .PDF format. If you see this option in your context menu when you right-click on a .XPS file then choose it because this gives you the most control. (Yes, it works even though you only selected one file.)
      • In the 'Combine Files' dialog, in the lower right corner: Choose the largest document icon to choose the largest file size, and click [Combine Files].
    • If the above option is not available, look for 'Convert to Adobe PDF.' This function will not open any dialog or the Acrobat Pro window until the file has been completely converted. It will look as if your computer is either not doing anything or is locked up. Don't reboot and interrupt the process, as I did the first few times. Just be patient.
    • If you don't see either of the above options then - from the context menu - choose: { Open With / Adobe Acrobat x } or choose { Open With / Choose Program... } and select Adobe Acrobat® from the list. Be sure not to select Acrobat Reader®.
      • (I wouldn't recommend selecting the "Always use the selected program to open this kind of file" option because you only want to open .XPS files in Acrobat when you really want to convert them to .PDF format. If you just want to view the file quickly, you really should just use the XPS viewer. It is a lot faster.)
  2. (Optional) If you had to use either of the last two options above then you may want to double check that things have actually started processing. As I stated earlier, Acrobat® may not display anything for hours. The best way to check on this is to use the Windows® Task Manager.
    1. Right-click in the task bar and choose "Task Manager" (XP) or "Start Task Manager" (Windows 7).
    2. Select the "Processes" tab and look for "acrobat.exe." (If you click the "CPU" column header twice (not double-click) then acrobat.exe should be at about the top of the list.) The acrobat.exe process should be using about 50% of your CPU time.
    3. Now look in the "Memory ..." column (normally the fourth one). acrobat.exe should be using up ever increasing amounts of memory.
      • Believe it or not, that is how you know Acrobat® is processing your file. Essentially, Acrobat® is building up a complete .PDF file in memory before displaying it to you. Considering that the .XPS file has a separate vector graphic for each separate character in the file, that is a lot of data. And, until you do the OCR, all that data is in the .PDF file too.
  3. Go to bed and get some sleep. Research shows this is very important to your overall productivity and health.
  4. Save the file.
    • Acrobat® does not generate a file on disk. It was only generated in memory. You must save the file to disk yourself. Choose an appropriate file name and DO NOT overwrite your original file.
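The Task Manager check in step 2 above can also be done from a script. The following is a minimal Python sketch, not a tested tool: it assumes the standard Windows `tasklist` command is available and that its CSV output has the columns Image Name, PID, Session Name, Session#, and Mem Usage. The function names and the one-minute sampling interval are my own choices for illustration.

```python
import csv
import io
import subprocess
import time


def parse_tasklist_csv(text):
    """Parse `tasklist /FO CSV /NH` output into (image_name, pid, mem_kb) tuples."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        # Lines that do not have the five expected columns, such as the
        # "INFO: No tasks are running..." message, are skipped.
        if len(fields) != 5:
            continue
        # "Mem Usage" looks like "512,340 K"; keep only the digits.
        digits = "".join(ch for ch in fields[4] if ch.isdigit())
        if not fields[1].isdigit() or not digits:
            continue
        rows.append((fields[0], int(fields[1]), int(digits)))
    return rows


def is_still_working(samples_kb):
    """True if the latest memory sample is larger than the first one."""
    return len(samples_kb) >= 2 and samples_kb[-1] > samples_kb[0]


def sample_memory_kb(image_name="acrobat.exe"):
    """Windows only: total memory (KB) of the matching processes, or None."""
    out = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = parse_tasklist_csv(out)
    return sum(mem for _, _, mem in rows) if rows else None


def watch(samples=3, interval_seconds=60):
    """Windows only: sample acrobat.exe a few times and report the trend."""
    seen = []
    for _ in range(samples):
        mem = sample_memory_kb()
        print("acrobat.exe memory (KB):", mem)
        if mem is not None:
            seen.append(mem)
        time.sleep(interval_seconds)
    print("Memory still growing:", is_still_working(seen))
```

Calling `watch()` on the Windows machine doing the conversion prints a few memory readings. Steadily increasing numbers are the same sign of progress described above, and a result of `None` means no acrobat.exe process was found.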

I do have to admit that this conversion does seem to produce slightly blurrier images for scanned documents. It appears that either Acrobat or the XPS driver does a little bit of antialiasing of the jagged edges.

Perform the OCR:

Most people who have used Acrobat® to do OCR know there are three different output styles to choose from: Searchable Image, Searchable Image (exact), and ClearScan. Which you choose depends on the original document and the intended use for the final document.

  1. To select the desired output style and start the OCR process: in Acrobat®: { Document / OCR Text Recognition / Recognize Text using OCR ; Settings, [Edit] ; PDF Output Style = your chosen method as elaborated below [v] ; Downsample Images = Lowest (600 dpi) [v] ; [OK] ; [OK] }
    • (These instructions are in GCGUINS format for concision.)
  2. Save the file using yet another file name.
    • Until you are completely satisfied with the results, you should not delete or overwrite any of these files.

Mostly-Image (scanned) Documents:

Most academics will be dealing with scanned documents, where the "document" is actually just a series of images of pages stored in the .PDF file. These become a problem for OCR when the scanning software did not already do the OCR but did insert some computer printed ("rendered") text, thus causing Acrobat® to choke and show the dreaded "... could not perform recognition (OCR) ..." error dialog.

Now, said academic may want to preserve the original image of the document for later scrutiny or for grabbing snapshots. In that case, said academic should choose the "Searchable Image (exact)" OCR output style. Acrobat® recognizes the text but hides the recognized text behind the image, which it does not disturb at all. This produces a pretty large file. However, if the file was really just a series of images to begin with, then the resulting file may not be much larger than the original.

On the other hand, our imaginary academic may want to produce the smallest possible file size, or may have hopes of producing a file that is easier to read than the scanned original. In this case he or she should choose the ClearScan OCR output style. This causes Acrobat® to replace the image file with a set of custom-generated fonts, designed to look as close as possible to the original fonts, but with clean edges instead of blurry, scanned edges. It is easier to read, but, if Acrobat® guessed wrong for some words while doing its OCR magic, then all you are left with is the bad guess. It also sometimes completely gives up and just places a small image of the word - or just a couple of letters - in the spot where those letters should have gone. It is acceptably readable but it looks weird and those words or letters aren't selectable.

The plain "Searchable Image" output style is a decent middle of the road option, but it does modify the look of the page images because they are compressed. You should experiment to make sure you can tolerate the results.

Mostly-Text Documents:

Some of the documents that cause the "renderable text" error look as if they were generated by a computer ("born digital," as some are saying these days) but either some of the text is not selectable or it is selectable but the copied text is gibberish. Many people suspect this is meant to prevent people from copying any of the document for use elsewhere. It also makes the document practically useless for any academic or business purpose. For these kinds of documents, the .XPS file can be ginormous; ten to twenty times the size of the original .PDF file.

The "Searchable Image (exact)" output style does produce the best looking result - the final document looks exactly like the original - but the final .PDF file size is only slightly less ginormous than the .XPS file. This is because all the vector images of all the individual characters in the document are retained when using this OCR output style. While that isn't a problem for a mostly-image (scanned) document because there is a relatively small amount of "rendered text," it is a nightmare for mostly-text documents because of the vast quantity of individual vectors they contain. So, only use the "Searchable Image (exact)" output style if the document also contains images which you absolutely must retain in their original quality. If the most important images are on separate pages from the text then one could selectively OCR only the pages with text using the ClearScan output style.

I do not recommend the plain "Searchable Image" output style because it produces really poor quality character renderings. It is readable and selectable but it is much more difficult to read than documents produced using either the "Searchable Image (exact)" or the "ClearScan" output style.

The ClearScan output style results in very nice looking text as well as files that are usually less than twice the size of the original, sometimes even smaller than the original. However, the images within the document may not look as good as the originals. Again, some selective OCRing may produce a better result, but that requires more manual labor, which we are trying to avoid.

Comparison Chart:

I have performed this conversion on three different types of pages taken from a mostly-text document: a page with all text, one with some text and a single B&W photograph, and one with some text but also some line drawings. The chart below shows the resulting file sizes. If there is nothing in a cell, that means I didn't think it was worth trying that conversion.

Mostly Text - File Size Comparison
                            Text    Photo   Drawings
Original File:              44k     273k    109k
.XPS File:                  767k    336k    381k
Converted, Pre-OCR:         611k    290k    323k
Searchable Image:           397k
Searchable Image (exact):   597k    287k    334k
ClearScan:                  44k     89k     127k

As you can see, the results vary dramatically. Note, however, that the pages with the most text produced the greatest increase in size when printing to the .XPS file. When I processed a 350-page, mostly-text, 10MB document, the XPS file was 175MB and the final document came out to 15MB using the ClearScan OCR method.
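To make the comparison easier to read, the sizes can be expressed as multiples of the original file. This is a small Python sketch of that arithmetic using only the numbers reported above; the dictionary layout and the `growth` helper are just for illustration.

```python
# Sizes copied from the comparison chart and the 350-page example above.
# The three sample pages are in KB and the full document is in MB; only
# the ratios matter, so the units cancel out.
SIZES = {
    "Text": {"original": 44, "xps": 767, "clearscan": 44},
    "Photo": {"original": 273, "xps": 336, "clearscan": 89},
    "Drawings": {"original": 109, "xps": 381, "clearscan": 127},
    "350-page document": {"original": 10, "xps": 175, "clearscan": 15},
}


def growth(sizes, stage):
    """Size of the given stage as a multiple of the original file."""
    return round(sizes[stage] / sizes["original"], 2)


for name, sizes in SIZES.items():
    print(
        f"{name:18s} XPS x{growth(sizes, 'xps'):.2f}   "
        f"ClearScan x{growth(sizes, 'clearscan'):.2f}"
    )
```

The all-text page and the 350-page document both grow to roughly 17 times their original size as .XPS files, which fits the "ten to twenty times" estimate for mostly-text documents, while the photo page grows only to about 1.2 times.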

I haven't performed similar tests on mostly-image documents at this time. Perhaps I will do so later. Such is the luxury of doing all this only for my own edification and sharing the information completely free (without any ads even).

Hopefully, this article will be a big help for: A) all those students out there trying to OCR all those papers they have collected in their research so they can pull quotes out of them without retyping everything, as well as B) those archivists out there who are trying to make the documents in their collections searchable. Though I have not done so, it should also be possible to write some kind of script that would completely automate this process for batch-processing lots of files at the same time. If this helps you, please let me know. If you have any questions or suggestions, please don't hesitate to contact me.
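As a first step toward the kind of batch script mentioned above, here is a minimal Python sketch that only plans the file names for each stage, so that no step can overwrite the original. It does not drive Acrobat® or the XPS printer driver; those steps would still be done by hand. The naming scheme (`_from-xps`, `_ocr-clearscan`) and the function names are hypothetical.

```python
from pathlib import Path


def plan_outputs(pdf_path, ocr_style="clearscan"):
    """Return the transitional and final file names for one source PDF.

    The naming scheme is hypothetical. The only rule taken from the
    article is that no step may overwrite the original file.
    """
    src = Path(pdf_path)
    plan = {
        "original": src,
        "xps": src.with_suffix(".xps"),
        "converted": src.with_name(f"{src.stem}_from-xps.pdf"),
        "ocr": src.with_name(f"{src.stem}_ocr-{ocr_style}.pdf"),
    }
    # Every path must be distinct so that no step can clobber another file.
    if len(set(plan.values())) != len(plan):
        raise ValueError(f"output names collide for {src}")
    return plan


def plan_batch(folder):
    """Plan outputs for every PDF in a folder, skipping files we generated."""
    plans = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        if "_from-xps" in pdf.stem or "_ocr-" in pdf.stem:
            continue
        plans.append(plan_outputs(pdf))
    return plans
```

For example, `plan_outputs("scans/paper.pdf")` yields `paper.xps`, `paper_from-xps.pdf`, and `paper_ocr-clearscan.pdf` alongside the untouched original, which matches the advice to keep every intermediate file until you are satisfied with the result.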


How to remove Renderable Text from .PDF files to allow OCR by Grant Sheridan Robertson is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Permissions beyond the scope of this license may be available here.

60 comments:

  1. Thanks Grant for taking the time to document the process in such detail. So sorry to report that despite diligently following the steps, the "...renderable text" error message is replaced by "Could not perform recognition (OCR) on this page because: This page has graphics other than images or text on it. It cannot be captured".

    I have tried specifying different output styles and starting from scratch (deleting the transitional files) a number of times - the latter because I noticed that after right-clicking and converting the .XPS file, doing it again (accidentally or deliberately), did nothing - even if I deleted the .PDF created the first time.

    Using Process Explorer/Task Manager didn't provide any clues as no sign of Acrobat chugging away. Reboot required before retry of .XPS to .PDF conversion worked.

    Any/all suggestions gratefully received. Thanks

  2. daud,

    I don't have a lot of time to check this out this morning and I don't currently have any documents with this problem, so all I can give you is a suggestion off the top of my head:

    Make a copy of the document (after XPS conversion) and then try using the { Tools / Advanced Editing / Touchup Object Tool } to select and delete some of the graphics, and see if the OCR will work then. By the wording of the message you received, it seems that the "offending" graphic is something that is drawn with vector graphics rather than a raster image.

    After you have found the graphic(s) that block OCR you could open the original document and try to copy and paste the graphics back into your OCRed file. You have to have the Touchup Object Tool selected in both documents to complete the copy and paste.

    I know this is incredibly tedious, but I can think of no other way to accomplish this and still preserve the quality of the "offending" graphics. Of course there is still always the convert to TIFF and back method but that will rasterize and pixelate your graphics.

    I hope this helps.

  3. I'm not the original "anonymous" but I didn't have success either. The document I converted back to pdf still had renderable text in it (although not as much as it did originally) and after OCR recognition was completed, the remaining text was so blurry it could not be read.

  4. Anonymous 2:
    You need to tell me more about your document and what you did. Was it a scanned document or "born digital"? Also, try this with your original document opened in the latest version of Acrobat Reader: { File / Print ; Printer, Name = Microsoft XPS Document Writer[v] ; [Properties] ; <Layout> ; [Advanced] ; Microsoft XPS .../Document Options/ Interleaving: = Off...[v] ; Images: = PNG-Lossless compression[v] ; [OK] ; [OK] ; Page Handling, Page Scaling: = None[v] ; [ ] Auto-Rotate... ; [OK] }

    The above instructions are in GCGUINS format for brevity. (http://www.ideationizing.com/2009/06/grants-concise-gui-notation-system.html)

    I will add this to my instructions above.

  5. I have updated the instructions, including the section on converting back to .PDF format.

  6. I had a similar problem while recognizing an 826-page document. Only one page gave me trouble, but it didn't give a "reason:". What I did to get it to work was incredibly simple: I just used the crop tool, selected the entire page, and performed the crop (on just the single stubborn page). Then I performed the OCR on it and it worked perfectly.

    Replies
    1. That did it for me (Adobe X): click on Tools, click on Crop, use the mouse to select the whole page, click Tools, click OCR, and voilà! Thanks to JonnyPhenomenon!

  7. That is an incredible tip, Jonny. I have had similar experiences with other software "back in the day" but not recently. Sometimes software just doesn't handle certain patterns of data sequences within their own data. It will read the file and not raise any red flags. But once it tries to do a certain function then it chokes on just a few bytes that are in a sequence it doesn't expect. Rather than pop up a dialog and ask what you want to do, the software just chokes. I had thought Adobe had learned better than this by now.

  8. Hi Grant,
    I had this problem as well. But after printing to an XPS file, instead of converting the file back to PDF, I launched Acrobat and opened the XPS file from within Acrobat, after which I performed the OCR and it turned out OK. I must say it was a small 6-page document, so maybe that is why it worked that way.
    Anyway, just to say congratulations on the article, and please keep doing this useful work.
    From Portugal...

  9. Thank you for the article. It inspired me to use Automator on my Mac to basically create the workflow you described. It seems to have helped with the problem so far!
    Just thought I'd give the Automator shout-out for Mac users who may have stumbled here via Google looking for a solution (like me!)

    Best,
    Michelle

  10. Thank you for sharing your wonderful idea for this issue.
    I tried this idea on a scanned PDF file and ran into a couple more issues. Here are the details:
    1. The company logo is not converted into a readable format.
    2. The body text alignment changed. (Lines run under other lines.)
    3. After OCR using this method, a few words cannot be selected cleanly.
    4. When I paste the OCRed text into Notepad, the line breaks are missing.

  11. Thanks Grant. I have encountered the same problem using another software - Nitro PDF.
    Thank you Grant for this insightful and detailed technique. I tried this with the free trial of Nitro, using their create PDF function. Unfortunately, when I chose the XPS file and clicked create, it stopped the process after a few seconds saying 'a problem was encountered in PDF conversion' and it does that each time I try.

    I do not actually know what to do now. Perhaps I should just enter data from the tables I scanned manually, especially if you noted that removing renderable text - during the conversion process - may take hours.

  12. Just after posting my earlier comment I found a solution in my Nitro Professional free trial:
    1. Go to PRINT and select the FreePDFXP (version 1.2, I don't know where this program came from, just found it on pc at work, but I guess it's free and works).
    2. FreePDFXP will prompt you to choose a file name and location for this new pdf file.
    3. Go back in Nitro and open this newly made pdf file.
    4. Now click on the recognize text OCR button and it works like a charm!

    Thanks to this thread, I found this out!

  13. I have a 173MB file, 2500+ pages. After conversion to XPS, all attempts to save as PDF again failed. "This file is too big..."

    Possibly break this up and reassemble later???

    Thanks...

  14. Jackson, you are expecting too much from Acrobat and OCR in general. It was never intended to be a "perfect" OCR utility which preserves formatting etc. It is a program for generating documents that can be viewed on most any platform. The OCR is essentially just thrown in for good measure. It is pretty good but don't expect miracles.

    If you want better OCR then get a program made specifically for OCR such as OmniPage Professional from Nuance.com but expect to do a lot of work to get what you are asking for. I used to try to make things perfect but I had to learn to settle for "good enough" for the sake of my sanity and, well, having enough time left to do what I was working with the documents for in the first place.

  15. Cipher, it sounds as if you have answered your own question.

  16. To all those who have been posting here asking for help:

    By posting this technique I have in no way claimed to be an expert in the Adobe Acrobat file format. I just figured out a cheap and relatively easy - if time consuming - technique for stripping out the poorly recognized or intentionally garbled text. I honestly don't know much more than what I have already posted here in this blog post. And what I do know, I figured out by experimentation because I was desperate (and a little bored).

    So, my best advice is for people to follow one of the primary rules for asking questions on-line: Make sure you have done your homework before you ask. Do some experimentation and try things out on your own. Only after you have exhausted all possibilities and can explain what you have tried and why you think it didn't work, then come ask questions. Not that I don't like answering questions. I love answering questions. But I am not going to - and likely can't replicate your situation well enough to - do the experimentation for you to figure out what you could have figured out on your own. Besides how would you learn anything that way? I know what I know, and have gotten smart enough to think of all the other things on this blog by doing two things: reading A LOT, and beating my head against things long enough to get a feel for how things work and how to make good guesses.

  17. This was a very helpful tip, thank you. My text was slightly less readable after conversion, but it allowed me to get good OCR out of it. It wouldn't be the best if I was looking to publish/share this document, but for our purposes internal to our organization this was a great fix.

    Thanks for taking the time to put this up here.

  18. Anna,
    First, make sure you have followed the instructions to the letter. There are many ways along the way to end up modifying the image, thus degrading it. When performing the OCR, the "Searchable Image (exact)" method modifies the image the least.

    In the end, one can always keep both versions of the file: The original for printing and the converted & OCRed version for searching and annotating.

    I am glad this trick has helped you.

    Grant

  19. For whatever reason, this didn't work for me. But I was able to copy and paste the .pdf information/table into Word, and then from Word, copy and paste into Excel. I'm posting this because maybe it'll help someone out.

  20. You are genius Grant. Thank you! Thank you! Thank you. Be my guest any time.

  21. For Mac users, this works:
    1) Open pdf in Preview
    2) Export as a tiff
    3) Open tiff in Preview
    4) Print to a pdf file
    5) Open in Acrobat Pro and perform OCR

    Replies
    1. To: Anonymous from Feb 6, 2012

      Yes, that is the standard technique recommended by Adobe. Unfortunately, that technique often results in degraded image quality due to conversion of the image format. PDF files often store the images in JPG format, thus the above technique converts the image from JPG to TIFF and back to JPG. The technique I outline above usually results in better image quality. I suspect this is because there is no conversion of image file formats.

      I did some poking around and it does not seem as if there is a version of the "Microsoft XPS Document Writer" printer driver for Mac OS. So, it seems Mac users are stuck with the Adobe recommendation. Oh well, I do hear that many Mac machines come with a Windows emulator, so maybe people could use that to enable them to use my technique.

  22. Great tutorial. Thanks very much. Actually, I found that the process was much quicker than suggested. 250 pages of text took about 30 secs to convert to XPS and then about 7 mins to convert back to pdf.

  23. This method saved me so much time. I was able to search a large document after following these steps. Thank you!!

  24. found it faster to just print to PDF and OCR the resulting document.

    Replies
    1. @Anonymous from June 8,

      That technique does work fine IF the original document A) allows printing to PDF (some don't), B) is a "digital native" in that it was created directly on a computer, and thus has really clear text, and C) does not have any images that may be degraded by printing to .PDF. Believe it or not, I found that the documents I was working with ended up with better looking images by using the XPS round trip method rather than printing to PDF from Acrobat.

      So, yes, a user may want to at least try simply printing to .PDF first, before trying this much more time consuming technique. Thanks for reminding me about that. I had neglected to at least mention it in my post.

  25. Every once in a while I get a comment that essentially says, "I have found an even easier solution. Just save to TIFF and then re-import that." To which I say, "Ummm ... re-read the first paragraph."

    If the save to TIFF and re-import trick espoused by Adobe worked perfectly then I certainly wouldn't have gone to all this trouble. The problem with that trick is that it often forces two complete re-encodings of the image that comprises the page. The images in .PDF files are usually stored in .JPG format. So to save as a .TIFF forces one re-encoding and then reading the file back into Acrobat forces the images to be re-encoded back into .JPG format. That introduces a lot of noise into the image. Now, if you start with a pristine image then you may not notice. However, a less than pristine, older scan may not fare so well after all this decoding and encoding. I chose my particular method because I suspect that the "print to XPS" driver just uses the image format that is already in the original .PDF file. I tried several experiments and could not discern any image degradation after a full export and re-import operation.

    Now, once you do the OCR, using any method other than 'Searchable Image (exact)' WILL degrade the image a little bit. Only you can determine how much final degradation is acceptable to you. However, starting with as little degradation as possible BEFORE you start the OCR seems to be a good idea to me.

  26. On a Mac I found I could get around this problem by opening the PDF file in Preview, printing the file afresh to PDF to create a new PDF document, then using Adobe Acrobat X Pro to run OCR.

  27. Davo,
    I tried that method with my version of Acrobat Pro for Windows and, for most of the documents where I had this problem, it would not let me use that workaround. It is nice to know that it does work with Acrobat Pro for Mac. It might even work with newer versions of Acrobat Pro for Windows. I wouldn't know. I have an older version. So, it is at least worth a try. If Acrobat doesn't want to print to the Acrobat printer driver, it will pop up an error dialog right away, so you don't really waste any time just trying it.

    Thanks for the tip.
    Grant

  28. I had the following error when I tried to run the OCR - Could not perform OCR - The image resolution is below the minimum 72dpi.

    I don't understand why, the pdf seems crystal even under massive magnification. The work-around I used was to PRINT the xps to pdf rather than using the right click, convert to pdf.

    I'm not sure why there is a difference, but probably best not to question when it comes to Adobe.

    Replies
    1. Printing to .PDF from .XPS would run the file through an extra layer of processing and thus change the image in various possible ways.

      When you originally "printed" to .XPS did you make sure that all of the settings were exactly as described above? If so, perhaps you could redo the whole thing and double-check that there is nothing set in either your Adobe Acrobat settings or in the .XPS printer driver settings to convert the image to 72dpi. I poked around and could only find something under { File / Print ; [Advanced] ; [ ]Print As Image } (see http://www.ideationizing.com/2009/06/grants-concise-gui-notation-system.html for an explanation of the previous, if it is not clear.) This is on my Windows 7 machine.

      Anyway, thank you for posting that workaround. I am sure it will be helpful to someone.

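For anyone puzzled by the 72dpi message in the comment above: the resolution of an image inside a .PDF depends on how many pixels the image has and how large it is drawn on the page, not on how sharp it looks on screen. The Python sketch below only illustrates that arithmetic. The function name `effective_dpi` and the sample pixel widths are made up for this example, and it is an assumption that Acrobat's check works along these lines; the one fixed fact used is that a PDF point is 1/72 of an inch.

```python
POINTS_PER_INCH = 72.0  # PDF page units: 1 point = 1/72 inch


def effective_dpi(pixels: int, placed_size_points: float) -> float:
    """Pixels per inch of an image drawn across placed_size_points on the page."""
    if pixels <= 0 or placed_size_points <= 0:
        raise ValueError("pixels and placed_size_points must be positive")
    return pixels / (placed_size_points / POINTS_PER_INCH)


# Hypothetical scans stretched across a US Letter page (8.5 in = 612 pt wide).
for width_px in (600, 1700, 2550):
    dpi = effective_dpi(width_px, 612)
    status = "below 72" if dpi < 72 else "72 or above"
    print(f"{width_px} px wide -> {dpi:.1f} dpi ({status})")
```

If a conversion step shrinks the pixel count or enlarges the page size, the computed figure drops, which is one possible explanation for a file that looks fine but is reported as low resolution.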
  29. I have tried this technique on several problem PDFs to try to find a "better" solution. The results are less than satisfying. What is the advantage to printing to XPS over exporting to JPG? I typically end up converting to JPG and back to PDF which works more often than the print to XPS. Also, the final file is larger when printing to XPS than for exporting to JPG, and JPG seems to preserve bookmarks whereas XPS printing did not. Either way, it involves a long process of dumping the document to an image and then recomposing to PDF.

    Note for those having difficulties: sometimes after dumping to JPG and recombining to PDF, I still have to print the PDF (before or after) to PDF (PDF Print Driver) to prevent Acrobat from crashing. Also, there is the occasional difficult page: I import the JPG image into PSP and print to Adobe Print Driver to fix the individual page. JPG also allows resizing the individual pages before recombining, which is handy.

    I briefly tried TIFF but did not see the advantage over JPG.

    I'm using Acrobat 8.3.1; I tried Acrobat XI trial but it crashed/gave errors the same as Acrobat 8, making me believe that the OCR software has not been updated (at least not in this area I am dealing with).

    Hope that helps. I'm still looking for a better/smaller solution, but have not found one.

    PS: Is it necessary to eliminate the two error messages, "This page contains renderable text" and "Unable to process the page because the Paper Capture recognition service experienced an error. (0)"? Does either error mean that OCR failed or that it just passed by the text and OCRed what it could, and these can be ignored? Thanks!

    Replies
    1. The reason I do not suggest printing to .JPG is that .JPG is inherently a "lossy" file format. EVERY time a .JPG is saved it compresses the image (sometimes more, sometimes less) and loses some information and clarity. So, yes, .JPG files are smaller, but that comes at a cost. I chose .XPS because A) it is available to everyone for free and B) it seems the least likely to modify the image. From what I can tell it only copies the image directly from the .PDF format, unscathed.

      P.S. You should be able to answer that question yourself. You are the one seeing the error messages. You are the one looking at the file. How are we to know better than you whether the file has been OCR'd?

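The point about lossy formats in the reply above can be illustrated with a toy model. The Python sketch below is not real JPEG (which also involves a frequency transform and usually chroma subsampling) and says nothing about what Acrobat or the XPS driver actually do. It only treats a "lossy save" as rounding each pixel value to a multiple of some step, with the hypothetical helper names `quantize` and `max_error` and step sizes chosen purely for illustration.

```python
def quantize(values, step):
    """Toy lossy save: round each 0-255 value to a multiple of step, clamped to 255."""
    return [min(255, (v + step // 2) // step * step) for v in values]


def max_error(original, processed):
    """Largest absolute difference between corresponding values."""
    return max(abs(a - b) for a, b in zip(original, processed))


original = list(range(256))               # every possible 8-bit gray level
saved_once = quantize(original, 10)       # first lossy save
saved_same = quantize(saved_once, 10)     # saved again with the same setting
saved_mixed = quantize(saved_once, 7)     # saved again with a different setting

print("same setting changes anything:", saved_same != saved_once)
print("worst error after one save:", max_error(original, saved_once))
print("worst error after mixed saves:", max_error(original, saved_mixed))
print("worst error with step 7 alone:", max_error(original, quantize(original, 7)))
```

In this toy model, saving twice with the same setting adds no further error, but passing through two different settings leaves a larger worst-case error than either setting alone. That is the kind of compounding a chain of export and re-import steps can cause when each tool picks its own compression settings.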
  30. Forget about TIFF/JPG or printing to XPS. Just print the PDF to the Acrobat print driver with settings (advanced) "as image". Be sure that print settings will use the existing page size or else larger pages will be cropped. I set the dpi to 300. After printing, the document will be ready to be OCRed by Acrobat. This solution makes smaller images (but, if you use OCR "Searchable Image (exact)" it will retain existing image size). It also "fixes" all sorts of issues I've encountered when I used to dump the PDF to JPG and convert back to PDF. I'm using Acrobat 8.3.1 and have had no problems with newer PDF formats using this method.

    Replies
    1. Again, I do not suggest this method because it modifies the image. It may be "good enough" for you. But I do not prefer it. For older scanned images, where the scan may be just barely clear enough for OCR, this could push the image over the threshold toward not being "OCRable."

      Remember, as I have said before, this is not THE solution. It is merely A solution.

  31. Hello,
    I have the same problem here. I have a document originally from Illustrator, saved as PDF. When I tried to do the "Optimize Scanned PDF" I got the line "Pages contains renderable text". My document has little text on it, only the title; the rest is images and vectors. What can the problem be? I already did the recommended steps, exactly as described, but the new PDF file still fails with the same line, "Pages contains renderable text" :(

    If somebody can suggest something I would be very happy. Thank you!

    tini

    Replies
    1. You do not need to use the "Optimize Scanned PDF" if the file was not scanned. If the file was originally from Illustrator, then it is what they call "born digital" and was never scanned. Unless, of course, you printed the file from Illustrator and then scanned it for some silly reason. In which case, the fact that it was originally created in Illustrator is absolutely moot.

      If your document only has a tiny bit of text on it, then why are you concerned with "OCRing" it at all?

  32. I had this problem today, and the above solution didn't work for me (it usually does work). The solution was that I had to go to Tools > Protection > Remove Hidden Information. It seems the file I had was encoded with a hidden watermark, and I needed to remove that to OCR it (I'm not pirating it or anything - I just had to run OCR because it was terribly done by somebody else, and my iAnnotate highlighter works better with a properly OCRed file). Just wanted to share, in case anybody else had this variation of the problem.

    Replies
    1. This is a very good suggestion. Thank you. It seems people keep coming up with creative ways to try to prevent people from performing OCR on their documents. That just seems a little greedy to me.

    2. ANONYMOUS IS AWESOME!!! Oh wow. I've been hitting my head against my computer for at least a week trying to wrestle some text out of these old mainframe-generated PDFs with "Renderable Text". Printing to XPS/converting to PDF did nothing as the "Renderable Text" still remained.

      One swipe of "Remove Hidden Information", finding some overlay of lines hidden in the file, and everything is perfectly OCR'd!

      Thank you Grant, thank you Anonymous, wahoo!!!

  33. wow thank you. I just did a test page and it worked, so exciting. I wonder how it will work with the 500 page book.

  34. Very nice, used the article advice, and it worked. Thanks.

  35. Thank you! You saved the day (and a whole lot of work for me!)

  36. I was able to convert from .xps to .pdf by opening the .xps file, using the Print function, and selecting PDF as the printer.

    Replies
    1. I specifically avoided this method in my solution because it guarantees that the images will go through yet another conversion. By importing the files directly into Acrobat Pro, I believe Acrobat is not converting the images and is simply bringing them in, in their original format. My entire solution is designed to reduce the chance that any software converts the image files.

  37. Worked for me!! Thanks!

  38. It's working! Thanks a lot!

  39. Cool beans! Thanks!

  40. Thanks a lot! I couldn't find a solution anywhere else, you saved the day!

  41. Awesome!! I have a curriculum in PDF that I wanted to create a study guide from, point by point. I have been working on this all morning and after finding your post, done in minutes!! Thanks!!

  42. After 5 hours of trying to recognize text in a 166 page document (only a portion would be recognized) I found your post. Can't tell you how much I appreciate this...it worked!!

  43. Thank you very much, Grant, for your post. I tried to get help from the Acrobat user forum with no luck, but your suggestion worked for me on the first try. RAAK

  44. Hi Grant, how can I edit a scanned PDF document with renderable text?

    Replies
    1. That depends upon what you mean by "edit." If you just want to change a word or two, then you can simply use the editing features in Adobe Acrobat Pro. No, I am not going to give you a tutorial on how to do that. Look it up in the help files. If you want to edit it as if it were a Word document, then you can use any of the many programs that can convert .PDF files to Word documents, including recent versions of Word itself. Keep in mind that the formatting will not be identical to the original .PDF, regardless of what the manufacturer of the program says. If the file you get after doing the conversion is garbled, then perform the procedure described in this post and THEN try converting it to a Word document.

      Good luck. That is all the help I can give you. Please remember, this is not a tech support forum.

  45. Thank you Mr. Robertson. You have been a patient man for nearly 4 years on this issue. I appreciate all the time you have devoted to this problem, and I hope I speak for all those out there who are so frustrated with Adobe over this.
