I was reading about the reCAPTCHA™ project this morning. On their page about High Transcription Accuracy I saw that some words are simply digitized so poorly that even humans can't make them out. However, I noticed that they kind of give up there. If humans can't make out the whole word then the transcription just keeps some garbled nonsense word and leaves it at that. For instance, look at the last error in their example. I have included screen shots from their page for easier comparison:
OK, now I want you to notice a few things:
- The original word is clearly one word with a comma after it but the transcription shows a word and a letter separated by a space with no comma.
- In the original word it is pretty darn clear that the first letter is a capital 'R' but that is not included in the transcribed word.
- In the context of the rest of the document the word is obviously the name of a yacht.
- It is pretty obvious that the second and third letters are not the same, yet in the transcription they are both 'u's.
- The second letter in the original looks like an 'a' but it definitely does not have an ascender.
- The third letter definitely does have an ascender and yet the letter in the transcribed word does not.
- Depending on what the third letter is, the mark to its right may be part of that third letter or it may be a completely separate letter. One thing is for sure, that mark is at exactly the same angle as the third letter.
- The last letter is definitely an 'r' followed by a comma.
- The two marks before the 'r' are either a single letter with a left ascender, such as a 'b' or 'h', or they are two letters: most likely a lower-case 'L' followed by a letter that curves on the left and has no ascender.
- The space between the last 'r' and that tall letter before it seems to be just a little too wide for the letter before it to be something like a 'b' or 'h'. It seems more likely that there are two letters before the 'r'. Something like 'le' or 'lc' perhaps.
- When was the last time you saw a word that ended in 'br' or 'hr'?
As a matter of fact, after really looking at the word and seeing it in context, my best guess is that the word is 'Rattler,' (with the comma).
So I have a few suggestions as to how reCAPTCHA™ can improve their digitization:
- For these hard-to-decipher words, they should include more of the context surrounding the target word. People often need context in order to correctly transcribe a word. Without the context we have far less to go on.
- Include a picture of a much larger part of the original scan and highlight the part that is in question. This will give users a better feel for how certain letters show up in that scan. Different books use different fonts and different typesetters have different levels of consistency in how the type was placed during printing. Humans can intuitively use that information to help them figure out a word or a letter.
- Include a few sentences of the text that has been transcribed with high confidence surrounding the target word. This will give humans a context within which to work. Only by looking at the whole sentence was I able to determine that the word in question was the name of a yacht. That narrows down the list of possibilities considerably.
- We need an XML standard for marking up these 'iffy' words so that we can at least capture and store what information we do have about them. Then, even if we can't figure out exactly what the word is, intelligent search engines can locate it as a possible match to something else.
- For instance, someone may be searching for a yacht named 'Rattler.' No search engine would have matched that to 'buub r.' However, if the search engine knew that the word in this document started with a capital 'R', ended with a lower-case 'r', had from five to seven letters in it, had two or three letters with ascenders, and was likely the name of a yacht, then the engine could show it as a possible match. If the search engine were integrated with the reCAPTCHA™ engine, then once the user had determined that the word very likely was 'Rattler', reCAPTCHA™ could update its information about the word and make it even easier to find and transcribe later.
I do not propose to devise that XML standard here but it should at least be able to do what I have mentioned here as well as list all the most likely choices as entered by reCAPTCHA™ users. As it stands all we get is a garbled aggregate of what people guessed based on the severely limited information they had to work with. Not that reCAPTCHA™ isn't genius and doing wonderful work. But in these situations there is definitely room for improvement.
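Purely as an illustration, and not as a proposed standard, here is a rough Python sketch of how such a record might be stored and searched. Every element and attribute name in the XML (`uncertain-word`, `first-letter`, `length`, `ascenders`, `context-hint`, `guess`) is made up for this example, as are the function names. The set of "ascender" letters is also an assumption, and capital letters are deliberately not counted as ascenders.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for one 'iffy' word. None of these element or
# attribute names come from any real standard; they are invented here.
UNCERTAIN_WORD_XML = """
<uncertain-word>
  <first-letter>R</first-letter>
  <last-letter>r</last-letter>
  <length min="5" max="7"/>
  <ascenders min="2" max="3"/>
  <context-hint>name of a yacht</context-hint>
  <guess>buub r</guess>
</uncertain-word>
"""

# Assumption: lower-case letters that have an ascender in most typefaces.
ASCENDER_LETTERS = set("bdfhklt")


def count_ascenders(word):
    """Count lower-case ascender letters (capitals are ignored)."""
    return sum(1 for ch in word if ch in ASCENDER_LETTERS)


def is_possible_match(record_xml, query):
    """Return True if the query fits everything the record knows."""
    record = ET.fromstring(record_xml.strip())
    length = record.find("length")
    ascenders = record.find("ascenders")
    return (
        query.startswith(record.findtext("first-letter"))
        and query.endswith(record.findtext("last-letter"))
        and int(length.get("min")) <= len(query) <= int(length.get("max"))
        and int(ascenders.get("min"))
        <= count_ascenders(query)
        <= int(ascenders.get("max"))
    )


def recorded_guesses(record_xml):
    """List the transcriptions users have entered for this word."""
    record = ET.fromstring(record_xml.strip())
    return [guess.text for guess in record.findall("guess")]


if __name__ == "__main__":
    for candidate in ["Rattler", "Ranger", "buub r"]:
        print(candidate, is_possible_match(UNCERTAIN_WORD_XML, candidate))
```

With a record like this, a search for 'Rattler' fits every stored constraint and could be offered as a possible match, while the garbled 'buub r' on its own never would be. A real system would need fuzzier scoring than this all-or-nothing check, since any one of the stored observations could itself be wrong.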
Again, as they say on Usenet, "I hope that helps."
The contents of this post are Copyright © 2009 by Grant Sheridan Robertson. However, reCAPTCHA™ is welcome to use these ideas to improve the quality of their fine work.