
Monday, November 23, 2009

Spiderman?

I am often anxious about everything that will be involved when I finally start getting some attention for DEMML and start actually implementing it. When I feel this way I am reminded of a classic Spiderman line and encourage myself with my own modification:

With a great idea comes great responsibility.


The contents of this post is Copyright © 2009 by Grant Sheridan Robertson.

Conflating Two Freedoms

The freedom of the people should always trump the freedom of businesses to make money. Conflating the two inevitably leads to reversing them.


The contents of this post is Copyright © 2009 by Grant Sheridan Robertson.

Saturday, November 21, 2009

A word to describe me.

In my graphic design class one of my first assignments was to take a word that I felt described me and stylize it in some aesthetically pleasing way. Most of the other students quickly sat down, typed a word, tried different fonts, and then used some of the fancy tools in Adobe Illustrator to mess with the outline of the word and add drop shadows and such. I went home to think.

I really don't believe that I can be described in only one word. On top of that, most people see me one way based on my outward appearance but I think I am actually quite different and more nuanced than they usually think. Sure, most everyone thinks they are unique. But I am a pretty unassuming guy and I tend to get pigeon-holed quite a lot. And, if you have read any of the other posts on this blog, you will see that I am not your average Joe either. So I decided that people usually think I am predictable while I feel that I am actually pretty indecipherable. So I set out to design a graphic that made that point. Here is what I came up with:


Predictable - Indecipherable

(You can click on the picture for a full sized view.)
At first glance it looks as if the word is "Predictable." But, if you look closely, you can find the word "Indecipherable."

In critique, my teacher said that, although the design was simple, it was the only one that was actually "designed," and that was what the class was really about. I thought that was pretty cool.

The contents of this post is Copyright © 2009 by Grant Sheridan Robertson.

Thursday, November 5, 2009

HTMLzip

You know how some "books" are published as a folder full of HTML files with an index.html at the root of that set of folders? That makes for a heck of a lot of highly compressible files just sitting there on your hard drive uncompressed. This is necessary because browsers can't see into .ZIP files. Well, I say, why the heck not? The compression algorithms seem to be everywhere except in the browsers. We could zip up a folder full of HTML files (and their accompanying images, etc.), give it an extension like .htmlzip, and then just point the browser to that file. It would open the index.html file by default and there you go: an HTML book all in one file, simply by zipping it up and changing the extension.

I know there are programs that will convert a set of HTML files to a .chm help file and various other formats. But these are often proprietary and platform-specific. This would provide a completely open, cross-platform, and really convenient way to do the same thing.
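The packing half of this idea can already be tried with ordinary tools. Here is a minimal sketch in Python, using only the standard zipfile module. The function names pack_htmlzip and read_page are made up for illustration, and the .htmlzip extension is just the one proposed above. No browser does this today; the sketch only shows that a page can be pulled out of the archive without unpacking the whole thing to disk.

```python
import zipfile
from pathlib import Path


def pack_htmlzip(source_dir, archive_path):
    """Zip every file under source_dir into one archive, keeping relative paths.

    Hypothetical helper: archive_path should live outside source_dir so the
    archive does not try to include itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Store forward-slash relative names so links between pages still resolve.
                archive.write(path, path.relative_to(source).as_posix())


def read_page(archive_path, page="index.html"):
    """Return one page from the archive as text, defaulting to index.html."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(page).decode("utf-8")
```

A browser that understood the format would do the equivalent of read_page for index.html first, then for each link the reader follows.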


The contents of this post is Copyright © 2009 by Grant Sheridan Robertson.
However, anyone is welcome to incorporate this idea into their browser. In fact, please do. Thanks.

Tuesday, November 3, 2009

reCAPTCHA Suggestions

I was reading about the reCAPTCHA™ project this morning. On their page about High Transcription Accuracy I saw that some words are simply digitized so poorly that even humans can't make them out. However, I noticed that they kind of give up there. If humans can't make out the whole word then the transcription just keeps some garbled nonsense word and leaves it at that. For instance, look at the last error in their example. I have included screen shots from their page for easier comparison:

Unsolved reCAPTCHA errors
Original: [image: original scanned word]
"Solution": buub r

OK, now I want you to notice a few things:

  • The original word is clearly one word with a comma after it but the transcription shows a word and a letter separated by a space with no comma.
  • In the original word it is pretty darn clear that the first letter is a capital 'R' but that is not included in the transcribed word.
  • In the context of the rest of the document the word is obviously the name of a yacht.
  • It is pretty obvious that the second and third letters are not the same, yet in the transcription they are both 'u's.
  • The second letter in the original looks like an 'a' but it definitely does not have an ascender.
  • The third letter definitely does have an ascender and yet the letter in the transcribed word does not.
  • Depending on what the third letter is, the mark to its right may be part of that third letter or it may be a completely separate letter. One thing is for sure, that mark is at exactly the same angle as the third letter.
  • The last letter is definitely an 'r' followed by a comma.
  • The two marks before the 'r' are either one letter with a left ascender, such as a 'b' or 'h', or they are a single tall letter, most likely a lower-case 'L', followed by a letter that curves on the left and has no ascender.
  • The space between the last 'r' and that tall letter before it seems to be just a little too wide for the letter before it to be something like a 'b' or 'h'. It seems more likely that there are two letters before the 'r'. Something like 'le' or 'lc' perhaps.
  • When was the last time you saw a word that ended in 'br' or 'hr'?

As a matter of fact, after really looking at the word and seeing it in context, my best guess is that the word is 'Rattler,' (with the comma).

So I have a few suggestions as to how reCAPTCHA™ can improve their digitization:

  • For these hard-to-decipher words, they should include more of the context surrounding the target word. People often need context in order to correctly transcribe a word. Without the context we have far less to go on.
    • Include a picture of a much larger part of the original scan and highlight the part that is in question. This will give users a better feel for how certain letters show up in that scan. Different books use different fonts and different typesetters have different levels of consistency in how the type was placed during printing. Humans can intuitively use that information to help them figure out a word or a letter.
    • Include a few sentences of the text that has been transcribed with high confidence surrounding the target word. This will give humans a context within which to work. Only by looking at the whole sentence was I able to determine that the word in question was the name of a yacht. That narrows down the list of possibilities considerably.
  • We need an XML standard for marking up these 'iffy' words so that we can at least capture and store what information we do have about them. Then, even if we can't figure out exactly what the word is, intelligent search engines can locate it as a possible match to something else.
    • For instance, someone may be searching for a yacht named 'Rattler.' No search engine would have matched that to 'buub r.' However, if the search engine could know that the word in this document started with a capital 'R', ended with a lower-case 'r', had from five to seven letters in it, that two or three of the letters had ascenders, and that it was likely the name of a yacht, then the engine could show it as a possible match. If the search engine were integrated with the reCAPTCHA™ engine, then once the user had determined that the word very likely was 'Rattler', reCAPTCHA™ could update its information about the word and make it even easier to find and transcribe later.

I do not propose to devise that XML standard here, but it should at least be able to do what I have described above, as well as list all the most likely choices as entered by reCAPTCHA™ users. As it stands, all we get is a garbled aggregate of what people guessed based on the severely limited information they had to work with. Not that reCAPTCHA™ isn't genius and doing wonderful work. But in these situations there is definitely room for improvement.
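Just to make the idea concrete, here is a rough sketch in Python of what such markup and a matching search might look like. Every element and attribute name below is invented for illustration and is not part of any real standard or of reCAPTCHA™ itself. It records only some of the clues listed above (first letter, last letter, length range, context, and the guesses so far) and leaves out the ascender counts.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: all element and attribute names are invented for this sketch.
UNCERTAIN_WORD = """
<uncertain-word context="name of a yacht" min-length="5" max-length="7">
  <first-letter>R</first-letter>
  <last-letter>r</last-letter>
  <trailing-punctuation>,</trailing-punctuation>
  <guess source="aggregate">buub r</guess>
  <guess source="reader">Rattler</guess>
</uncertain-word>
"""


def could_match(word_xml, query):
    """Return True if query fits the recorded constraints on the unclear word."""
    word = ET.fromstring(word_xml)
    # The length range comes from the attributes on the root element.
    length_ok = int(word.get("min-length")) <= len(query) <= int(word.get("max-length"))
    return (
        length_ok
        and query.startswith(word.findtext("first-letter"))
        and query.endswith(word.findtext("last-letter"))
    )


def recorded_guesses(word_xml):
    """Return every guess stored for the word, in document order."""
    return [guess.text for guess in ET.fromstring(word_xml).findall("guess")]
```

With markup like this, a search for 'Rattler' would pass could_match even though the stored transcription is still 'buub r', which is the whole point. A real standard would need to weigh the clues rather than treat them as hard pass/fail rules.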

Again, as they say on Usenet, "I hope that helps."


The contents of this post is Copyright © 2009 by Grant Sheridan Robertson. However, reCAPTCHA™ is welcome to use these ideas to improve the quality of their fine work.