Monday, November 3, 2008

Thank you, Luis von Ahn

Several years ago, I began transcribing the letters my mother wrote me when she was in Afghanistan. Because many of them were typewritten, I thought I might be able to speed up the process by scanning those pages and using a computer program to convert the scanned pages to text. So, I invested the money in the scanner and software, and I invested lots of time in learning how.

In the end, I found that it was actually faster to type the letter into a blank Word document than to use the scanner. The reason was, Mom’s old typewriter had a couple of letters that were a little crooked and an a and an e that came through with the hole filled in. So, the scanner process missed a lot of words, and I had to go through and figure out what was supposed to be there instead of the garbled word the scanner put in.

Where was Luis von Ahn when I needed him?

Luis von Ahn is a computer scientist at Carnegie Mellon University in Pittsburgh. You use an invention of his each time you comment on my blog (you do comment on my blog, don’t you?), because I have a setting where you have to read a wavy word and reproduce it in the box provided. It’s an extra little step, but it’s worth it because it eliminates spam. He called this process CAPTCHA.

As Dr. von Ahn contemplated this process, he came to two conclusions. The first was that each time someone typed in one of these wavy words, the brain was performing an amazing task, one that no computer could do. The second was that the combined time and keystrokes all the people on the internet wasted performing this little security measure was mind-boggling. He figured it came to about half a million hours every day.

So, he set out to find a way to harness all that human brain power, time and keystrokes.

He thought of all the libraries’ efforts to digitize their collections so they can make them universally available. The process they use is very like the one I tried with my mother’s letters. And, like me, they end up with words that the computer can’t decipher. A human being has to look at those words and decode them.

Dr. von Ahn came up with the idea of using something very like CAPTCHA, but instead of having one word to decipher, you would have two. One would be a regular test word used by CAPTCHA; the other would be one of the words that a computer hadn’t been able to read that needed a human to decipher it. That word would be given to several people. If they all agree what word it is, then that is the word that will go into the digitized copy of the book or newspaper that it came from.
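
For readers who like to see an idea written out, here is a rough Python sketch of the two-word scheme as I understand it. It is not Dr. von Ahn's actual code. The names (UnknownWord, check_challenge, agreed_reading) are made up for illustration, and so are two details the description above doesn't spell out: that three people must read a word before it is accepted, and that a reading only counts if the person got the test word right.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UnknownWord:
    """A scanned word the computer could not read (hypothetical structure)."""
    image_id: str
    guesses: list = field(default_factory=list)


def check_challenge(control_answer, typed_control, unknown, typed_unknown):
    """Grade the test word; if it is right, record the reading of the unknown word.

    Assumption: readings are kept only from people who passed the test word.
    """
    passed = typed_control.strip().lower() == control_answer.lower()
    if passed:
        unknown.guesses.append(typed_unknown.strip().lower())
    return passed


def agreed_reading(unknown, needed=3):
    """Return the word once `needed` people have read it and they all agree.

    Assumption: `needed=3` is an illustrative number, not the real threshold.
    """
    if len(unknown.guesses) < needed:
        return None
    word, count = Counter(unknown.guesses).most_common(1)[0]
    # "If they all agree": every recorded reading must be the same word.
    return word if count == len(unknown.guesses) else None
```

Each visitor would go through check_challenge once. After enough of them have typed the same thing for the mystery word, agreed_reading hands back the word that goes into the digitized copy.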

Dr. von Ahn says that the number of words already transcribed by this process is something over a billion.

Dr. von Ahn calls this technique reCAPTCHA. It’s used by Facebook and Twitter, among others. Just think, every time you use it, you’re helping to digitize the entire library of the New York Times!

I heard about reCAPTCHA on All Things Considered on NPR last August. Check out the program transcript. It’s such a great service to America, the digitizing of whole libraries, that I thought I’d find article after article about it on the internet, but I didn’t.

So I’m writing about it. In a few years, when we have access to books and newspapers from the last century or two with a few keystrokes, we’ll have Dr. von Ahn to thank for expediting the process.


Monique said...

That was extremely interesting and I had to leave a comment so I could bring the message home by looking at some squiggly letters! I never gave much thought to it.

Liz Adair said...

When there's just one set of letters, it's simply an ingenious spam-avoidance device, but when it's two sets, and they're both wavy words, then one is the digitizing project, and you're a part of it!