gamehistoryorg

Our OCR moonshot for the digital library

Added 2024-06-03 21:37:29 +0000 UTC

Hi folks! Library director Phil Salvador here. I wanted to share an update on what we've been doing with our digital library and an exciting tool we've developed that's going to make our library an even better resource for researchers.

One of the features we've advertised for our digital archive is the ability to search digital scans of video game magazines. We've been building a pretty extensive magazine collection, and full-text searching is a great way to make their contents discoverable.

Unfortunately... old video game magazines do not play nicely with text recognition software.

A quick rundown of how this process usually works: Once you scan a document, you can run it through optical character recognition (OCR), which identifies all the text and makes it searchable. OCR has been around for ages, and it works great on traditional documents. Think plain, dark text on a light backdrop, like a book.

If you've ever seen a video game magazine, though, you know that their layouts are chaotic and crazy. Tons of colors, busy backgrounds, weird fonts. If you feed a scan like this to an OCR tool, it just kind of emotionally shuts down.

Here's an example: This is a review of the Sega Genesis game Warlock, which ran in the May 1995 issue of Diehard GameFan:

Yikes! Adobe Acrobat doesn't even attempt to identify the text on these pages.

There's an open-source OCR tool called Tesseract that's widely used, including by the Internet Archive, but its results aren't much more promising:

liL •ell
TiflEHTfl]
WKBPii
''r game takes on a subsequent
ACCLAIM • 16MEG ACTION/PLATFORM 1 PLAYER AVAILABLE APRIL
Hull
l|Mri
r.m!l
ilikiiliii

While I was preparing the demo for our digital library that we shared last December, I noticed how much text wasn't getting picked up by Tesseract on some of these scans, and we decided that we needed a better solution.

One option we explored was to use machine vision to identify the text. Machine vision tech has made some big advances in the last five years with tools like Amazon Textract or Google Vision, which do a much better job handling OCR on difficult documents like this. It seemed promising!

But that didn't quite work for what we're trying to do. Tools like Textract are designed to extract text data. When you feed one of them a document, it identifies the text, and then that data lives on Amazon's servers. There's no option to export that text back to the PDF to make it searchable. Nobody is using these tools to do bulk text recognition for PDFs.

The results were really, really promising though, so we decided to take the time to figure out how to make this work. And we did!

For the last couple of months, our director of technology Travis Brown has been developing a tool we codenamed "Optical Character Recognition for Archival Purposes," or "OCRAP." Using Textract, Travis was able to figure out how to pull the data off Amazon's servers, convert it into an OCR-friendly format, then feed that data back through an OCR engine to make it searchable in the PDF. It sounds like an absolute Frankenstein solution, and it kind of is, but the results are remarkable and speak for themselves.
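To give a rough feel for the "reformat" step, here's a minimal sketch in Python. It is not our actual tool. It assumes the OCR-friendly format is hOCR (an HTML-based layout that many OCR tools read and write), which is just one plausible choice. The function name textract_words_to_hocr and the sample data are made up for illustration. The field names (BlockType, Text, Confidence, Geometry.BoundingBox) follow the shape of a Textract response, where bounding boxes are given as fractions of the page size.

```python
from html import escape


def textract_words_to_hocr(blocks, page_width, page_height):
    """Turn Textract-style WORD blocks into a minimal hOCR page fragment.

    Hypothetical helper for illustration only. `blocks` is a list of dicts
    shaped like the "Blocks" entries in a Textract response.
    """
    spans = []
    for block in blocks:
        # Textract also returns PAGE and LINE blocks; only words are kept here.
        if block.get("BlockType") != "WORD":
            continue
        box = block["Geometry"]["BoundingBox"]
        # Boxes arrive as ratios of the page size; hOCR expects pixel corners.
        x0 = round(box["Left"] * page_width)
        y0 = round(box["Top"] * page_height)
        x1 = round((box["Left"] + box["Width"]) * page_width)
        y1 = round((box["Top"] + box["Height"]) * page_height)
        conf = round(block.get("Confidence", 0))
        spans.append(
            f"<span class='ocrx_word' title='bbox {x0} {y0} {x1} {y1}; "
            f"x_wconf {conf}'>{escape(block['Text'])}</span>"
        )
    body = "\n".join(spans)
    return (
        f"<div class='ocr_page' title='bbox 0 0 {page_width} {page_height}'>\n"
        f"{body}\n</div>"
    )


# Made-up sample data: one LINE block (skipped) and one WORD block (kept).
sample_blocks = [
    {
        "BlockType": "LINE",
        "Text": "WARLOCK",
        "Confidence": 99.2,
        "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.125,
                                     "Width": 0.5, "Height": 0.0625}},
    },
    {
        "BlockType": "WORD",
        "Text": "WARLOCK",
        "Confidence": 99.2,
        "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.125,
                                     "Width": 0.5, "Height": 0.0625}},
    },
]

hocr = textract_words_to_hocr(sample_blocks, page_width=1000, page_height=2000)
print(hocr)
```

A complete hOCR file would also need the surrounding HTML document, plus line and paragraph elements. This sketch only shows the coordinate conversion, which is the part that lets a PDF tool place invisible text on top of the right spot in the scan.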

Here's a sample of how our tool handled that Warlock review, copied out of the PDF:

WARLOCK
With the advent of the next
generation gaming machines, it
takes much more to impress me
these days. In the case of
Warlock, I erased all those great
texture-mapped polygons from my
memory banks and made certain
to judge it only by 16-bit Genesis
standards. And by those stan
dards, Warlock is pretty darn
mediocre.

There are still a few trouble spots, but if Tesseract is 75% of the way there, then we feel like our solution is 90% or higher. This feels like a generational leap forward for OCR. We don't know anyone else using machine vision to make searchable PDFs in bulk for research purposes like this.
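Those percentages are our gut feeling, not measurements. If you wanted to put a rough number on it, one simple approach is to compare OCR output against a hand-typed transcription of the same passage. Here's a minimal sketch using Python's standard difflib module. The function name ocr_similarity is made up, and the sample strings are just snippets borrowed from the examples above.

```python
from difflib import SequenceMatcher


def ocr_similarity(reference, ocr_output):
    """Return a 0.0-1.0 similarity score between a reference and OCR text.

    Line breaks and repeated spaces are collapsed first, so differences in
    line wrapping don't count against the OCR output.
    """
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    return SequenceMatcher(None, ref, out).ratio()


# A hand-typed reference (assumed correct) and two OCR attempts at it.
reference = "Warlock is pretty darn mediocre."
clean_attempt = "Warlock is pretty darn\nmediocre."
noisy_attempt = "WKBPii Hull l|Mri"

clean_score = ocr_similarity(reference, clean_attempt)
noisy_score = ocr_similarity(reference, noisy_attempt)
print(f"clean: {clean_score:.2f}  noisy: {noisy_score:.2f}")
```

A score like this only tells you how close the characters are. It doesn't check whether the text landed in the right place on the page, so treat it as a quick sanity check and not a full evaluation.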

We've heard from other people in our community who do scanning, like Gaming Alexandria, that there's interest in this tool, so we may look into open-sourcing it. (We should probably give it a better name though.)

One of the reasons it's taken us a while to launch the digital library is that we wanted to get this right. We want our library to be a world-class platform for game history research, and that means trying our best to solve these problems that nobody's really had to address until now.

We're hoping to have more updates about the digital library soon, but for now, we hope you enjoy this peek behind the curtain!