So the first step in my project is getting from a print text to a digital text.
There are a few options for doing OCR. Bruce Robertson at Mt. Allison University is involved in LACE: Greek OCR, a large-scale project to produce high-quality OCR of ancient Greek texts. But apart from submitting a request, this is not really scalable to personal use.
Antigraphaeus allows you to do Greek OCR in your browser. It’s the brainchild of Ryan Baumann, and basically instantiates online what I describe below.
Ancient Greek OCR provides a number of options (platform dependent) for using the Tesseract OCR engine trained for Ancient Greek. It’s the work of Nick White.
I followed the instructions to install and set up gImageReader and found it pretty straightforward.
For a pdf input, I grabbed Patrologia Graeca 46 from Archive.org. I then created a pdf of only the pages I needed for DDAE (6), to save loading time. Here’s an image with the first page open, just before I click “Recognize”, which will run the OCR process on the selection and output it to the sidebar on the right.
It takes around 3 minutes to run a full page (well, half-page) of Migne. I cut and pasted the results into a text editor so that I could save them as a UTF-8 encoded .txt, rather than using the program’s default, which appears to generate an ANSI file (which is not useful for polytonic Greek).
Then I worked with pdf and text side by side to manually correct the OCR results. This is slightly laborious, and using some kind of editing process like the one Robertson has up on the Greek OCR Challenge page would probably speed up this process.
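If you wanted to script the saving step rather than go through a text editor, a minimal sketch might look like this. The function names `save_utf8` and `is_valid_utf8` are my own, purely for illustration; the only thing it relies on is Python’s standard library, and it assumes you already have the OCR output as a string.

```python
from pathlib import Path


def save_utf8(path: str, text: str) -> None:
    """Write text as UTF-8, regardless of the platform's default encoding."""
    Path(path).write_text(text, encoding="utf-8")


def is_valid_utf8(path: str) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The second function is just a sanity check: a file saved in a legacy single-byte Greek code page will normally fail to decode as UTF-8, so it flags the problem before any correction work starts.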
Instead, I did this:
The general quality of the OCR was good, but it did need corrections. I’d love to speed up this process, because it takes me circa 10 mins to do a page.
So, with roughly 3 mins of OCR plus 10 mins of correction, that’s 13 mins a page; at 6 pages, I’m looking at almost 80 mins of work to produce a corrected text. Then a bit of quick editing to remove line breaks and generate a single continuous text.
I wonder if you could improve things at all by running both Antigraphaeus and Ancient Greek OCR and then putting the results in something like DiffMerge to have the two separate OCR outputs check each other’s work. That might speed up the process, perhaps?
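The line-break removal could also be scripted. This is only a sketch, not what I actually did: `join_lines` is a made-up name, and it assumes that a hyphen at the end of a line always marks a word split across lines, which will be wrong for any genuine hyphen that happens to fall there.

```python
import re
import unicodedata


def join_lines(text: str) -> str:
    """Collapse OCR line breaks into one continuous line of text."""
    # Normalize to NFC so precomposed and combining forms of accented
    # Greek letters are represented the same way.
    text = unicodedata.normalize("NFC", text)
    # Rejoin words hyphenated across a line break: "lo-\ngos" -> "logos".
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    # Replace every remaining run of whitespace (newlines included)
    # with a single space.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `join_lines("lo-\ngos kai\npsyche")` gives `"logos kai psyche"`.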
I wonder… I’ll give that a try.
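A rough idea of what that cross-check could look like, using Python’s standard `difflib` instead of DiffMerge. The function name `disagreements` is mine, and it assumes both OCR outputs have already been saved as plain text and split on whitespace gives comparable words.

```python
import difflib


def disagreements(ocr_a: str, ocr_b: str) -> list[tuple[str, str]]:
    """Return (words_from_a, words_from_b) pairs where two OCR outputs differ."""
    a, b = ocr_a.split(), ocr_b.split()
    # autojunk=False stops difflib from treating very frequent words as noise.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Wherever the two engines agree, the text is probably (not certainly) right, so the list this returns is where I would look first. It won’t catch places where both engines make the same mistake.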
I actually haven’t corrected all 6 pages, so I’m open to trying some other options.
I might try tracking the error rate on the next page or two as well.
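One common way to measure that is a character error rate: the edit distance between the raw OCR and my corrected text, divided by the length of the corrected text. Here is a sketch under that assumption; the function names are my own, and the distance is a plain Levenshtein implementation rather than anything from an OCR toolkit.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ocr: str, corrected: str) -> float:
    """Edit distance from OCR to corrected text, per corrected character."""
    # Normalize both sides so combining vs precomposed accents don't
    # count as errors.
    ocr = unicodedata.normalize("NFC", ocr)
    corrected = unicodedata.normalize("NFC", corrected)
    if not corrected:
        raise ValueError("corrected text must not be empty")
    return edit_distance(ocr, corrected) / len(corrected)
```

Running this per page on the raw and corrected files would give a number to compare across pages, or across OCR engines.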