So the first step in my project is getting from a print text to a digital text.
There are a few options for doing OCR. Bruce Robertson at Mt. Allison University is involved in LACE: Greek OCR, a large-scale project to produce high-quality OCR of ancient Greek texts. But short of submitting a request to the project, it isn't really something you can scale down to personal use.
Ancient Greek OCR provides a number of options (platform dependent) for using the Tesseract OCR engine trained for Ancient Greek. It’s the work of Nick White.
For a PDF input, I grabbed Patrologia Graeca 46 from Archive.org. I then created a PDF of only the pages I needed for DDAE (six of them), to save loading time. Here’s an image with the first page open, just before I click “Recognize”, which runs the OCR process on the selection and outputs the result to the sidebar on the right.
It takes around 3 minutes to run a full page (well, half-page) of Migne. I copied and pasted the results into a text editor so that I could save them as a UTF-8 encoded .txt, rather than use the program's default, which appears to generate an ANSI file (not useful for polytonic Greek).
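If you would rather script the save step than go through a text editor, something like the following should work. It's only a minimal sketch: the function name `save_utf8` and the file name are my own, and the NFC normalisation is an extra I'm assuming is helpful (it makes accented letters that look identical also compare as identical), not something the OCR program itself does.

```python
import unicodedata
from pathlib import Path


def save_utf8(text: str, path: str) -> None:
    """Write OCR output to disk as UTF-8 (hypothetical helper)."""
    # Assumption: normalise to NFC so that a letter plus combining accent
    # and its precomposed form end up as the same character.
    normalised = unicodedata.normalize("NFC", text)
    # Always state the encoding explicitly instead of relying on the
    # platform default, which on Windows is often an "ANSI" code page.
    Path(path).write_text(normalised, encoding="utf-8")


if __name__ == "__main__":
    # Stand-in for text pasted from the OCR sidebar.
    save_utf8("ἐν ἀρχῇ ἦν ὁ λόγος", "page1.txt")
```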
Then I worked with the PDF and the text side by side to manually correct the OCR results. This is slightly laborious, and using some kind of editing process like the one Robertson has up on the Greek OCR Challenge page would probably speed it up.
Instead, I did this:
The general quality of the OCR was good, but it did need corrections. I’d love to speed up this process, because it takes me around 10 minutes to do a page.
So, at 13 minutes a page (3 for the OCR, 10 for corrections) and 6 pages, I'm looking at 78 minutes, so almost 80 minutes of work to produce a corrected text. Then a bit of quick editing to remove line breaks and generate a single continuous text.
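The line-break clean-up can also be scripted. This is a rough sketch rather than what I actually did: `join_lines` is a made-up name, and it assumes that every hyphen at the end of a line marks a word split across the break, which is worth spot-checking against the printed page.

```python
import re


def join_lines(text: str) -> str:
    """Collapse a line-broken OCR text into one continuous line (hypothetical helper)."""
    # Rejoin words split across a line break: "ἀρ-\nχῇ" becomes "ἀρχῇ".
    # Assumption: a hyphen at the end of a line is always a line-break hyphen.
    text = re.sub(r"-[ \t]*\n[ \t]*", "", text)
    # Turn every remaining line break (plus surrounding whitespace) into one space.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "ἐν ἀρ-\nχῇ ἦν\nὁ λόγος\n"
    print(join_lines(sample))  # ἐν ἀρχῇ ἦν ὁ λόγος
```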