An entrée into the world of 18th and 19th century Greek textbooks

Let me tell you about a little side-project I’m cooking up.

One of the great advantages Latin students have when looking for plenty of reading content is that, between the easy readers some Latin teachers are pumping out and the products of the Direct Method, there’s a fair amount of material to get stuck into. (See here, here, here).

Not so with Greek. Unfortunately, the Direct Method advocates produced very little Greek. You can see some on the second half of that vivarium novum page, but those texts are not very accessible either.

So, what I’ve been doing is scouring 18th and 19th century textbooks, a range of books with endless variations on the title “A First Greek Reader”. I have almost 30 as scanned pdfs. Some are more “directish method”, others are very traditional, and almost all start at a level far more advanced than “easy Greek reading” should.

The project:

  • Digitise: create plain text copies of all the readings in these books.
  • Lemmatise: lemmatise all the texts as well
  • Gloss: produce a version with appropriate glossing to help readers
  • Annotate: provide notes to go with each text to help readers
  • Record: audio files for each text
  • Scaffold: write Greek language content as both pre-reading and post-reading material.
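
To make the lemmatise and gloss steps a little more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the project’s actual tooling: the four-entry lexicon, the names nfc and gloss_text, and the choice to normalise text to Unicode NFC before looking words up. A real pipeline would rely on a proper morphological dataset or lemmatiser.

```python
import re
import unicodedata


def nfc(s):
    """Normalise to Unicode NFC so equivalent accented forms compare equal."""
    return unicodedata.normalize("NFC", s)


# Hypothetical mini-lexicon: surface form -> (lemma, English gloss).
_RAW_LEXICON = {
    "ἄνθρωπος": ("ἄνθρωπος", "human being"),
    "ἀνθρώπου": ("ἄνθρωπος", "human being"),
    "λέγει": ("λέγω", "say"),
    "λόγον": ("λόγος", "word, account"),
}
LEXICON = {nfc(form): (nfc(lemma), gloss)
           for form, (lemma, gloss) in _RAW_LEXICON.items()}


def gloss_text(text):
    """Return (token, lemma, gloss) triples; unknown words get None, None."""
    tokens = re.findall(r"\w+", nfc(text))  # word characters only, no punctuation
    return [(tok, *LEXICON.get(tok, (None, None))) for tok in tokens]


for token, lemma, gloss in gloss_text("ὁ ἄνθρωπος λέγει λόγον."):
    print(token, lemma, gloss)
```

Words missing from the lexicon come back with no lemma or gloss, which is a handy way of spotting what still needs to be added by hand.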

Actually, there are some more things going on behind the scenes as well. Most of my side-projects these days involve overlapping and interlocking methods and goals. In this case, the goal is to create a digital resource of freely available material that helps bridge the long plateau between “1st year Greek” (not a real thing) and “fluently read authentic high-register ancient texts” (a real thing and quite difficult).

More on this, obviously, as it develops…

Actually, since I penned this post, it turns out there are at least 2 people doing OCR of this kind of material. So I’m probably going to shift my focus from simply digitising to making more of this material usable.

Digital Nyssa Project: From OCR to Plain Text

So the first step in my project is getting from a print text to a digital text.

There are a few options for doing OCR. Bruce Robertson at Mt. Allison University is involved in LACE: Greek OCR, a large scale project to produce high quality OCR of ancient Greek texts. But apart from submitting a request, this is not really scalable to personal use.

Antigraphaeus allows you to do Greek OCR in your browser. It’s the child of Ryan Baumann, and basically instantiates online what I describe below.

Ancient Greek OCR provides a number of options (platform dependent) for using the Tesseract OCR engine trained for Ancient Greek. It’s the work of Nick White.

I followed the instructions to install and set up gImageReader and found it pretty straightforward.

For a pdf input, I grabbed Patrologia Graeca 46 off Archive.org. I then created a pdf of only the pages I needed for DDAE (6), to save loading time. Here’s an image with the first page open, just before clicking “Recognize”, which runs the OCR process on the selection and outputs the result to the sidebar on the right.

It takes around 3 minutes to run a full page (well, half-page) of Migne. I cut and paste the results to a text editor so that I could save the result as a UTF-8 encoded .txt, rather than the default in the program which appears to generate an ANSI file (which is not useful).
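
For anyone who would rather script the saving step than paste into an editor, here is a minimal Python sketch. The function names and file name are hypothetical, the NFC normalisation is my own assumption (it is not something the OCR program does), and I am guessing that “ANSI” here means a single-byte Windows code page such as cp1252.

```python
import unicodedata
from pathlib import Path


def fits_in_codepage(text, codepage="cp1252"):
    """Return True if every character can be stored in the given code page."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def save_ocr_text(text, path):
    """Write OCR output as UTF-8, normalised to NFC; return what was written."""
    normalised = unicodedata.normalize("NFC", text)
    Path(path).write_text(normalised, encoding="utf-8")
    return normalised


def load_ocr_text(path):
    """Read the corrected text back, always decoding as UTF-8."""
    return Path(path).read_text(encoding="utf-8")
```

Calling fits_in_codepage on a polytonic word is a quick way to see why the single-byte default is not useful: the accented and aspirated characters simply have no slot in such a code page.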

 

Then I worked with the pdf and the text side by side to manually correct the OCR results. This is slightly laborious, and using some kind of editing interface like the one Robertson has up on the Greek OCR Challenge page would probably speed up the process.

Instead, I did this:

The general quality of the OCR was good, but it did need corrections. I’d love to speed up this process, because it takes me circa 10 mins to do a page.

So, 13 mins a page (around 3 for the OCR run plus 10 for correction), 6 pages: I’m looking at almost 80 mins of work to produce a corrected text. Then a bit of quick editing to remove line breaks and generate a single continuous text.
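
The line-break clean-up can also be scripted. Below is a minimal Python sketch, not what I actually ran: the function name unwrap_lines is hypothetical, and it assumes that a hyphen at the end of a line marks a word split across lines and that blank lines are paragraph breaks worth keeping.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines into continuous text.

    Blank lines are kept as paragraph breaks. A line ending in a hyphen is
    assumed to end in a split word, so the hyphen is dropped and the next
    line is attached without a space.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for para in paragraphs:
        joined = ""
        for line in para.splitlines():
            line = line.strip()
            if not line:
                continue
            if joined.endswith("-"):
                joined = joined[:-1] + line   # rejoin the split word
            elif joined:
                joined += " " + line          # ordinary line break becomes a space
            else:
                joined = line
        result.append(joined)
    return "\n\n".join(result)
```

The hyphen rule will wrongly merge any genuine hyphen that happens to fall at a line end, so the output still deserves a quick read-through.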

Diary of a Digital Apprentice (2): First, a Unix tutorial

(Here for the blog-series kick-off post).

We’re playing catch-up a little, and these are things I did in the tail end of 2017.

It’s been a long time since I’ve done anything with Unix. About 10 years, actually, and my Unix experience back then was limited to running Ubuntu and being forced to troubleshoot a lot of things, mainly by googling answers. That was frustrating and satisfying at the same time. A memorable highlight was the time my system switched to Ancient Greek at some fundamental level, so that I couldn’t log in because it would only input Greek characters, and the fix was not as simple as ‘change keyboard’.

Anyway, Jedi master Tauber decided I should learn to manipulate text files in Unix and set me the following tasks. You can see them over here:

This is what I call “hunt”-learning. The teacher isn’t pushing, and the learner isn’t actively trying to pull things from the teacher, rather the teacher is setting up tasks which the learner must then go and problem-solve. I think there’s a lot to be said for such a method, and it works particularly well for something like this.

Also, by the end of the 7 tasks, I had not only an appreciation for how to do these things, but also a sense of (a) the kinds of things that can be done just by manipulating appropriate data sets, and (b) how much is possible if you just have the data.

Of course, having the data, or having a text in an actionable form, is itself half the struggle.

If you’re a total beginner like me, and want to follow through those 7 tasks, go ahead, and feel free to drop me a line if you get stuck. There’s lots I don’t know, but I know enough to hint you along the path.

Diary of a Digital Apprentice (1)

One of my goals for 2018 is to acquire a working skillset in areas of Digital Humanities. As I do so, I plan to blog regularly on that ‘mission’. In today’s post, I provide some context for the start of that journey.

 

I’d say I’ve long had a user-side interest in Digital Humanities. I’ve appreciated, and used, the considerable resources that things like Perseus, TLG, PHI, and other packages provide. And I’ve always envisioned ‘more’ being possible. But, being relatively short on the technical side of things, DH has always been a bit of black-box wizardry to me.

A couple of years back I made the acquaintance, first digitally, of James Tauber. Some of our initial overlap and discussion had to do with tools for language learning and teaching. We met briefly at AARSBL in 2015, and have conversed a bit more since then. Another face to face meeting at AARSBL in 2017 helped solidify things, and we have launched both some collaboration and some apprenticing.

That ‘looks like’ two things. Firstly, a combination of push-learning, pull-learning, and hunt-learning. Pull-learning is where I ask, “how do we do X?” or “is Y possible?” and then get a crash course on how to make certain things happen, or an explanation of “yes, Y is possible, look, Dr ABC has been working on this for umpteen years, see!”. Push-learning is where you learn things you didn’t know you could learn, e.g. “Hey, Seumas, did you know you can use E to accomplish F, G, and H?” And hunt-learning is when James says something like, “Seumas, figure out how to do M, N, O, and P, and then tell me how you did it or when you get stuck.”

Part of this relates to the work that Eldarion is doing on developing the Scaife Viewer for Perseus. Which is incredibly exciting because (a) Perseus! (b) have you seen the Scaife Viewer demo’d? (c) it’s great to see inside the black-box so to speak, to see how something like this gets developed and figure out how it works.

Another side of it is my digital Nyssa project for the year.

“Digital Nyssa” is my project to curate/shepherd a text (initially just one, but maybe more) through an open, free, digital pipeline from print to digital edition. It’s a means of acquiring practical DH skills across a range of tools and tasks (OCR, TEI-XML markup, PoS and morphological tagging, digital edition creation, and then commentary/annotation/translation). You’ll be hearing more about it as the year goes on, and I’ll outline a little bit more next week.

So, each week I’ll be posting a bit of what I’ve been doing/learning/working on, as part of a bigger project to document the learning process for myself, and hopefully to encourage others that DH is not so scary. The first few posts will also play some catch-up on things from the past few weeks.