Sore Thumbs in Subcorpus vocabulary

Lately James Tauber and I have been collaborating (him doing lots of work, me coming up with new things to do) on vocabulary data.

My particular driving interest was how to structure the writing of LGPSI to introduce vocabulary, from the start, that will build students towards a strong basis for reading post-introductory texts. Instead of a pure frequency approach, LGPSI has as its broad focus providing just more and more and more exposure to the most frequent words. At some stage I hope to utilise specific lemmatisation work on LGPSI itself to improve the text’s own introduction of vocabulary, but my major focus at the moment is thinking how the larger arc of the text serves a long-term goal of Greek learners.

The subcorpus we selected deliberately targets both the New Testament and some post-beginner prose texts (selections from Plato, Lysias, Xenophon, Thucydides, Herodotos, and a small amount of Demosthenes and Isocrates), as the most likely ‘second contact’ for Greek learners. In this post, I want to reflect a little on ‘sore thumbs’ – words whose frequency in the subcorpus as a whole looks quite different from their frequency in one of its sub-subcorpora. In the notes below, I’ll simply give the percentage point at which a word occurs in terms of coverage, first for the subcorpus and then for the GNT.
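For readers who want to see what a ‘coverage point’ might look like in practice, here is a minimal sketch. It assumes the figure is cumulative token coverage: rank the lemmas by descending frequency, and a lemma’s coverage point is the share of all tokens accounted for once you have reached it in that list. This is my reading, not necessarily how James’s data is actually produced. The function names (`coverage_points`, `sore_thumbs`), the `gap` threshold, and the tiny token lists are all hypothetical and invented purely for illustration.

```python
from collections import Counter


def coverage_points(lemmas):
    """Map each lemma to the cumulative coverage (%) reached at its rank.

    Assumption: lemmas are ranked by descending token count, with ties
    broken alphabetically so the result is deterministic.
    """
    counts = Counter(lemmas)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    points = {}
    running = 0
    for lemma, n in ranked:
        running += n
        points[lemma] = 100.0 * running / total
    return points


def sore_thumbs(sub_points, gnt_points, gap=15.0):
    """List lemmas that turn up much later (or never) in the GNT ranking.

    `gap` is a hypothetical threshold in percentage points.
    A GNT value of None means the lemma has zero GNT occurrences.
    """
    rows = []
    for lemma, sub_pct in sub_points.items():
        gnt_pct = gnt_points.get(lemma)
        if gnt_pct is None or gnt_pct - sub_pct >= gap:
            rows.append((lemma, sub_pct, gnt_pct))
    return sorted(rows, key=lambda row: row[1])


# Toy token lists with invented counts, NOT real corpus data.
toy_subcorpus = ["καί"] * 5 + ["δή"] * 3 + ["θέλω"] * 2
toy_gnt = ["καί"] * 6 + ["θέλω"] * 3 + ["δή"] * 1

for row in sore_thumbs(coverage_points(toy_subcorpus), coverage_points(toy_gnt)):
    print(row)
```

On real data the input would be the lemmatised tokens of each corpus rather than hand-built lists, and the threshold would need tuning by eye.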

Exhibit A:

Words common in broader Greek, but relatively absent in the New Testament.

δή (50.97%, 94.65%)

The first word that we encounter that’s relatively common in our subcorpus but not the GNT is δή. And what I’d say is that it’s emblematic of the fact that the broader and richer particle usage in Greek is not reflected in the GNT. This is one challenge for students moving from NT Koine to broader Greek – encountering and mastering a range of particles.

(see also τοίνυν, ἦ, οὐκοῦν and οὔκουν)

ὦ (54.88%, 85.14%)

Using ὦ in vocative address is relatively uncommon in the GNT.

τοιοῦτος (56.28%, 80.18%)

The ‘extended’ set of qualitative and quantitative demonstrative-type adjectives are under-represented in the GNT.

ἐπεί (58.01%, 83.07%)

This one surprised me (as did a few), simply because I have become rather used to ἐπεί.

οἴομαι (60.19%, 97.34%)

With only 3 NT occurrences, this word is high-frequency outside the GNT and remarkably absent within it.

βούλομαι (60.5%, 83.8%)

Similarly this one. This makes me suspect that the GNT has some systematic preferences, e.g. θέλω in place of βούλομαι. Students transitioning from NT-only Koine to broader Greek would benefit from just some advance notice on these.

οἷος (62.16%, 90.42%)

Similar to τοιοῦτος above.

ὥσπερ (62.63%, 84.15%)

For a word that I use conversationally quite a bit, I was again surprised by its infrequency. I have a hunch this is perhaps reflected in a preference for ὡς.

ὅδε (66.3%, 91.36%)

The nearest demonstrative is drastically under-represented in the GNT, probably reflecting its disuse among contemporary speakers. It would be interesting to compare this with a broader Koine corpus.

πολέμιος (66.47%, n/a)

This is the first, but not the only, word that is reasonably frequent in our subcorpus but appears zero times in the GNT (σφεῖς is the next most common one).

μάλιστα (67.34%, 91.41%)

Another reasonably common word (and useful conversational one!) dramatically under-represented in the GNT.

τυγχάνω (67.42%, 91.97%)

χρή (69.94%, hapax)

Remarkable that this is a hapax in the NT (James 3:10).

ἀφικνέομαι (70.23%, hapax)

Again remarkable to me, having become used to so many people ‘arriving’ in various places. (Rom 16:19)

I could go on, to be honest. But one of the things this brings home to me is that those who learn with only NT Greek need to have their sense of Greek ‘normed’ a bit against a broader Greek corpus. Not necessarily this corpus, but nevertheless ‘broader’ Greek. How can you have a sense of Greek usage if your only reading material is the GNT corpus?

Secondly, it sets some benchmarks for my own writing. The vocabulary data James is generating lets me check my own tendencies, which very often are shaped by what I’ve been reading lately! It lets me ask, “is this word relatively common, or uncommon?” and shape the vocabulary introduction in LGPSI around benchmarks of 1000, 2000, 3000, 4000 lemmata.
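As a rough illustration of that kind of check, the sketch below places a lemma into a frequency band by its rank, so that a word ranked 1 to 1000 falls in the 1000 band, 1001 to 2000 in the 2000 band, and so on. The function name `frequency_band`, the tie-breaking rule, and the toy counts are hypothetical and invented for the example. They are not drawn from the actual vocabulary data.

```python
from collections import Counter


def frequency_band(lemma, counts, band_size=1000):
    """Return the upper bound of the rank band a lemma falls in.

    Rank 1..band_size gives band_size, the next band_size ranks give
    2 * band_size, and so on. Returns None if the lemma is unattested.
    Ties in count are broken alphabetically (an arbitrary choice).
    """
    if lemma not in counts:
        return None
    ranked = sorted(counts, key=lambda l: (-counts[l], l))
    rank = ranked.index(lemma) + 1
    return ((rank - 1) // band_size + 1) * band_size


# Toy counts, invented for illustration only. A band size of 2 keeps
# the example small; the benchmarks in the post would use 1000.
toy_counts = Counter({"καί": 50, "δέ": 40, "δή": 30, "ἐπεί": 20, "χρή": 10})

for word in ["καί", "δή", "χρή", "οἴομαι"]:
    print(word, frequency_band(word, toy_counts, band_size=2))
```

A lookup like this, run over a draft chapter, would show at a glance which words sit outside the band a given stage of the text is meant to cover.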