The disappearing goal of corpus fluency: the 98/56 paradox

If we take the figure of 98% vocabulary coverage as what is needed to read a text at a comprehensible and ‘fluent’ level, so that the unknown 2% is understandable by context or not significant enough to frustrate understanding, then for the New Testament corpus, that requires learning around 56% of the total vocab (3102 out of 5461 lemmas, depending on how you lemmatise it; that’s my breakdown based on Tauber’s MorphGNT on SBLGNT).
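
For anyone wanting to check that sort of figure against their own lemmatisation, here is a minimal sketch in Python of the arithmetic involved. It assumes you already have the corpus as a flat list of lemmas, one per running word. The function name `lemmas_needed` and the toy word list are made up for illustration; none of this is Tauber’s code or the actual MorphGNT data.

```python
from collections import Counter


def lemmas_needed(lemmas, target=0.98):
    """How many of the most frequent lemmas you must know to cover
    `target` of the running words in `lemmas`."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    covered = 0
    # Walk down the frequency list, most common lemma first.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return rank
    return len(counts)


# Toy corpus with invented counts (not NT data): 200 tokens, 6 lemmas.
toy = ["καί"] * 150 + ["λέγω"] * 30 + ["ἄρτος"] * 15 + ["w4"] * 3 + ["w5", "w6"]
needed = lemmas_needed(toy)
print(needed, "of", len(set(toy)), "lemmas")

# The NT figures quoted above, as a share of the lemma list.
nt_share = 3102 / 5461
print(f"{nt_share:.1%}")
```

On the toy list this reports 4 of 6 lemmas, and the quoted NT figures come out at 56.8% of the lemma list. Swap in a real lemma list and the same function will tell you where the 98% line falls for your corpus.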

That’s a lot of vocab: over 3000 words. It takes you well into the ‘2 occurrences or fewer’ bracket. And some initial ‘soundings’ for other corpora suggest a similar ratio: to hit the magic 98% mark, you need a vast amount of vocab, including a rather large amount of the low-yield stuff. Now, Tauber will no doubt berate me/insist that I remind you that 98% coverage of the corpus does not equate to 98% coverage of any particular smaller unit (see here). And in fact, he’s absolutely correct. Indeed, from those figures (over 10 years old; how slowly some of these things move), the 3000 top-frequency items render 81% of the verses 95% familiar, which is still a lot less than you’d like.
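
The gap between corpus-level coverage and coverage of smaller units is easy to see with a toy example. The sketch below is Python, with invented ‘verses’ and a hypothetical helper I’ve called `share_of_units_familiar`; it is not how Tauber computed his figures, just an illustration of why the two numbers come apart.

```python
def share_of_units_familiar(units, known, threshold=0.95):
    """Fraction of units (e.g. verses) in which at least `threshold`
    of the running words are lemmas from `known`.
    Assumes `units` is a non-empty list of lists of lemmas."""
    familiar = 0
    for unit in units:
        known_tokens = sum(1 for lemma in unit if lemma in known)
        if known_tokens >= threshold * len(unit):
            familiar += 1
    return familiar / len(units)


# Four invented ten-word 'verses'; "a" stands in for all known vocab.
verses = [
    ["a"] * 10,
    ["a"] * 10,
    ["a"] * 9 + ["x"],
    ["a"] * 8 + ["y", "z"],
]
known = {"a"}

# Coverage over the whole toy corpus: known tokens / all tokens.
all_tokens = [lemma for verse in verses for lemma in verse]
corpus_coverage = sum(1 for lemma in all_tokens if lemma in known) / len(all_tokens)

print(corpus_coverage)                          # 37 of 40 tokens known
print(share_of_units_familiar(verses, known))   # share of verses >= 95% known
```

Here 92.5% of the running words are known, but only half of the verses clear the 95% bar, because the unknown words cluster in particular verses rather than spreading evenly.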

And the paradox is this: learning low-frequency vocab (say, under 5x in the NT) is incredibly low-yield, because you are going to encounter that lemma fewer than five times as you read through the whole corpus. Let alone a 2x frequency word. So the pay-off in terms of understanding is low, but the word is also far more difficult to learn, because it’s not repeated often enough within the corpus for you to acquire it. Indeed, to encounter any of those low-frequency New Testament words often enough to learn them would require you to encounter them outside the New Testament.

Which is why this is a paradox of sorts – to master a particular corpus, whether that’s Plato, Demosthenes, or the NT, actually requires reading outside the corpus because that’s the only way you’ll get enough context, exposure, and repetition to render the corpus’s low frequency vocabulary meaningful in the context of the broader language (something that living in, say, Ancient Greece would have done for you, but you have to make do with what you’ve got).

Moral of the story? If you want to master a narrow corpus, you will have to read more widely than that corpus.

σπεῦδε βραδέως (can you fast track language learning?)

One of the critiques I (πολλάκις; ἐνίοτε;) encounter about a living language methodology is that it’s slow. That it doesn’t get us directly to reading texts (the main interest of most historical language students). That it is inefficient (why do I need to learn the word for ‘butcheress’ if it only appears in LXX 1 Kings 8.13?).

I want to mount something of a defence here, though a gentle one.

  1. You can only go so fast
  2. It’s basically sheer hours, not sheer speed, that charts your progress
  3. Where are you trying to get to?


You can only go so fast

Right, so given that you have X hours in class, or X hours studying, there’s only a finite amount of material you can cover. If that’s English description of Syriac (all my arguments are applicable to most historical languages, so let’s mix it up today!), then your actual exposure to Syriac ‘input’ is going to be very, very limited. If that’s Syriac input, you’re only going to be able to comprehend very, very simple messages at the start, because you basically don’t know enough to understand anything more.

So: grammar-based curriculum: you can proceed through grammar faster, but your ability in, and time spent exposed to, L2 input is severely limited, and your speed at translating that material is going to be slow. Let alone ‘reading’.

Communication-based method: you will proceed through ‘grammar’ or ‘vocab’ much slower, but your input should be much, much higher. So your ability to understand Syriac in language will be stronger, earlier, faster, but still limited (albeit by a different factor).

And nothing much is really going to speed these things up. Sure, you can teach all the grammar up front, and do nothing in Syriac, but then all you’ve done is present a bunch of information – charts and explanations about how Syriac works, but you haven’t learnt Syriac, and in fact you haven’t even read much Syriac (if any), and so it’s after all that grammar that you have to go away and do that work of actually learning the language.

(This is actually how most Grammar/Translation courses really work: learn a description of the language, and only then do you really go out and try to learn the language. If we want to do that, I think we could do it better by explicitly saying that’s what we’re doing: “Hello students, welcome to introduction to Syriac grammar. Over the next X weeks, we’re going to provide an external, English-based description of how Syriac language works. Then next semester you can start learning Syriac.”)

This is also, let me say, what happens if you try and go faster than people can understand in an in-language approach. If you start outstripping students’ comprehensible levels, you have no choice but to either (a) start explaining all the language they can’t understand, and/or (b) ignore them and present language that is beyond them and so of decreasing-to-zero comprehensibility.

It’s basically sheer hours, not sheer speed, that charts your progress

What’s attractive about the above is that you can get a student, or a cohort, to the end of the year (or other arbitrary unit of time) and say, “Great, we’ve covered all of Syriac!”

Except you haven’t. That’s a lie, isn’t it? Hence the title of my post: you’ve rushed through grammar, but you haven’t developed any proficiency in understanding messages in Syriac.

Based on the reading in SLA theory I’ve done, hours are a better measure of progress than most other things. Sure, learners have a bit of fluctuation, but if the main determinant is comprehensible input, and there’s not really a way to speed up certain acquisition processes, then it’s simply hours that provide a fair estimate of how far along you are. This seems to be backed up by, for instance, the kinds of hours-estimates you see for CEFR-based standards.

Want to ‘go faster’? It’s not method that’s the issue, it’s time spent in the language with messages you understand. You can go faster, if you can spend more time day after day.

Where are you trying to get to?

I do get a little defensive on this point. I recognise that most historical languages students don’t want to learn to order a latte in Syriac. But, at the same time, the ability to do so is not irrelevant. Sure, learning how to say “latte” is one piece of extra information that won’t help you read the Peshitta, for instance, but it’s also not a huge burden. Rather, what does it say about us that most students couldn’t, without a great deal of difficulty, string together a sentence to ask for a basic, modern food item? (Don’t @ me about how lattes are neither basic nor food).

So I, like most teachers and students, want students to end up with an ability to read target language texts with understanding. In my ideal world, CEFR B2, or ACTFL Intermediate-High or Advanced-Low is a reasonable benchmark to aim students towards. Sure, they’re not ‘fluent’ (which is itself a super-difficult term to pin down, but I tend to peg ‘fluent’ to C2), but they’re going to be able to read most texts with minimal aids, and understand them, and have a fundamental conversation about those texts – maybe not at the same B2 level, since we’re text/reading focused, but I’d want to see students sustain a conversation about a text they can read at B2, at B1.

And, if you can get to B2, you ought to be able to ‘add on’ enough explicit grammar, in the L2 but also in your L1, to ‘talk grammar’ about a text. (Again, in my ideal world L1 discussion of L2 grammar would be hived off into a separate component of any course, and delayed somewhat to help students not get sucked into a mentality of “okay, here’s an L2 message, let’s analyse its grammar while using L1”.)

B2 seems, to me, also a high but reasonable standard to say, “okay, you should be able to sustain this level and improve it outside an educational facility, primarily by reading more”. A2 isn’t enough for that, B1 is borderline. We all know, though, that plenty of grammar-translation graduates reach great heights of analysis, but lose most of their language in a few short years post-college.

If hour estimates are correct, then it’s a full 800 “teacher-led” hours to get to B2. That’s a really big ask. It requires reconceiving the length of a course of language instruction, the dynamics of the required hours, and a whole range of issues.

And yet, even to get students to A2 is going to take 200 hours or so. That’s still a lot of hours. You can’t take a month long evening course and expect to be fluent in Syriac. You might be able to explicitly learn the linguistic features of Syriac in English in a month, but that is an entirely different thing.
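
To put those numbers in calendar terms, here is a back-of-the-envelope sketch in Python. The 200 and 800 hour figures are just the estimates quoted above; the function name `weeks_to_reach` and the sample schedules (including the shape of the ‘month long evening course’) are my own assumptions for illustration.

```python
from math import ceil

# Rough teacher-led hour estimates quoted in this post, not official figures.
HOURS_TO_LEVEL = {"A2": 200, "B2": 800}


def weeks_to_reach(level, hours_per_week):
    """Weeks of study needed to log the estimated hours for `level`."""
    return ceil(HOURS_TO_LEVEL[level] / hours_per_week)


# A typical-ish course load of 4 contact hours a week.
print(weeks_to_reach("A2", 4), "weeks to A2")
print(weeks_to_reach("B2", 4), "weeks to B2")

# Assumed month-long evening course: 4 weeks x 3 evenings x 2 hours.
evening_course_hours = 4 * 3 * 2
print(evening_course_hours / HOURS_TO_LEVEL["A2"], "of the way to A2")
```

At four hours a week that is 50 weeks to A2 and 200 weeks to B2, and the evening course gets you about an eighth of the way to A2. The only lever that really changes those numbers is hours per week.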

And so, if you’ve put up with me this far:

  • Stop trying to cram everything in. It doesn’t work and it’s not effective, unless you redefine efficacy to mean cramming.
  • Drastically raise your idea of the hours you’re going to have to commit to a language to get truly ‘decent’ at it (let alone ‘master’ it).
  • Drastically lower (some of you!) your expectations of how soon you’ll be able to do more than the basics.
  • Don’t lose heart – language acquisition isn’t that much about talent or aptitude (maybe not at all, I think), but persistence and time invested.
  • You don’t get to throw out a communicative method as irrelevant or ‘doesn’t work’ until you’ve put in a good 600 hours, thanks. Then come back and tell me how it doesn’t work.

What (Greek) pronunciation should I use?

I’m surprised how often this question comes up, but I’m asked it relatively commonly.

First, a very, very brief précis of the main options:

  1. Reconstructed Koine (Buth or Buth-similar options)
  2. Erasmian (US)
  3. Restored Attic (Allen-Daitz)
  4. Modern
  5. UK traditional

Erasmian is what has been taught ‘traditionally’, i.e. for a few hundred years, and is dominant in US circles. It is often ‘infected’ with American-English vowels, which removes it even further from both genuine Erasmian and historical accuracy. Both (1) and (3) are serious attempts to reconstruct and produce the sounds of Greek as pronounced in the 1st century CE and the 5th century BCE respectively, more or less. (4) is a recognition that Greek continued to evolve, and is in wide use in non-Anglophone contexts. It’s also quite appropriate for Byzantine texts. (5) I only mention because it exists as its own relatively idiosyncratic system, one that was prevalent for a long time.

What should you use? Here are my principles:

  1. A pronunciation system that is historically accurate for the period that is your major interest.
  2. A pronunciation system that enables conversation with other speakers.


The first of these makes good sense – if you’re reading Imperial/Koine texts, you probably ought to read with a scheme that matches. If you mainly read classical texts, by all means use a classical pronunciation. If you predominantly read Byzantine texts, I’d shift your pronunciation to Byzantine or Modern.

But, I wouldn’t suggest trying to alter one’s pronunciation based on the period of text. So, I read classical texts with a Koine pronunciation. Which is, by the way, what I imagine the 4th century church fathers did too! I can manage a classical accent, though never well, but I don’t normally try to.

The second principle is an acknowledgement that Greek should be spoken and ‘lived’, and if you’re in a context where a different scheme prevails, you should consider accommodating. E.g., I teach a class where the students are used to a pronunciation much closer to Modern. I don’t fully accommodate, but I do partially accommodate, and I don’t error-correct on pronunciation (or on other things, really).

Once you have a good base in one pronunciation, you can understand others, if you’re mindful. Just as I can understand a Latin speaker with ecclesiastical pronunciation (even if, to be honest, it does grate on me a little), I can converse (not at a high level!) with someone using Erasmian or Modern, provided I’m mindful of the difference and mentally adjust a few words.

In sum, I think there are good reasons for using practically anything except Erasmian (and least of all UK traditional!). But beyond that, just stop worrying about it so much: you have much bigger problems ahead of you in the language learning journey.

Should I read more easy/intermediate/hard reading material?

Inspired by a recent conversation, and again a question that I get every now and again.

My suggestion is to weight your reading towards the easy (anything you can read with 98-100% comprehension, at a good reading pace), with some intermediate (anything you can read with 90% comprehension, where you occasionally might need to pause to figure something out, or look up a very occasional word).

This should be your staple, for language acquisition purposes.

What about hard? Intensive reading? Figuring out that damnable Horace?

Here’s my question: Is there any pressing reason for you to read texts beyond your current proficiency?

If the answer is yes, e.g. you’re doing a course, you’ve got exams, you’re translating something for money, you’ve got a life-geas to understand Horace, then yes, spend some time in intensive/hard reading. Look up every word, diagram those sentences, get out your Loeb, do whatever it takes to make that text understandable. And then, read it, and re-read it, and make it your own. Tame that text. Memorise it. Domesticate it.

But if the answer is no, then why? From a language acquisition perspective, the time you spend toiling over 20 words of poetry might have been spent reading pleasantly and rapidly through 20 pages (well, maybe not 20) of not-so-difficult material, which is still building up your language proficiency, still giving you input, still working you towards the day when those lines will make a lot more sense, with a lot less effort.

So, if time and circumstances are on your side (and even if they are, to some extent, not), I prefer to weight readings towards the easy.

“The Switch” : Thinking in a foreign (dead) language

A couple of people have asked me recently what this is like, and expressed something of their frustration that they can’t go from “mental translation” to “thinking in (Greek, Latin, Hebrew, Klingon)”.

I do think this is hard to explain, especially if you haven’t experienced it for any language – if you’re a monoglot who has always thought in one language as long as you can remember, and your primary (or only) experience of language learning is grammar/translation/dead-languages, this is just hard to conceptualise.

So, here’s a suggestion. Watch some YouTube:

Here’s day 1 TPR in French, or Rico teaching Greek, or do a search for anything TPR, TPRS, WAYK and ‘day 1’.

These are classes in which you should be able to just watch, and match language to action. Sure, I know there’s some translation going on in your head. I would generally discourage that, but also don’t stress about it. In saying I’d discourage it, I really mean, “don’t try to translate”, I don’t mean, “fight against any instinct to translate”

For me, personally, operating in an L2 is a bit like flicking a ‘switch’. One minute I’m in English, and then I flip that switch, and start thinking in, say, Gaelic. And I keep thinking in Gaelic and talking in Gaelic, as much as I can. That tends to only get disrupted when I hit a point where I can’t find the word for what I want to express. That’s like an obstacle in my mental ‘flow’, and it will get filled. Sometimes it gets filled by a random other language (so I will almost always have a random word from one language slip into others), or by English. If it’s not a big deal, you can kind of flow around that and keep going. In languages where I’m really fluent (like I used to be in Mongolian), I can just keep operating in the language on and on, and in fact dropping in some English won’t disrupt me. That’s genuine code-switching, rather than language interference.

The thing I’d say is, just keep at it. “It” here is reading, listening, exposing yourself to as much easy, comprehensible language input as you can. The more you can pile this up, the more input you’re getting, the more you will be able to make that jump. Don’t stress about it, but don’t keep encouraging the translation habit.

Start a spoken (Latin/Greek/Whatever) club today: my biggest classics regret

My biggest regret is that in all my time as a student I never took the step to start a group to talk Latin or Greek.

Now, admittedly I did a lot of my classics, in particular, as a distance student, and as a disconnected graduate student, but for most of my student life I was still convinced that active, communicative language approaches were (and are) invaluable. But I didn’t have the confidence, either in myself or in speaking, to start such an enterprise. I was waiting, I don’t know what for. This, I think, was a big mistake.

I don’t really care if you’re at an institution and all your teachers are super-conservative die-by-the-grammar types. If you think speaking Latin (etc.) is a good idea, start now.

If you’re not confident/have no idea, here’s my prescription:

  1. You can pick up a text designed for spoken work: Polis Institute’s Polis for Greek and Forum for Latin are good choices. If you’ve got some language under your belt, then the level of language isn’t your barrier, it’s having a source of inspiration to use to bootstrap or jumpstart the speaking side of things.
  2. It’s actually incredibly easy to have super-basic conversations about a text in language. Grab out your dog-eared copy of Oxford Latin Course (uel similis, ach please not Wheelock; Athenaze in a pinch), and do basic conversational comprehension: quis, quem, quid facit, cuius, quo, et cetera. Then just extend it a little: cur? ut quid faciat? and so on. (With some time, I’ll mock up some such basic conversations).
  3. Remember, you don’t have to talk at any particular level, you certainly don’t need to talk at the level you read! Just strip it down to the most basic, start there, take notes on things you suddenly realise you don’t know how to say, and look them up later.

Oh, you, like me, have no friends except on the internet? Time to commit to internet chat times. This is 2018, you can find someone(s) to chat Greek to. Not many, but they are out there.

So don’t repeat my mistake, start your classics conversation group today. And if you do have grammar-translation-loving professors, just run it right outside their office.