The disappearing goal of corpus fluency: the 98/56 paradox

If we take 98% vocabulary coverage as the threshold for reading a text at a comprehensible, ‘fluent’ level, so that the unknown 2% is either understandable from context or not significant enough to frustrate understanding, then for the New Testament corpus that requires learning around 56% of the total vocab (3102 out of 5461 lemmas, depending on how you lemmatise it; that’s my breakdown based on Tauber’s MorphGNT on SBLGNT).
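For anyone who wants to reproduce that kind of count, here is a minimal sketch in Python of the calculation: sort lemmas by frequency and count how many you need before the running total reaches the target share of tokens. The function name and the toy counts are my own illustration, not MorphGNT data; you would need to build the real lemma counts from the corpus yourself.

```python
from collections import Counter

def lemmas_needed(lemma_counts, target_percent=98):
    """Number of top-frequency lemmas needed to cover at least
    target_percent of all tokens. Hypothetical helper for illustration."""
    total = sum(lemma_counts.values())
    covered = 0
    # most_common() yields (lemma, count) pairs from most to least frequent
    for rank, (_lemma, count) in enumerate(lemma_counts.most_common(), start=1):
        covered += count
        # integer comparison avoids floating-point trouble at the boundary
        if covered * 100 >= target_percent * total:
            return rank
    return len(lemma_counts)

# Toy corpus (NOT real NT figures): 7 lemmas, 100 tokens in total.
toy = Counter({"a": 50, "b": 30, "c": 10, "d": 5, "e": 3, "f": 1, "g": 1})
needed = lemmas_needed(toy, 98)
print(needed, "of", len(toy), "lemmas")
```

Even in this toy, 98% coverage takes five of the seven lemmas, while 90% takes only three. For the real figures above, 3102 out of 5461 works out to roughly 56.8%.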

That’s a lot of vocab: over 3000 words. It takes you well into the ‘2 occurrences or fewer’ bracket. And some initial ‘soundings’ for other corpora suggest a similar ratio: to hit the magic 98% mark you need a vast amount of vocab, including a rather large amount of the low-yield stuff. Now, Tauber will no doubt berate me/insist that I remind you that 98% coverage of the corpus does not equate to 98% coverage of any particular smaller unit (see here). And in fact, he’s absolutely correct. Indeed, from those figures (over 10 years old; how slowly some of these things move), the 3000 top-frequency items render 81% of the verses 95% familiar, which is still a lot less than you’d like.
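Tauber’s point about smaller units can be sketched in a few lines of Python. This is my own toy illustration, not his method or data: each ‘verse’ is just a list of lemmas, and we ask what share of verses reach a given familiarity threshold for a known-vocabulary set.

```python
def share_of_verses_readable(verses, known_lemmas, threshold_percent=95):
    """Fraction of verses in which at least threshold_percent of the
    tokens are known lemmas. Hypothetical helper for illustration."""
    readable = 0
    for verse in verses:
        known = sum(1 for lemma in verse if lemma in known_lemmas)
        # integer comparison: known / len(verse) >= threshold_percent / 100
        if known * 100 >= threshold_percent * len(verse):
            readable += 1
    return readable / len(verses)

# Toy data (NOT real NT verses): four short 'verses' of lemmas.
verses = [
    ["a", "b", "c", "d"],
    ["a", "a", "b", "x"],
    ["a", "b"],
    ["y"],
]
known = {"a", "b", "c", "d"}

# Token-level coverage across the whole toy corpus.
tokens = [lemma for verse in verses for lemma in verse]
token_coverage = sum(1 for lemma in tokens if lemma in known) / len(tokens)

verse_share = share_of_verses_readable(verses, known, 95)
print(f"token coverage: {token_coverage:.0%}, verses at 95%+: {verse_share:.0%}")
```

Here 9 of the 11 tokens are known, yet only half of the verses clear the 95% bar, because the unknown words are not spread evenly across the units.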

And the paradox is this: learning low-frequency vocab (say under 5x in the NT) is incredibly low-yield, because you are going to encounter that lemma fewer than five times as you read through the whole corpus. Let alone a 2x word. So the pay-off in terms of understanding is low, but learning the word is also far more difficult, because it isn’t repeated often enough for the exposure to stick. Indeed, to encounter any of those low-frequency New Testament words often enough to learn them, you would have to encounter them outside the New Testament.

Which is why this is a paradox of sorts – to master a particular corpus, whether that’s Plato, Demosthenes, or the NT, actually requires reading outside the corpus because that’s the only way you’ll get enough context, exposure, and repetition to render the corpus’s low frequency vocabulary meaningful in the context of the broader language (something that living in, say, Ancient Greece would have done for you, but you have to make do with what you’ve got).

Moral of the story? If you want to master a narrow corpus, you will have to read more widely than that corpus.

2 responses

  1. While sitting out a roll last night at my jiu-jitsu training, I was thinking about this topic and recalled this blog post. Reading to master a particular corpus requires reading outside the corpus because that’s the only way you’ll get enough context, exposure, and repetition to render the corpus’s low-frequency vocabulary meaningful in the context of the broader language. Okay. So what does this equate to time-wise? If we follow A.T. Robertson’s advice of half an hour of the Greek New Testament a day, and Adolf Deissmann’s opinion of an hour a day of the LXX, how much more time per day would a person need to spend reading Attic/Koine Greek, and which text(s), to master a particular corpus (i.e., the Greek New Testament, LXX, etc.)?
    Reading Greek: 0:30 + 1:00 + ? = ?

    • I don’t know and I’m not sure you can calculate it anyway. It’s a bit “how long is a piece of string?”

      Because even as you read more outside the (NT) corpus, you still need to read a great deal if you’re trying to hit those low-frequency words (and forms, but I’m only thinking about words today). Some of those NT low-freqs are more frequent outside the NT, but some are just generally low-freq words. And then there’s the issue of trying to curate one’s own reading, i.e. how do you work out what’s worth reading that is going to help your core corpus goal? Theoretically this could be done through algorithms, but we’re not quite there yet.

      Going back to the Paul Nation stuff, the more you try to expand your vocab to less frequent words in a language, the greater and greater the reading demands to acquire that expanded vocab.

      My advice, then, is always, “just read, as much as you can manage”.
