The disappearing goal of corpus fluency: the 98/56 paradox

If we take 98% vocabulary coverage as what is needed to read a text at a comprehensible and ‘fluent’ level, so that the unknown 2% is either understandable from context or not significant enough to frustrate understanding, then for the New Testament corpus that requires learning around 56% of the total vocab (3102 lemmas out of 5461, depending on how you lemmatise it; that’s my breakdown based on Tauber’s MorphGNT on SBLGNT).
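For anyone who wants to run this kind of count themselves, here is a minimal sketch of the arithmetic: rank the lemmas by frequency and count how many you need before the running total of tokens reaches the target share. The function name `lemmas_needed` and the toy frequency table are my own illustrative assumptions; this is not the MorphGNT data and not necessarily how the 3102 figure above was produced.

```python
from collections import Counter

def lemmas_needed(lemma_tokens, target=0.98):
    """Return (lemmas needed, distinct lemmas) to reach `target` token coverage.

    `lemma_tokens` is one lemma per running word of the corpus.
    """
    counts = Counter(lemma_tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the lemmas from most to least frequent, accumulating tokens.
    for n, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return n, len(counts)
    return len(counts), len(counts)

# Hypothetical toy corpus: 200 tokens over 10 placeholder lemmas with a long tail.
toy_counts = {"w1": 120, "w2": 50, "w3": 16, "w4": 5, "w5": 2,
              "w6": 2, "w7": 2, "w8": 1, "w9": 1, "w10": 1}
toy_tokens = [lemma for lemma, freq in toy_counts.items() for _ in range(freq)]

needed, distinct = lemmas_needed(toy_tokens, target=0.98)
print(f"{needed} of {distinct} lemmas ({needed / distinct:.0%}) for 98% coverage")
```

Even in this tiny made-up example the first three lemmas cover 93% of the tokens, yet getting to 98% means going well down into the low-frequency tail, which is the same shape as the NT figure.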

That’s a lot of vocab: over 3000 words. It takes you well into the ‘two occurrences or fewer’ bracket. And some initial ‘soundings’ for other corpora suggest a similar ratio: to hit the magic 98% mark, you need a vast amount of vocab, including a rather large amount of the low-yield stuff. Now, Tauber will no doubt berate me/insist that I remind you that 98% coverage of the corpus does not equate to 98% coverage of any particular smaller unit (see here). And in fact, he’s absolutely correct. Indeed, from those figures (over 10 years old; how slowly some of these things move), the 3000 top-frequency items render 81% of the verses 95% familiar, which is still a lot less than you’d like.
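The gap between corpus-level and verse-level coverage is easy to see with a small sketch. It assumes the corpus is available as a list of verses, each a list of lemmas; the function name `verse_familiarity` and the toy verses are hypothetical, and this is not a reconstruction of Tauber’s own calculation.

```python
from collections import Counter

def verse_familiarity(verses, top_n, threshold=0.95):
    """Return (corpus token coverage, share of verses at or above `threshold`)
    for a reader who knows only the `top_n` most frequent lemmas."""
    verses = [v for v in verses if v]  # ignore empty verses
    counts = Counter(lemma for verse in verses for lemma in verse)
    known = {lemma for lemma, _ in counts.most_common(top_n)}
    corpus_cov = sum(counts[l] for l in known) / sum(counts.values())
    familiar = sum(
        1 for v in verses
        if sum(lemma in known for lemma in v) / len(v) >= threshold
    )
    return corpus_cov, familiar / len(verses)

# Hypothetical toy "verses" of placeholder lemmas; the rare ones cluster in one verse.
toy_verses = [
    ["a", "b", "a", "c"],
    ["a", "b", "a", "b"],
    ["a", "c", "a", "b"],
    ["a", "x", "y", "z"],
]

corpus_cov, verse_share = verse_familiarity(toy_verses, top_n=3)
print(f"corpus coverage {corpus_cov:.0%}, verses at 95%+ familiar: {verse_share:.0%}")
```

Because unknown words are not spread evenly, the two numbers come apart: a corpus-wide percentage tells you little about how any single verse will read.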

And the paradox is this: learning low-frequency vocab (say, words occurring fewer than 5x in the NT) is incredibly low-yield, because you are going to encounter that lemma fewer than five times as you read through the whole corpus. Let alone a 2x frequency word. So the pay-off in terms of understanding is low, and the word is also far harder to learn, because it isn’t repeated often enough for you to meet it regularly. Indeed, to encounter any of those low-frequency New Testament words often enough to learn them, you would have to encounter them outside the New Testament.

Which is why this is a paradox of sorts: to master a particular corpus, whether that’s Plato, Demosthenes, or the NT, actually requires reading outside the corpus, because that’s the only way you’ll get enough context, exposure, and repetition to render the corpus’s low-frequency vocabulary meaningful in the context of the broader language (something that living in, say, Ancient Greece would have done for you, but you have to make do with what you’ve got).

Moral of the story? If you want to master a narrow corpus, you will have to read more widely than that corpus.
