HD: Lexical Dark Matter
Posted: 06 June 2012 05:37 AM
Administrator
Total Posts:  4752
Joined  2007-01-03

Missed this one the other day. But here it is, thanks to Languagehat.

Posted: 06 June 2012 05:23 PM   [ # 1 ]
Total Posts:  3109
Joined  2007-02-26

“52 percent of the English lexicon – the majority of the words used in English books – consists of lexical ‘dark matter’ undocumented in standard references.”

Must say, that’s an astounding statistic.

Posted: 06 June 2012 07:26 PM   [ # 2 ]
Total Posts:  355
Joined  2012-01-10

I didn’t find the dark matter statistic that surprising, because it evoked somewhat vague memories of a prior post about hapax legomena and Zipf’s law (started by happydog: I believe happydog’s point was that an author writing about that issue had improperly said that math “shapes” our use of language, but I was struck by Zipf’s law itself).

I seem to recall that, among other things, Zipf’s law implies that, in a given corpus, hapax legomena, or words that occur only once in that corpus, make up about 50% of the distinct word types (this wasn’t Zipf’s main point; it is just one thing that follows from his law).

When the corpus is published works, it seems likely that hapax legomena will make up a somewhat lower percentage of the words used (since such words are disproportionately likely to be pruned by an editor).  But, even in such a corpus, hapax legomena likely make up a significant percentage of the words used. 

If the corpus is all published works, it seems unlikely that most hapax legomena would show up in most dictionaries, and, for that matter, words used only two or three times are unlikely to get snatched up. So it’s not too surprising that a little over half of the words used in published works are dark matter.
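The hapax count is easy to check on any corpus you have to hand. Here’s a quick sketch (the toy sentence is made up, just for illustration) that computes the share of distinct word types occurring exactly once:

```python
from collections import Counter

def hapax_type_share(tokens):
    """Fraction of distinct word types that occur exactly once."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# Hypothetical toy corpus, just to show the calculation:
text = "the cat sat on the mat and the dog sat near a lonely unicycle"
print(round(hapax_type_share(text.split()), 2))  # → 0.82
```

Even this tiny sample shows how the type-level hapax share runs high: most words in a short text appear only once, and only the grammatical workhorses repeat.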

Posted: 07 June 2012 03:44 AM   [ # 3 ]
Administrator
Total Posts:  4752
Joined  2007-01-03

“I seem to recall that, among other things, Zipf’s law implies that, in a given corpus, hapax legomena, or words that occur only once in that corpus, make up about 50% of the distinct word types (this wasn’t Zipf’s main point; it is just one thing that follows from his law).”

Is this really true? The proportion of hapaxes should depend heavily on the size of the corpus: the larger the corpus, the smaller the share of hapaxes. (To a point, after which the curve should flatten for very large corpora.) The percentage of unique words should be high in a large corpus, but 50% seems too high. And a quick, back-of-the-envelope sampling of hapaxes in Chaucer’s corpus shows that about 30% of the words are unique occurrences in his works. I haven’t counted, but I’d be surprised if the number of hapaxes in Old English approached 30%. (My swag would be 5–10%.)
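The size effect is easy to simulate. This is only a sketch under an assumed setup (a fixed vocabulary of 10,000 “words” with 1/rank Zipfian frequencies; real corpora keep coining new words, so their curve flattens rather than collapses), but it shows the hapax share of types falling as the sample grows:

```python
import random
from collections import Counter

random.seed(42)

# Assumed setup: fixed 10,000-word vocabulary, frequency proportional
# to 1/rank (a simple Zipfian distribution).
VOCAB = list(range(1, 10_001))
WEIGHTS = [1.0 / r for r in VOCAB]

def hapax_share(n_tokens):
    """Share of distinct types that are hapaxes in a random sample of n_tokens."""
    sample = random.choices(VOCAB, weights=WEIGHTS, k=n_tokens)
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

for n in (1_000, 10_000, 100_000):
    print(n, round(hapax_share(n), 2))
```

With a closed vocabulary the share drops steadily; the reason real corpora stay hapax-heavy is exactly the open-ended word classes mentioned below.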

I would think a significant factor in the “dark matter” is the editorial practices of dictionaries. Things like proper names and the formal names of animal species (which probably make up 50% of the “English” lexicon in and of themselves) are not generally included in “standard reference works.” Throw in foreign words that make their way into English-language books, and you can easily hit 50%. It’s not so much that literary works use a lot of obscure words as that dictionaries only cover the more common ones and leave out whole classes of words.

[ Edited: 07 June 2012 04:09 AM by Dave Wilton ]
Posted: 07 June 2012 04:05 AM   [ # 4 ]
Total Posts:  1181
Joined  2007-02-14

There’s also the question of whether, if you have “look” as a dictionary entry but not “looks” or “looked,” that means we have two words that are not in the dictionary.

Posted: 07 June 2012 04:10 AM   [ # 5 ]
Administrator
Total Posts:  4752
Joined  2007-01-03

I’m assuming they’re talking about lemmas, and not inflected forms.
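The difference matters for the counts. A toy sketch (the tiny lemma table is made up for the example; a real count would use a proper lemmatizer such as NLTK’s) shows how counting lemmas instead of surface forms collapses the vocabulary:

```python
# Hypothetical lemma table for illustration only.
LEMMA = {"looks": "look", "looked": "look", "looking": "look"}

tokens = ["look", "looks", "looked", "looking"]

surface_types = set(tokens)                      # four distinct surface forms
lemma_types = {LEMMA.get(t, t) for t in tokens}  # all collapse to "look"

print(len(surface_types), len(lemma_types))  # → 4 1
```

So whether “looks” and “looked” count as missing words depends entirely on which unit the study chose.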

Posted: 07 June 2012 08:01 AM   [ # 6 ]
Total Posts:  2842
Joined  2007-01-31

“The percentage of unique words should be high in a large corpus, but 50% seems too high.”

True, but “not in dictionaries” does not equal “hapax legomenon”. 

In addition to proper names and biological names, there are literally millions of chemical names, many of which would show up in a corpus that included scientific books but would not be listed in dictionaries.

Posted: 07 June 2012 05:31 PM   [ # 7 ]
Total Posts:  3109
Joined  2007-02-26

I’ll pay biological and chemical names ... but I think they are cheating if they are including proper names.
