I didn’t find the dark matter statistic that surprising, because it reminded me, somewhat vaguely, of an earlier post about hapax legomena and Zipf’s law (started by happydog; I believe happydog’s point was that an author writing about the issue had improperly said that math “shapes” our use of language, but I was struck by Zipf’s law itself).
I seem to recall that, among other things, Zipf’s law implies that, in a given corpus, hapax legomena (words that are used only once in that corpus) make up about 50% of the distinct words used (this wasn’t Zipf’s main point; it is just one thing that follows from his law).
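For anyone who wants to check the figure against a text of their own, here is a minimal sketch of the count. The function name `hapax_fraction` and the crude regex tokenizer are my own assumptions, not anything from the earlier post, and a real study would tokenize more carefully.

```python
import re
from collections import Counter


def hapax_fraction(text):
    """Return the share of distinct words in `text` that occur exactly once."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    # Divide by the number of distinct words, not by the total word count.
    return hapaxes / len(counts)
```

Note that the denominator is the number of distinct words, so a very common word like "the" counts once no matter how often it appears.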
When the corpus is published works, it seems likely that hapax legomena will make up a somewhat lower percentage of the words used (since such words are disproportionately likely to be pruned by an editor). But, even in such a corpus, hapax legomena likely make up a significant percentage of the words used.
If the corpus is all published works, it seems unlikely that most hapax legomena would show up in most dictionaries, and, for that matter, words used only twice, three times, and so on are also unlikely to get picked up. So it’s not too surprising that a little over half of the words used in published works are dark matter.
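To put rough numbers on the "twice, or three times" part: under the textbook form of Zipf’s law (frequency proportional to 1/rank, which real corpora only approximate), the share of distinct words that occur exactly n times comes out to roughly 1/(n(n+1)). The sketch below just adds those shares up; the function names are hypothetical, and the result is an idealization rather than a measurement of any actual corpus.

```python
from fractions import Fraction


def idealized_share(n):
    """Idealized share of distinct words occurring exactly n times (assumes pure Zipf)."""
    return Fraction(1, n * (n + 1))


def cumulative_share(max_n):
    """Idealized share of distinct words occurring at most max_n times."""
    return sum((idealized_share(n) for n in range(1, max_n + 1)), Fraction(0))
```

Under that idealization, `cumulative_share(3)` works out to 3/4, so words used one to three times would already account for about three quarters of the vocabulary.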