Where Do English Words Come From?

A recent thread on this site’s discussion forum got me thinking about the borrowing of foreign words into English, a language with a reputation for indiscriminately appropriating words from other languages. Etymology books and websites, including this one, often highlight the diversity of languages that English draws its vocabulary from, but how much of this reputation is deserved? Does English really borrow that many words? And does English really filch a lot of words from many different languages? The answers may be a bit surprising, but when you look at the data in light of history, they make a lot of sense.

To answer these questions, I collated all the data from the OED etymologies into a spreadsheet. The data is not perfect and must be taken with a grain of salt, but it is generally representative of where English draws its words from. First, the data reflects the number of times a language is mentioned in an etymology; it does not necessarily represent the source languages that English draws its words from. Cognates (words with a common root) are frequently mentioned in the etymologies, and these are represented in the numbers as well. (For example, there is a single appearance of Sanskrit in the etymology of an Old English word. This does not mean that the Anglo-Saxons had some contact with India and borrowed the word. Rather, the OED editors simply pointed out that the Old English word has a cognate in Sanskrit, i.e., they share a common Indo-European root.) Second, the OED predominantly draws from literary sources. Drawing from a different corpus would result in somewhat different numbers. For example, if one were to poll technical writing, the number of words with Latin and Greek roots would rise. But these additional words would also tend to be rather arcane and uncommon, so the OED is a pretty good representation of the breadth of English as the language is spoken by most people.

The chart here represents the sources of English words as a percentage of the total new entries the OED has for that particular century. By using percentages, I correct for the OED’s bias toward drawing upon sources from certain centuries. For example, since the OED was primarily compiled in the nineteenth century, the dictionary has far more new words from that period than from any other: almost as many as the seventeenth and eighteenth centuries combined, and more than twice as many as the twentieth. Using percentages allows comparisons across the centuries.
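The percentage calculation described above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet formula I actually used: the function name `percent_by_century`, the dictionary layout, and the sample counts are all made up for the example and are not OED figures.

```python
def percent_by_century(counts):
    """Convert raw counts per source into each source's share of its century.

    `counts` maps a century label to a dict of {source: raw count}.
    Returns the same shape with percentages instead of counts.
    """
    result = {}
    for century, by_source in counts.items():
        total = sum(by_source.values())
        result[century] = {
            # Guard against an empty century to avoid dividing by zero.
            source: (100.0 * n / total if total else 0.0)
            for source, n in by_source.items()
        }
    return result


# Hypothetical counts for illustration only; these are NOT OED numbers.
sample = {
    "19th": {"English": 700, "Latin": 200, "French": 100},
    "20th": {"English": 350, "Latin": 100, "French": 50},
}

shares = percent_by_century(sample)
for century, by_source in shares.items():
    print(century, by_source)
```

In this made-up sample the nineteenth century has twice as many entries as the twentieth, yet each source ends up with the same share in both centuries, which is the sense in which percentages make the centuries comparable.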

The most striking thing upon viewing the chart is the realization that the great majority of English words are formed from already existing English roots. Four times as many words come from English roots as come from Latin, the closest competitor. For all the language’s reputation for borrowing, most English words are homegrown. And most of the non-English words come from only a handful of sources: Latin, French, Old Norse, Greek, and non-language sources, such as acronyms, echoic and imitative words, personal and place names, etc. The influence of these sources also changes over time. The high point for English-rooted words was the Anglo-Saxon era, with over 80% of Old English words coming from native, Germanic roots. The low point was the fourteenth century, when the effects of the Norman Conquest were fully felt. Since then, the amount of borrowing has decreased, and by the twentieth century over 70% of new words were formed with English roots.

By far the biggest impact on the language has been the Norman Conquest of 1066. Prior to the arrival of the Normans, the only major influence on English that shows in the data was Latin, and the OED data probably overrepresents that influence. Most extant manuscripts from the Old English period were copied by monks, so ecclesiastical and other Latinate terms are probably more common in the surviving writing than they were in the everyday speech of the period. We can also see that following the Conquest, Latin all but disappears as a source of new words for several centuries. This is undoubtedly because Anglo-Norman French, not English, was the language of the ruling class. Scholarly, literary, and theological writing, which would use Latinate terms, was far less likely to be written in English than it had been in the Anglo-Saxon era. But in the fourteenth century, the time of Chaucer, we start to see the re-emergence of Latinate terms, as English starts to re-establish itself as the literary language of Britain. Latin would reach a peak in the seventeenth century, the Enlightenment, when over a quarter of new words were formed from Latin roots. Since then, Latin’s influence has declined. Greek, also associated with technical and scientific vocabulary, doesn’t begin to emerge as an influence until the seventeenth century.

Also as a result of the Conquest, French displaced Latin as the primary source of borrowing for a time. The thirteenth century, some two hundred years after the arrival of the Normans, was the peak of borrowing from French, with over a third of new words coming from French roots. That peak also corresponds to the low point of Latin’s influence. So it took about two centuries for French to make itself fully felt. But as new generations of the aristocracy increasingly spoke English, not Anglo-Norman, as their native tongue, the borrowing from French declined. By the sixteenth century and the start of the Early Modern Era, French influence on the English language had sharply decreased.

An oddity in this data is the dating of the Old Norse influence. Overall, Old Norse has not had a large influence on English, with only about 1% of English words coming from Old Norse roots, but in the twelfth century some 10% of new words had Old Norse roots. The odd thing is that this is much too late. The peak of Old Norse influence should date to the Anglo-Saxon period, when Danes, a.k.a. the Vikings, settled much of northern England. I don’t know why the data shows a delay of more than two centuries. The total number of new words from the eleventh and twelfth centuries is small compared to other centuries, a result of far fewer things being written in English following the Conquest. It may be that the Norman influence was less strong in the north of England, where the Old Norse influence was strongest. Perhaps a higher percentage of the relatively few English-language manuscripts were produced in the north during this period, resulting in more Old Norse words making it into the OED corpus in this later period. But that’s just a guess. A closer examination of what is going on here is needed before a definitive conclusion can be drawn.

Other European languages remain steady, contributing 3–4% of new words throughout the centuries. The exception is German, which, starting in the eighteenth century, begins to increase its contribution to English vocabulary, reaching 3% of new words all by itself and nosing ahead of French by the twentieth century.

Non-European languages remained a negligible source of English words until the seventeenth century. Then, as voyages of discovery, colonialism, and imperialism created contact with languages from outside of Europe, words from those languages began to creep into English. But by the twentieth century, all these languages combined were contributing less than 3% of new English words, and no single language made a significant contribution.

The final contributor is non-language sources, things like personal and place names, acronyms, echoic words, and the like, including the infamous “origin unknown.” For most of English’s history, these sources contributed some 3–4% of words, but in the eighteenth century this rose to 5% and then to 10% in the twentieth. This is of note because it represents a significant shift in how new words are created.

I’m including two other charts which visualize the same data in different ways. Instead of using a percentage, I’ve normalized the raw numbers to eliminate the sampling bias in particular centuries. The Y-axis doesn’t represent a count, but is simply a magnitude. The absolute number, from 0 to 201, associated with a particular source in a given century has no meaning except relative to other numbers on this chart.
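For readers who want to try something similar, here is one possible way to normalize raw counts so that heavily sampled centuries don’t dominate. It is a sketch under my own assumptions and not necessarily the calculation behind these charts: it rescales each century so that its total equals the average total across all centuries. The function name `normalize_counts` and the sample counts are hypothetical.

```python
def normalize_counts(counts):
    """Rescale each century's counts so every century has the same total.

    `counts` maps a century label to a dict of {source: raw count}.
    The common total is the mean of the non-empty century totals, so the
    results stay on a count-like scale but are only meaningful relative
    to one another.
    """
    totals = {century: sum(by_source.values())
              for century, by_source in counts.items()}
    nonzero = [t for t in totals.values() if t > 0]
    # Reference total shared by all centuries (0.0 if there is no data).
    reference = sum(nonzero) / len(nonzero) if nonzero else 0.0

    return {
        century: {
            source: (n * reference / totals[century] if totals[century] else 0.0)
            for source, n in by_source.items()
        }
        for century, by_source in counts.items()
    }


# Hypothetical counts for illustration only; these are NOT OED numbers.
sample = {
    "19th": {"English": 700, "Latin": 200, "French": 100},
    "20th": {"English": 350, "Latin": 100, "French": 50},
}

magnitudes = normalize_counts(sample)
for century, by_source in magnitudes.items():
    print(century, by_source)
```

As with the charts, the resulting numbers are magnitudes rather than counts: a value only means something in comparison with the other values produced by the same rescaling.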

If you wish to play with the data yourself, it’s here in Excel format (WordOrigins_rawdata.xlsx) and in comma-delimited format (WordOrigins_rawdata.csv).

[Data source: Oxford English Dictionary Online, accessed May 2014]

