Word Watch: pretexting

15 September 2006

The Hewlett-Packard scandal involving its Chair Patricia Dunn hiring private investigators to spy on other board members has brought the term pretexting to the fore. Pretexting is the obtaining of private records about an individual by pretending to be someone authorized to access them. The term comes from the idea of creating a false pretext justifying access to the data.

The term is not new, however, having been around for at least 14 years. From the 9 March 1992 issue of Computerworld magazine:

Another technique, called “pretexting,” is to get the data by phone after claiming to be an [Social Security Administration] employee from another office where the computer is down.

Somewhat earlier is the more general use of the term to mean the creation of false pretenses. From the Usenet group soc.culture.vietnamese, 4 February 1992, Vietnamese Legend (The Happy Dream):

Worried at not finding him back, he sent for Sinh several times; but the latter refused to return to the Court, pretexting that he had to stay for a while to organize the administration of the occupied country.

Classifying Human Knowledge, Part 2

8 September 2006

Last week we looked at the Dewey Decimal and the Cutter Expansive Classification systems for organizing books. The other major system in use by English language libraries is the Library of Congress Classification or LCC system. The LCC is used and maintained, obviously, by the Library of Congress in Washington, DC and is also used by most of the larger libraries in the United States, including most university and research libraries.

Originally designed by Herbert Putnam in 1897, the LCC replaced Thomas Jefferson’s classification system in the Library of Congress. The top-level hierarchies are based on Cutter’s classification system, but the rest of the LCC system differs markedly from Cutter’s.

The great advantage of the LCC is that usually the same number, the one assigned by the Library of Congress, is used by all other libraries, so finding a book across libraries is easier. Its classification of technical and scientific subjects is also well regarded.

But it does have some distinct disadvantages. Unlike Dewey, the subcategories are not consistent. Each major category is subdivided on its own without consideration of what designations are used in other categories. It is also designed with the needs of the U.S. Congress in mind. So categories are often broken out geographically, even when this doesn’t make much sense for many researchers.

The LCC divides all human knowledge into 20 major categories, each one given a letter:

  • A – General works

  • B – Philosophy, psychology, religion

  • C – Auxiliary sciences of history (e.g., archeology, heraldry, genealogy, biography, diplomatic history)

  • D – History, general and Old World

  • E & F – American history

  • G – Geography, anthropology, recreation

  • H – Social sciences

  • J – Political science

  • K – Law

  • L – Education

  • M – Music

  • N – Fine arts

  • P – Language and literature

  • Q – Science

  • R – Medicine

  • S – Agriculture

  • T – Technology

  • U – Military science

  • V – Naval science

  • Z – Bibliography, information resources

You can see the government influence in the breakout of law, agriculture, military science, and naval science as top level domains.

Second-tier domains are designated with a second letter. The breakout for category P, language and literature, is as follows:

  • P – Philology, linguistics

  • PA – Greek & Latin

  • PB – Modern & Celtic

  • PC – Romanic

  • PD – East & North Germanic

  • PE – English

  • PF – Other West Germanic

  • PG – Slavic

  • PH – Uralic

  • PJ – Semitic

  • PK – Indo-Iranian

  • PL – East Asian, African, Polynesian

  • PM – Native American & artificial

  • PN – General literature

  • PQ – Romance literature

  • PR – English literature

  • PS – American literature

  • PT – Other Germanic literature

  • PZ – Fiction and children’s literature

These second-tier categories are further subdivided into categories designated with numbers. The PE subclass is broken out as follows:

  • PE101-458 – Old English

  • PE501-693 – Middle English

  • PE814-896 – Early Modern English

  • PE1001-1693 – Modern English

  • PE1700-3602 – Dialects

  • PE3701-3729 – Slang

These numbers are followed by a Cutter number for the author’s last name and the year of publication. The LCC number for Word Myths is PE1584 .W55 2004.
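To make the anatomy of such a call number concrete, here is a minimal Python sketch that splits one into its parts and looks the number up in the PE ranges listed above. The function names and the regular expression are illustrative assumptions, not anything published by the Library of Congress, and the pattern only handles the simple "letters, number, one Cutter, year" shape of the example.

```python
import re

# Ranges for subclass PE as listed in the post above (not the full LCC schedule).
PE_RANGES = [
    (101, 458, "Old English"),
    (501, 693, "Middle English"),
    (814, 896, "Early Modern English"),
    (1001, 1693, "Modern English"),
    (1700, 3602, "Dialects"),
    (3701, 3729, "Slang"),
]

# Hypothetical pattern: class letters, class number, ".Cutter", four-digit year.
CALL_NUMBER = re.compile(r"^([A-Z]{1,3})(\d+(?:\.\d+)?)\s*\.([A-Z]\d+)\s+(\d{4})$")


def parse_lcc(call_number):
    """Split a simple LCC call number into subclass, number, Cutter, and year."""
    match = CALL_NUMBER.match(call_number.strip())
    if match is None:
        raise ValueError(f"unrecognized call number: {call_number!r}")
    subclass, number, cutter, year = match.groups()
    return {
        "subclass": subclass,
        "number": float(number),
        "cutter": cutter,
        "year": int(year),
    }


def pe_topic(number):
    """Return the PE topic whose range contains the number, or None."""
    for low, high, topic in PE_RANGES:
        if low <= number <= high:
            return topic
    return None


parts = parse_lcc("PE1584 .W55 2004")
print(parts["subclass"], pe_topic(parts["number"]), parts["cutter"], parts["year"])
```

The gaps between ranges (for example, 694 to 813) simply return None here, since the post does not say what, if anything, they hold.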

A high-level outline of the LCC is available here. (I must go off on a rant here. The Library of Congress charges many hundreds of dollars for access to the complete classification system. While OCLC (Dewey) and the UDC consortiums also charge, these are non-profits and the fees fund the continued maintenance of the systems. The LCC, however, is paid for by the U.S. taxpayer and should be available for free to all who ask. This is unconscionable.)

So far the three systems we’ve examined have represented a single ontological methodology. They are all designed to organize books on the shelves of a library. They are all designed to group like things in a category that is assumed to be useful to researchers. The aim is to place books on the same topic in proximity to one another so that researchers can scan the shelves of the appropriate section for relevant books.

The great advantage of this system is that the collection can grow without reorganizing the catalog. As shelf space in a particular section is used up, the books in that section can be moved without all the other categories changing or moving–all that needs to be changed is the map showing where each section is. It’s a great methodology if your intent is to catalog physical items that can be located in only one place–like books.

There are some obvious drawbacks. One is that most books can be classified into multiple categories. Is T.S. Eliot an American or British poet? Does Cassell’s Dictionary of Slang belong with the dictionaries or with the books on slang? Why is the book Marley & Me, about an unruly Labrador retriever, filed under animal husbandry? If you think about it, dogs are domestic animals, but how many people are actually going to think to look under this category for a book about a house pet?

Another disadvantage is that the categories that are useful change over time. Why does the Library of Congress classify Bantu and Mandarin in the same category (PL)? Because when the categories were created, there weren’t many books in the Library’s collection in those languages. Also, the Library of Congress has a category on East Germany. This was once very useful. But now, aside from books about the period from 1945-89, it’s not terribly relevant. Do we continue to file books about the region in this category?

But most of all, in a digital age such a system isn’t that important. If your "books" are electronic, who cares where they’re "shelved." You can create "virtual shelves" of material as users demand.

One method that is increasingly being used to categorize electronic resources is tagging. It’s not a formal system and information is not classified in advance. Instead, people assign tags to a resource as they go. An electronic source about baseball slang, for example, could be tagged with terms like baseball, slang, language, sports, American sports, etc. There are no restrictions on what tags can be used or on how many can be assigned to each resource.

The idea is that if enough tags are assigned by a number of different people, then patterns will begin to emerge. The baseball slang resource will appear in searches for both baseball and language. The advantage is that since the information can be classified in an unlimited number of categories the problem of looking in the wrong place goes away. With enough tags, you will find the book no matter what search path you take.
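One way to picture this is as an index that maps each tag to the set of resources carrying it. The Python sketch below is only an illustration of the idea; the class name `TagIndex` and its methods are made up for the example and do not describe how any particular site implements tagging.

```python
from collections import defaultdict


class TagIndex:
    """A toy tag index: each tag points at every resource tagged with it."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, resource, *tags):
        # No restrictions: any tag, and any number of tags, may be attached.
        for name in tags:
            self._by_tag[name.lower()].add(resource)

    def search(self, tag):
        # A resource turns up under every tag it was given.
        return sorted(self._by_tag.get(tag.lower(), set()))


index = TagIndex()
index.tag("Baseball slang glossary",
          "baseball", "slang", "language", "sports", "American sports")

print(index.search("baseball"))
print(index.search("language"))
```

Because the same resource sits under every one of its tags, it no longer matters which "shelf" a searcher thinks to look on first.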

What would seem to be an obvious criticism of tagging, that there will be little consistency between how different people apply tags, actually turns out to be a great advantage. One person may tag works as being about cinema, while another uses movies. Wouldn’t it be better to use a single tag instead of both of these? Or would it? Aren’t cinema and movies really two similar, but distinctly different categories? In one you have The Bicycle Thief, in the other you have Titanic. There will be some overlap of course, with some using the same tag for both. But by ordering search results by "relevance," the person searching under cinema will be presented with The Bicycle Thief first, with Titanic being way down on the list. Someone searching on movies will get the opposite result.
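A simple stand-in for "relevance" is to count how many people applied a given tag to each work and sort by that count. The sketch below does exactly that in Python; the tagging counts are invented for illustration, and real search engines use far more elaborate ranking than this.

```python
from collections import Counter


def rank(tag_events, query):
    """Order works by how many times they were given the queried tag."""
    counts = Counter(work for work, tag in tag_events if tag == query)
    return [work for work, _ in counts.most_common()]


# Made-up data: each pair is one person applying one tag to one work.
events = (
    [("The Bicycle Thief", "cinema")] * 9
    + [("The Bicycle Thief", "movies")] * 1
    + [("Titanic", "cinema")] * 1
    + [("Titanic", "movies")] * 9
)

print(rank(events, "cinema"))
print(rank(events, "movies"))
```

Both works appear under both tags, but the inconsistent tagging is what pushes each one to the top of the list where most people expect to find it.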

And tagging makes virtual shelves possible. In the older systems, classifying 20th century American authors would be alphabetical. So Robert Heinlein would be interposed between Zane Grey and Larry McMurtry. Wouldn’t it be better to have Heinlein classified with Isaac Asimov in Sci Fi, and group Grey and McMurtry with the other Westerns? Tags let you organize 20th century American authors alphabetically or by genre, whatever your needs are at any particular moment.
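As a rough illustration of a virtual shelf, the Python sketch below arranges the same four authors two ways: one alphabetical run, and one shelf per genre tag. The data layout and function names are assumptions made for the example.

```python
# (last name, first name, genre tag) for the authors mentioned above.
AUTHORS = [
    ("Grey", "Zane", "Western"),
    ("Heinlein", "Robert", "Sci Fi"),
    ("McMurtry", "Larry", "Western"),
    ("Asimov", "Isaac", "Sci Fi"),
]


def alphabetical_shelf(authors):
    """One physical-style shelf: everyone in order of last name."""
    return [last for last, _first, _genre in sorted(authors)]


def genre_shelves(authors):
    """Virtual shelves: one alphabetical run per genre tag."""
    shelves = {}
    for last, _first, genre in sorted(authors):
        shelves.setdefault(genre, []).append(last)
    return shelves


print(alphabetical_shelf(AUTHORS))
print(genre_shelves(AUTHORS))
```

Neither arrangement moves a single book; each is just a different view built from the same records on demand.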

Tagging is not restricted to online resources. One can also use it to classify books and other physical objects. In such a case, each book must be assigned a unique identifier (e.g., ISBN). This ISBN can then be used to locate the physical object when needed. This does not shelve these books in the same place as other books on the same topic, but it is not a disadvantage in a closed stack library where books are brought to the researchers.

An example of a catalog that uses tagging is www.librarything.com, which brings us back full circle, for it was that web site that got me thinking about cataloguing. Another example is del.icio.us, a site that catalogs web bookmarks.

To read an excellent discussion of ontological methodologies and a more complete description of tagging, check out Clay Shirky’s Ontology is Overrated.

Words On The Web: www.oclc.org/worldcat

1 September 2006

The folks that bring us the Dewey Decimal System, the Online Computer Library Center, or OCLC, have a great catalog search service. By visiting their web site at http://www.oclc.org/worldcat, you can enter search terms and search a multitude of library catalogs for a book. You then enter your city or postal code and the Worldcat service will list the libraries that have that book in order of their distance from you.

For example, I enter in Word Myths and Emeryville, CA and I’m told that there are 408 libraries in the Worldcat system that have my book. The closest is the University of California Berkeley, some three miles away, followed by the San Francisco Public Library, across the bay some nine miles away. The farthest is the Singapore Polytechnic Library, half a world away.

This is an invaluable resource when you’re looking for a particularly hard-to-find book.

Classifying Human Knowledge, Part 1

1 September 2006

I’ve spent the last week organizing my library, a task that, surprisingly, has turned out to be quite interesting. In an effort to find a classification scheme that works for me, I’ve been looking at and learning about the various systems in use in libraries around the world.

The most famous is perhaps the Dewey Decimal System. Invented by Melvil Dewey in 1876, it is the most widely used library classification in the United States, used primarily by public and primary school libraries. The DDS divides all human knowledge into ten major divisions. Each of these has ten possible subdivisions, each of which has ten more, and so on. Hence the decimal.

The top level domains are:

  • 000 – Computer science, information, general works

  • 100 – Philosophy and psychology

  • 200 – Religion

  • 300 – Social sciences

  • 400 – Language

  • 500 – Science

  • 600 – Technology

  • 700 – Arts and recreation

  • 800 – Literature

  • 900 – History and geography

In the language category, for example, the subdivisions are:

  • 400 – General

  • 410 – Linguistics

  • 420 – English

  • 430 – Other Germanic languages

  • 440 – French, Provencal, and Catalan

  • 450 – Italian and Romanian

  • 460 – Spanish and Portuguese

  • 470 – Latin

  • 480 – Greek

  • 490 – Other languages

English, again for example, is broken into:

  • 421 – Writing system and phonology

  • 422 – Etymology

  • 423 – Dictionaries

  • 424 – Not used

  • 425 – Grammar

  • 426 – Not used

  • 427 – Language variations (dialects and slang)

  • 428 – Usage

  • 429 – Old English

The same numbers are used across the various categories to denote similar subdivisions. So 432 is German etymology and 482 is Greek etymology.

These categories can be further extended by numbers following a decimal point to further classify the work. The number .73, for example, denotes the United States. So the call number 427.73 is a book about American dialect. This consistent use of the same numerical combinations across all subdivisions (e.g., 973 is history of the United States) makes it easy for those familiar with the system to see how a book is classified.
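The way each digit carries its own meaning can be sketched in a few lines of Python. The tables below are copied from the lists above and cover only the 400s; treating the third digit as shared across every language division follows the claim in the text, and digit 9 is left out because the list shows it only as Old English. The function name `decode` is made up for the example.

```python
# Second digit of a 4xx number: the language division (from the list above).
LANGUAGE_DIVISIONS = {
    "0": "General", "1": "Linguistics", "2": "English",
    "3": "Other Germanic languages", "4": "French, Provencal, and Catalan",
    "5": "Italian and Romanian", "6": "Spanish and Portuguese",
    "7": "Latin", "8": "Greek", "9": "Other languages",
}

# Third digit: the subdivision reused across languages (4 and 6 are not used).
SHARED_SECTIONS = {
    "1": "Writing system and phonology", "2": "Etymology",
    "3": "Dictionaries", "5": "Grammar",
    "7": "Language variations (dialects and slang)", "8": "Usage",
}

# Digits after the decimal point; only the one suffix given in the text.
GEOGRAPHIC_SUFFIXES = {"73": "United States"}


def decode(call_number):
    """Expand a 4xx Dewey number into the labels each digit stands for."""
    whole, _, fraction = call_number.partition(".")
    if len(whole) != 3 or not whole.isdigit() or whole[0] != "4":
        raise ValueError("this sketch only covers the 400s (Language)")
    labels = ["Language", LANGUAGE_DIVISIONS[whole[1]]]
    if whole[2] in SHARED_SECTIONS:
        labels.append(SHARED_SECTIONS[whole[2]])
    if fraction in GEOGRAPHIC_SUFFIXES:
        labels.append(GEOGRAPHIC_SUFFIXES[fraction])
    return labels


print(decode("427.73"))
print(decode("482"))
```

Reading 432 and 482 through the same third-digit table is what makes the system predictable for anyone who has learned it once.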

Since there are many different books in these broad categories, the category number is usually followed by a Cutter number (see below) that denotes the author’s name, e.g., T911 is Mark Twain and the category 813 T911 contains fiction by Twain (81 American Literature, 3 Fiction). For prolific authors, like Twain, this Cutter number is often followed by an alphabetic sequence that either represents the title or the order in which the library acquired the book–so that new acquisitions can simply be put at the end of the appropriate shelf. So The Adventures of Huckleberry Finn might have a call number of 813 T911 Ad, or 813 T911 Fi, or, as it is shelved in the Berkeley Public Library, 813 T911zb.

It’s often thought that the Dewey system is for non-fiction only. This erroneous notion is because many libraries don’t use Dewey to classify fiction. Instead they use the author’s last name alone. This is helpful to general readers who just want to find the book and don’t care if T.S. Eliot is classified as American or British. So Huck Finn is classified as Fic Tw in many libraries. Similarly, biography is often not filed in the Dewey category of 920 and instead a book of Twain’s life is filed under B Tw.

The chief problem with the Dewey Decimal system is that it is very American and European focused. For example, most of the languages of the world are crammed into the 490 category. Arabic, Native American, and Finnish can all be found here. It is kept up to date by the Online Computer Library Center, which owns the rights to the system, with new categories, like computer science, added from time to time. But it is very much captive to a 19th century American view of the relative importance of various classes of knowledge.

An improvement over Dewey is the Universal Decimal Classification or UDC. Invented by Belgian bibliographers Paul Otlet and Henri la Fontaine at the beginning of the 20th century, it is a variation on Dewey’s original system. It is rarely used in the United States, but is the primary system for library classification in Britain and other English-speaking countries and can frequently be found to classify libraries in non-English-speaking countries as well. Like the Dewey system, it is kept up-to-date by a consortium of libraries.

The high level categories are similar to Dewey, except that 4 is not used and language and linguistics are grouped with literature in 8. The subcategories are organized so they are more easily extensible. You can keep adding digits to become more specialized.

The UDC also includes a notation system for denoting the relationship between categories in a book. This is especially powerful.

  • + plus sign, means that the book is about the two categories

  • / slash, means the book covers all the categories between the two numbers given

  • : colon, means the book is concerned with the relationship between the two categories

  • [ ] brackets, combines categories into a single unit

  • = equals sign, denotes the language in which the book is written.

So 31:[622+669](485)=20 is a book of statistics on mining and metallurgy in Sweden that is written in English: Statistics:[Mining+Metallurgy](Sweden)=English.
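The expansion of that example can be reproduced mechanically. The Python sketch below swaps each number for its label while leaving the relationship signs untouched; it knows only the five codes used in the example, and the rule that a number in parentheses is a place and a number after an equals sign is a language is inferred from that one example rather than taken from the UDC tables.

```python
import re

# Only the codes that appear in the example above.
SUBJECTS = {"31": "Statistics", "622": "Mining", "669": "Metallurgy"}
PLACES = {"485": "Sweden"}
LANGUAGES = {"20": "English"}

# A number in parentheses, a number after "=", or any other number.
TOKEN = re.compile(r"\((\d+)\)|=(\d+)|(\d+)")


def expand(notation):
    """Replace each UDC number with its label, keeping + / : [ ] ( ) = as-is."""
    def label(match):
        place, language, subject = match.groups()
        if place is not None:
            return "(" + PLACES[place] + ")"
        if language is not None:
            return "=" + LANGUAGES[language]
        return SUBJECTS[subject]

    return TOKEN.sub(label, notation)


print(expand("31:[622+669](485)=20"))
```

An unknown code raises a KeyError, which seems the honest behavior for a table this small.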

A third system is the Cutter Expansive Classification. Invented by Charles Cutter in the 1880s and 1890s for Boston’s Athenaeum library, it is used by only a few libraries, mostly in New England. The top level domains of the Cutter system are:

  • A – General works

  • B-D – Philosophy, psychology, religion

  • E-G – Biography, history, geography

  • H-J – Social sciences, law

  • L-T – Science, technology

  • U-Vs – Military, sports, recreation

  • Vt-W – Theater, music, fine arts

  • X – Philology, language

  • Y – Literature

  • Z – Book arts, bibliography

The Cutter system also denotes the size of the volume in its call number, using points (.), pluses (+), and slashes (/) to denote books of small to large size. This is very useful if over or undersized works are stored separately or for quickly locating books on the shelves.

Cutter also devised an ingenious system for classifying authors’ names. Cutter created tables of two or three digits that stood for the rest of the name of an author. A214, for example, is John Adams. These tables are in use in most libraries to form the basis of the author’s name portion of call numbers.

Next week: Library of Congress Classification and tags

Words On The Web: LibraryThing.com

25 August 2006

A persistent vexation of mine is not being able to find the book I want. I know it’s on the shelf somewhere, but I just can’t find it. I’ve often spent ten minutes or more tracking down a book. My personal library is large (over 500 books), but it is by no means huge. Another issue is that I occasionally find myself buying multiple copies of a book–I forget what books I already own. I’ve often thought that I can’t be the only one with this problem and that there must be an easy way of organizing my books that someone else has pioneered.

Well, this week I discovered LibraryThing.com. It is a sublime website. Cataloguing a library of some size is never easy, but LibraryThing.com makes it nearly so. So what is LibraryThing?

First, it is a site designed to help you catalog your books. You can enter your books online–usually just a few words from the title and the author’s last name–and hit the search button. LibraryThing will create a catalog entry for you based on the catalog of the Library of Congress, Amazon.com, or any one of several dozen major libraries around the world. It will give you the Library of Congress and Dewey Decimal call numbers, the ISBN, publisher information, etc. In just a few hours I created catalog entries for over half of my books and I expect to be done by Sunday.

The search function works incredibly well. Gone are my days of going to the Library of Congress website to find data on a book. LibraryThing’s search interface is far easier and much faster. Although, LibraryThing does have trouble searching on the classics. Searching on "Dracula, Stoker", for example, turns up several hundred possibilities. These include commentaries on the novel as well as the primary work itself. And there is no easy way to sort the returned entries–a Googlesque problem. But for most books, published in a handful of editions, this is not an issue.

You can also add your own tags to the entries in your collection. So you can tag all your books on quotations, or on slang, or about dogs, or 18th century French poetry. Whatever tags meet your needs.

Your catalog data resides on the LibraryThing servers. (You have the choice of whether to keep it private or make it available for viewing to others.) But you can download it in comma or tab-delimited formats for use by spreadsheets or database programs. There are even features to allow you to check your catalog from a mobile phone. (Useful when standing in the bookstore wondering if you already have a copy of that book you are about to buy.) And for those with large libraries, having a list of all your books offsite will help in reestablishing your collection in case of fire or other disaster.
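Once the catalog is in a delimited file, the "do I already own this?" check takes only a few lines. The Python sketch below reads tab-delimited text with the standard csv module; the column names `title` and `author` and the two sample rows are assumptions for illustration, not the actual layout of a LibraryThing export.

```python
import csv
import io

# Made-up sample in tab-delimited form; a real export may use other columns.
SAMPLE_EXPORT = (
    "title\tauthor\n"
    "Word Myths\tDavid Wilton\n"
    "Cassell's Dictionary of Slang\tJonathon Green\n"
)


def owned_titles(export_text, delimiter="\t"):
    """Collect the titles in a delimited export, normalized for comparison."""
    reader = csv.DictReader(io.StringIO(export_text), delimiter=delimiter)
    return {row["title"].strip().lower() for row in reader}


def already_own(title, titles):
    """True if the title is in the catalog, ignoring case and outer spaces."""
    return title.strip().lower() in titles


titles = owned_titles(SAMPLE_EXPORT)
print(already_own("word myths", titles))
print(already_own("Dracula", titles))
```

Passing `delimiter=","` handles the comma-delimited flavor the same way.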

The second aspect of the site is community. There are discussion forums galore. You can find other users who share the same tastes as you. (I share at least 41 books with another prominent contributor to the Wordorigins discussion forums. Go to the site and try to find him–his nearly 5,000 books put my paltry 500 or so to shame.) Users can contribute reviews and share cataloging schemes. You can get a list of recommended books based on what is in the libraries of readers similar to you.

It’s also fun to look at some of the statistics of the books cataloged. As I write this, there are 71,838 collections, containing 5,087,028 books, of which 1,175,812 are unique works. The most popular author is (no surprise) J.K. Rowling with 37,552 copies of her books in the combined collections. Stephen King is second with 28,824. The Bard rolls in at seventh place with 15,860. The most popular book is Harry Potter and the Half-Blood Prince with 6,047 copies–Harry Potter books occupy the top six positions, all with over 5,000 copies–attesting to the reader loyalty engendered by the series. In seventh is The Da Vinci Code (4,662). And giving some solace to those who are despairing at the lack of "great" books, Orwell’s 1984 takes the eighth spot (3,835). (Ironic, as LibraryThing is the antithesis of Big Brother.) The Catcher In The Rye (3,710) and The Hobbit (3,709) round out the top ten.

The site allows you to catalog up to 200 books for free. You can buy an annual membership for $10 that allows you an unlimited number of books in your catalog. Or a lifetime membership is just $25. So, the fees are quite reasonable and I, for one, was happy to pay to help keep such a site going.

I’m going to be spending the weekend in ontological ecstasy. I’ve already decided to rearrange the books by LC call number, with some variations from the official scheme that make sense for me (such as grouping all my books on toponyms together instead of regionally as the LC does). I haven’t decided whether or not to label the books with the call number. I’ll probably not do so, at least not at first.