Computer-assisted Lexicography

Russon Wooldridge

University of Toronto

January 2005 ; online version of R. Wooldridge, "Lexicography", in A Companion to Digital Humanities, Oxford-Malden-Carlton: Blackwell, 2004, p. 69-78.
© 2005 R. Wooldridge

Lexicography is here considered in the literal and concrete sense of the word : the writing of the lexicon, the ordered description of the lexicon of a language in the form of a reference work usually called a dictionary.

The following is intended as a typological and partial account of the use of computers in lexicography, dealing with the essential applications and the main examples, for English and French, of computer-assisted lexicographical products. The dictionaries considered are those intended for the general public ; scant mention will be made of those created for specialists.

1. Nature of the dictionary text

The dictionary has fundamentally the same structure as a telephone directory, a hospital's or general practitioner's medical files, or a library catalogue. Each unit of these collections is a record containing a number of fields, potentially the same for each record (some fields are blank) and placed in the same order, the essential characteristic of this relational database being its recursiveness.

  • Telephone directory entry : name, address, telephone number.
  • Medical record : name, personal coordinates, medical history, progress notes, consultations, lab reports, etc.
  • Library catalogue record : title, author, place and date of publication, subject, material, ISBN number, holdings, etc.
  • Dictionary entry : headword, pronunciation, part of speech, definition, examples, etymology, etc.
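The record-and-field structure listed above can be sketched in a few lines of Python. This is only an illustration, not a description of how any dictionary discussed here is actually stored: the class name `Entry`, the helper `search` and the choice of six fields are assumptions, and the two sample records reuse wording from entries quoted in this article. Each record has the same fields in the same order, some of them blank, which is what allows a query to be restricted to a single field.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Entry:
    """One dictionary record: the same fields, in the same order, for every entry."""
    headword: str
    pronunciation: Optional[str] = None      # blank when not supplied
    part_of_speech: Optional[str] = None
    definition: Optional[str] = None
    examples: List[str] = field(default_factory=list)
    etymology: Optional[str] = None          # blank when not supplied


# Illustrative records only; several fields are deliberately left blank.
ENTRIES = [
    Entry(
        headword="dictionnaire",
        part_of_speech="n. m.",
        definition=("Ouvrage qui recense et décrit, dans un certain ordre, "
                    "un ensemble particulier d'éléments du lexique"),
        examples=["Dictionnaire médical", "Dictionnaire bilingue"],
    ),
    Entry(
        headword="sabotage",
        part_of_speech="n. m.",
        definition="Action de saboter (un pieu, une traverse, etc.)",
    ),
]


def field_names():
    """Return the recurrent field names in their fixed order."""
    return [f.name for f in fields(Entry)]


def search(entries, field_name, term):
    """Return the headwords of entries whose given field contains the term."""
    hits = []
    for entry in entries:
        value = getattr(entry, field_name)
        values = value if isinstance(value, list) else [value]
        # Blank fields (None or empty list) simply never match.
        if any(v and term.lower() in v.lower() for v in values):
            hits.append(entry.headword)
    return hits
```

For instance, `search(ENTRIES, "examples", "bilingue")` looks only at the examples field and ignores definitions, whereas a print dictionary leaves the reader to identify the field from typography and position.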

    Example of two dictionary entries (source : Dictionnaire universel francophone) :

    The recursiveness of the informational fields of the above two dictionary entries is indicated by typography, position and abbreviation : 1) headword in bolded large letters ; 2) part of speech conventionally abbreviated ; 3) definition ; 4) examples of usage in italics. Fields 1, 3 and 4 are given in full because of the idiosyncratic nature of lexical units ; field 2 is given in abbreviated form since its values belong to a small finite class. Typography, position, abbreviation and ellipsis (none of the four fields is explicitly named) are the features of dictionary recursiveness and economy (the dictionary is also a commercial product). Occasional fields tend to be named : "Syn." for synonyms ; "Encycl." for encyclopedic information (normal, systematic, unlabelled information is linguistic) ; "V." for cross-reference to related terms.

    Even the simplest dictionary entries, such as the ones quoted above, tend to be structurally complex. Besides the main binary equations – a) dictionnaire = masculine noun ; b) dictionnaire [means] "Ouvrage qui recense et décrit, dans un certain ordre, un ensemble particulier d'éléments du lexique" ; c) [the word] dictionnaire [typically occurs in expressions such as] Dictionnaire médical (domain of experience), [dictionnaire] étymologique (domain of language) – there are also ternary ones : dictionnaire -> [exemplified in] Dictionnaire bilingue -> [which means] "[dictionnaire] qui donne les équivalents des mots et expressions d'une langue dans une autre langue". (The implicit copulas and other terms are here made explicit and enclosed in brackets.)

    Idiosyncrasy is characteristic of lexis and also of the dictionary, which, in the vast majority of its realisations, is composed by (fallible) human beings. Just as the treatment of lexical units will vary enormously according to part of speech, frequency of usage, monosemy or polysemy, register, and other variables, so will dictionary-writing tend to vary according to time (beginning, middle or end of the alphabet, or of the writing of the dictionary, even day of the week) and writer (entry-writer A and entry-writer B are individual human beings and not clones or machines).

    Setting aside the question of idiosyncrasy and variability of lexical units and dictionary-writing (the latter nevertheless an important obstacle in computerizing the Trésor de la langue française – see below), the well-ordered dictionary requires three types of sophisticated competence on the part of the user : 1) linguistic competence obviously ; 2) dictionary competence, a particular type of textual competence, enabling one, for example, to find a word beginning with m by opening the dictionary more or less in the middle, to know that adj. means adjective/adjectif (and through linguistic competence to know what an adjective is), etc. ; 3) pragmatic competence to make sense of references to the outside world : Le dictionnaire de l'Académie française, "calculateurs électroniques", etc.

    The requirement of different types of user competence combined with the frequent use of ellipsis can result in cases of ambiguity which tax the analytical faculties of the dictionary reader and render powerless the analytical parser of the computer. The following examples are taken from the entry for GAGNER in Lexis (Wooldridge et al. 1992) :

    Each of the seven items contains an equation of synonymy, the equation concerning either the whole or part of the first term : the object of the verb in a and c, the adverbial qualifier in f, the verb in d and g, the whole expression in b and e. Linguistic competence is necessary to equate quelque chose and l' (a), ne... pas lourd with peu (f), the conjugated form ai gagné with the infinitives attraper, prendre and chiper (g). The dictionary user also has to deal with the variability of the synonymy delimiters and indicators (parentheses, brackets, equals sign, syn. label, upper case).

    In brief, the dictionary, in theory a systematic relational database, with ordered records and recurrent fields, may in human practice be as variable as the lexicon it sets out to describe. Successful applications of the computer to man-made dictionaries are then usually modest in their ambitions. Computer-driven dictionaries (machine dictionaries) tend to be procrustean in their treatment of language, or limit themselves to relatively simple areas of lexis such as terminology.

2. Pre-WWW (World Wide Web)

    Modern lexicography did not wait for the invention of the computer, nor even that of the 17th-century calculating machines of Leibniz and Pascal, to apply computer methods to dictionaries. In 1539 the father of modern lexicography, Robert Estienne, King's Printer, bookseller, humanist and lexicographer, published his Dictionaire francoislatin, a "mirror-copy" of his Dictionarium latinogallicum of the previous year. Each French word and expression contained in the glosses and equivalents of the Latin-French dictionary had its own headword or sub-headword in the French-Latin ; each Latin word or expression contained in the headwords and examples of the Latin-French occurred as an equivalent to the corresponding French in the French-Latin. Example of the words aboleo and abolir :

    Moving forward four centuries and several decades, we find the first applications of the computer to lexicography in the 1960s and 1970s. In the 1960s the Centre pour un Trésor de la langue française in Nancy started keyboarding representative works of literature and technical treatises to provide source material for its print dictionary, the Dictionnaire de la langue du XIXe et du XXe siècle, commonly known as the Trésor de la langue française or TLF. In the late 1970s there appeared in England two machine-readable dictionaries, the Oxford Advanced Learner's Dictionary and the Longman Dictionary of Contemporary English ; the latter used the computer not only to print off the paper dictionary, but also to help in its writing (Meijs 1991 : 143-5).

    The first early dictionary to be computerized was Jean Nicot's Thresor de la langue françoyse of 1606. The text was keyboarded in Nancy and Toronto between 1979 and 1984, indexed at the University of Toronto with the mainframe concordance program COGS, published in microfiche concordance form in 1985, indexed as a standalone interactive database with WordCruncher in 1988, and put on the World Wide Web in 1994 (see sections 3 and 4). It is not without interest to note that in the early 1980s funding agencies expected concordance projects to undertake lemmatization of text forms. An argument had to be made to demonstrate the absurdity of attempting to lemmatize an already partly lemmatized text : dictionary headwords are lemmas. The Nicot project initially had the ambition of labelling information fields (Wooldridge 1982), until it quickly became obvious that such fields, though present and analysable by the human brain, are impossible to delimit systematically in a complex early dictionary such as Nicot's Thresor, where position, typography and abbreviation are variable, and functional polyvalence is common. The challenge is not negligible even in modern dictionaries, where explicit field-labelling is the norm. Other early dictionaries have since been digitally retroconverted, notably Samuel Johnson's Dictionary of the English Language, published on CD-ROM in 1996.

    The 1980s saw the emergence of large-scale computer-assisted lexicographical enterprises. The COBUILD (Collins and Birmingham University International Language Database) Project started in 1980, with the intention of creating a corpus of contemporary English for the writing of an entirely new dictionary and grammar. The young discipline of corpus linguistics and the COBUILD Project fed off each other in this innovative lexicographical environment (Sinclair 1987, Renouf 1994). The New Oxford English Dictionary Project was formed to produce the second edition of the OED with the aid of computer technology. The project was international in scope : it was conceived and directed in England, the role of the computer was defined and implemented in Canada, and the text was keyboarded in the United States of America. The second edition appeared in print in 1989 and on CD-ROM in 1992.

    Where early electronic tagging of dictionaries was restricted to typographical codes for printing the finished product, it soon became necessary to add information tags so that not only could the text be correctly displayed on screen or paper, but it could also be searched and referenced by fields. The dictionary was just one of the types of text whose structure was analysed by the Text Encoding Initiative (TEI) (Ide et al. 1992).

    The last decade of the twentieth century witnessed the proliferation of electronic dictionaries distributed on CD-ROM. For example, the 1993 edition of the Random House Unabridged Dictionary came both in print and on CD-ROM, the two being sold together for the price of one. As one might expect of a freebie, the functionality of the Random House CD-ROM is rudimentary. On the other hand, the CD-ROM version of the Petit Robert, published in 1996, offers many advantages over the print edition : besides basic look-up of words and their entries, the user can search for anagrams (the search term dome produces dome and mode), homophones (saint produces sain, saint, sein, seing), etymologies by language (families : African, Amerindian, Arabic, etc., or specific idioms : Bantu, Hottentot, Somali, etc.), quotations by author, work or character, plus full-text searches in either the complete dictionary text (full entries), or the particular fields of examples of usage or synonyms and antonyms.
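The anagram and homophone searches just described can be approximated with two simple indexes. The sketch below is not the Petit Robert's implementation, which is not documented here : the toy lexicon, the simplified pronunciation strings and the function names are all assumptions made for illustration. Anagrams share the same letters once sorted ; homophones share the same pronunciation field.

```python
from collections import defaultdict

# Toy lexicon mapping each word to a pronunciation string.
# Assumption: the transcriptions are simplified placeholders, not dictionary data.
LEXICON = {
    "dome": "dom",
    "mode": "mɔd",
    "sain": "sɛ̃",
    "saint": "sɛ̃",
    "sein": "sɛ̃",
    "seing": "sɛ̃",
}


def anagram_key(word):
    """Words that are anagrams of each other share the same sorted letters."""
    return "".join(sorted(word.lower()))


def build_index(lexicon, key_func):
    """Group the words of the lexicon under the key computed for each word."""
    index = defaultdict(list)
    for word in sorted(lexicon):
        index[key_func(word)].append(word)
    return index


ANAGRAM_INDEX = build_index(LEXICON, anagram_key)
HOMOPHONE_INDEX = build_index(LEXICON, lambda word: LEXICON[word])


def anagrams(term):
    """Return every word in the lexicon written with the same letters as term."""
    return ANAGRAM_INDEX.get(anagram_key(term), [])


def homophones(term):
    """Return every word in the lexicon sharing the pronunciation of term."""
    return HOMOPHONE_INDEX.get(LEXICON.get(term), [])
```

With this toy data, `anagrams("dome")` groups dome with mode and `homophones("saint")` groups sain, saint, sein and seing, mirroring the two examples given above. Neither search is practical in a print dictionary, since alphabetical order by headword is the only index it offers.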

    Access to the full text of an electronic dictionary often makes its entries more complete. For example, to take the case of the Petit Robert, sabotage, applied to work, organizations or machinery in the entry for the word, is used in an important – and common – figurative sense in a quotation concerning the word speaker : "Sabotage de la prononciation de notre belle langue par les speakers de la radio". Dictionaries tend to be more conservative in their treatment of a word in its own entry than elsewhere.

    A brief mention should be made of computer-assisted lexicographical tools for the everyday user, the main ones being word-processor spell-checkers and thesauruses.

    A good broad account of the period 1960-early 1990s, that of computer-assisted and computer-driven lexicography resulting in print and CD-ROM dictionaries, is given by Meijs (1991) ; an in-depth one by Knowles (1990).

3. Lexicography in the WWW era

    Like many other human practices, lexicography has been transformed by the World Wide Web. The Web works by virtue of words ; to quote from the title of a well-known book by James Murray's granddaughter (Murray 1977), the Web, like the dictionary, is a "web of words". One reads the words in a book, one looks up headwords in a dictionary, one surfs the Web by keywords. The millions of documents published on the Web constitute, through the structuring of search engine keywords, a vast dictionary, an encyclopedic dictionary of concepts and words. Conventional dictionaries, whether paper or electronic, pale by comparison, though many of the latter are caught in the online Web of words.

    An early demonstration of the Web as super- or meta-dictionary can be found in Wooldridge et al. 1999. A Web search of the Canadian French word enfirouaper (search term : enfirou*) collected occurrences of the verb and derivatives both used and commented on ; the documents were of all types : political and personal, newspaper report and manifesto, poetry and prose, dialogue and dictionary. The occurrences of the word in use showed that dictionary and glossary treatment is narrow and out-of-date (cf. sabotage above). Applying the principles of corpus creation and analysis learned from the COBUILD Project, the WebCorp Project at the University of Liverpool uses standard Web search engines such as Google and AltaVista to collect results from the Web and format them in easily analysable KWIC concordances (Kehoe & Renouf 2002). For example, expressions such as one Ave short of a rosary, two leeks short of a harvest supper or two sheets short of a bog roll, encountered in the novels of Reginald Hill, are individual realizations of the commonly used pattern "one/two/three/a/an/several X short of a Y" (X being constituent parts of the whole Y), which can be expressed in Google or AltaVista by variants of the search term "one * short of a". Since WebCorp is, at least at the time of writing, freely available on the Web, corpus linguistics has become a lexicographical tool for the general public.
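The two ideas in this paragraph, a wildcard search term and a KWIC (keyword in context) display, can be illustrated on a small scale. What follows is a minimal sketch and not WebCorp's code : it searches a local string rather than the Web, it assumes that each `*` stands for exactly one word, and the sample sentence is invented around the three expressions quoted above.

```python
import re

# Invented sample text built around the expressions quoted in the article.
SAMPLE = (
    "People said he was one Ave short of a rosary. "
    "His neighbour was two leeks short of a harvest supper, "
    "and the landlord two sheets short of a bog roll."
)


def wildcard_to_regex(term):
    """Turn a search term such as 'one * short of a' into a regular expression.

    Assumption: each * matches exactly one word; other words match literally,
    ignoring case.
    """
    parts = [r"\w+" if token == "*" else re.escape(token) for token in term.split()]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def kwic(text, term, width=30):
    """Return one concordance line per match, with the match in square brackets."""
    pattern = wildcard_to_regex(term)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the matches line up vertically.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


if __name__ == "__main__":
    for line in kwic(SAMPLE, "* * short of a"):
        print(line)
```

Aligning the matched pattern in a central column is what makes the variable slots X and Y easy to compare by eye, which is the purpose of a KWIC concordance.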

    Meta-sites are a good source of information about online dictionaries. For French, two good ones are Robert Peckham's Leximagne - l'Empereur des pages dico and Carole Netter's ClicNet : Dictionnaires. The latter gives links for the following categories (I translate) : Multilingual dictionaries ; French-language dictionaries and encyclopedias ; Grammar, morphology, orthography and linguistics ; Historical dictionaries ; Dictionaries of Architecture, Visual arts, Slang, Law, Economics and Finance, Gastronomy and Dietetics, History, Humour, Games, Multicultural dictionaries, Literature, Media, Music, Nature and Environment, Science, Political science, Services, Social sciences and Humanities, Sports, Techniques, Tourism, Various vocabularies ; Internet glossaries ; Discussion lists ; Lexical columns ; Other servers.

    Most online dictionaries are fairly modest in scope and are published as straight text, just like a print dictionary. A few, however, can be queried interactively as relational databases, and may offer other features. It is interesting, then, to compare the features of the two main online dictionaries of English and French, the OED and the TLF.

    –– In the late 1990s a first online version of the second edition of the OED became available through the OED Project at the University of Waterloo ; a more generally available edition, the OED Online, was launched on the main OED Web site in 2000. Both versions are accessible by subscription only and allow the following types of search : "lookup" (as in the print version), "entire entry" = full-text search, "etymology", "label" (= field label). Electronic network technology has also made a significant contribution to the OED's reading program : in place of the parcels of paper slips from around the globe delivered by the Post Office to the Oxford Scriptorium of James Murray's day, readers can now send in words, references and details via the Web. The OED Web site gives a detailed history of the dictionary, thus adding to a scholarly value rare on online dictionary sites.

    –– The complete version of the TLFI (Trésor de la langue française informatisé), published on the Web in 2002, is free and, somewhat ambitiously, lets the user limit queries to one or several of 29 fields, including "entrée" (the whole entry), "exemple" (with sub-categories of various types of example), "auteur d'exemple" (examples by author), "date d'exemple" (by date), "code grammatical", "définition", "domaine technique", "synonyme/antonyme". The 16-volume print TLF suffered from a high degree of writing variation (cf. section 1), making field tagging an extremely difficult task and forcing the INaLF team to adopt in part a probabilistic approach in creating the electronic version (Henry 1996).

    A characteristic of the Web is the hyperlink, which facilitates, among other things, the association between text and footnote (intratextual link), that between the bibliographical reference and the library (intertextual link), or that between a word A encountered within the dictionary entry for word B and the entry for word A (e.g. "anaptyxis : epenthesis of a vowel" -> epenthesis). The Dictionnaire universel francophone en ligne (DUF), an important free language resource for speakers and learners of all varieties of French and the online equivalent of the one-volume general language print dictionary to be found in most homes, has hyperlinks for every word contained within its entries allowing the user to refer to the entry of any given word with a single click (e.g. "sabotage n. m. 1. TECH Action de saboter (un pieu, une traverse, etc.)" -> nom, masculin, technique, technologie, technologique, action, de, saboter, un, pieu, traverse, et caetera).
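The word-by-word hyperlinking described for the DUF can be sketched as follows. This is a guess at the general technique and not the DUF's actual mechanism : the base URL is a placeholder, the function name `link_words` is invented, and the abbreviation table is a simplified version of the correspondences listed in the example above (TECH, for instance, is mapped to a single target rather than three).

```python
import html
import re
from urllib.parse import quote

# Hypothetical address scheme: one page per headword.
BASE_URL = "https://example.org/entry/"

# Simplified abbreviation table based on the example quoted in the article.
ABBREVIATIONS = {
    "n.": "nom",
    "m.": "masculin",
    "TECH": "technique",
    "etc.": "et caetera",
}

# A token is a run of letters (with an optional final full stop),
# any other single visible character, or a run of whitespace.
TOKEN = re.compile(r"[^\W\d_]+\.?|\S|\s+")


def link_words(entry_text):
    """Wrap every word of a dictionary entry in a link to that word's own entry."""
    pieces = []
    for token in TOKEN.findall(entry_text):
        if token in ABBREVIATIONS:
            # An abbreviation links to the entry for its expanded form.
            target, label, tail = ABBREVIATIONS[token], token, ""
        elif token[0].isalpha():
            # An ordinary word links to itself; a final full stop stays outside the link.
            label = token.rstrip(".")
            target, tail = label.lower(), token[len(label):]
        else:
            # Punctuation, digits and whitespace are copied through unlinked.
            pieces.append(html.escape(token))
            continue
        href = BASE_URL + quote(target)
        pieces.append(f'<a href="{href}">{html.escape(label)}</a>{tail}')
    return "".join(pieces)
```

Applied to the sabotage entry quoted above, `link_words` links each word to a page named after it and sends n., m., TECH and etc. to their expanded forms. A real system would also need to map inflected forms such as une to their headwords, which this sketch does not attempt.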

    Apart from dictionaries of general contemporary language, there are a large number of marked or specialized ones. In the field of early dictionaries, there are several 16th- to early 20th-century ones freely accessible in database form in the section Dictionnaires d'autrefois on the site of the ARTFL (American and French Research on the Treasury of the French Language) Project at the University of Chicago : Estienne, Nicot, Bayle, Académie française (also on an INaLF server in Nancy). Interactive databases of several of these and others are on a server of the University of Toronto. Terminology, once the preserve of paying specialists, is now freely available on the Web. Examples include a Glossaire typographique et linguistique and a Terminology of Pediatric Mastocytosis ; a pediatric mastocytosis term such as anaphylaxis occurs on tens of thousands of Web pages (69,700 hits with Google on 28 Sept. 2002).

    Web lexicography offers a number of tools, a significant one being automatic translation (e.g. Babelfish) intended to translate the gist of a Web document into a language understood by the user. A good account of the merits and drawbacks of automatic translation on the Web is given by Austermühl (2001).

    Along with the professional team dictionaries, such as the OED, the TLF or the DUF, and the specialized lexicons accessible on the Web, there are also to be found dictionaries and glossaries compiled by amateurs and individuals. If one wants, for example, to explore the Dublin slang encountered in Roddy Doyle's Barrytown Trilogy, the most ready source of online information is the O'Byrne Files.

    A final word should be reserved for recreational lexicography. The word games of the parlour, radio, television, books and the press proliferate on the Web. The OED site proposes "Word of the Day" ; COBUILD has "Idiom of the Day", "The Definitions Game", and "Cobuild Competition". Many sites offer "Hangman" or "Le Jeu du pendu". There are various types of "Crossword" or "Mots croisés", "Anagrams" and "Anagrammes". Online "Scrabble" has interactive play sites and tool box (dictionary) sites.

4. A case study in technological change

    This last section takes a single dictionary-computerization project and looks at the various technological stages it has gone through over the years. The project in question is that concerning Jean Nicot's Thresor de la langue françoyse.

    a) Mechanography. When the present writer set out to analyse Nicot's Thresor in Besançon, the technology of the time used to manipulate textual data involved a BULL mechanographical computer using IBM cards and only capable of handling small, simple corpora such as the plays of Corneille or the poems of Baudelaire. The idea, which occurred in the 1960s, of putting the Thresor into digital, full-text searchable form had to wait for technological advances.

    b) Keyboarding, tape-perforation, and storing on magnetic tape. In 1979, the manual capture of half of the Thresor was begun at the Institut national de la langue française in Nancy, followed in 1980 by the commencement of capture of the other half at the University of Toronto. In Nancy keyboarded data was captured onto paper tape and then transferred to magnetic tape ; Toronto entry was sent directly from a keyboard via a telephone modem to an IBM mainframe computer. Data sent from Nancy to Toronto by mail on magnetic tape was made compatible with the Toronto input through a variety of routines written in various languages including Wylbur.

    c) Concordancing on microfiches. In 1984 the complete unified text was run through the COGS mainframe concordance program written at the University of Toronto. Practically the entire resources of the central computing service of the University of Toronto were reserved for one night to index and concord the approximately 900,000 words of the text of the Thresor. Some of the concordance output was done with Spitbol routines. The thirty magnetic tapes were output commercially to microfiches.

    d) WordCruncher on a standalone. In 1988 the data was transferred via modem and five-and-a-quarter-inch floppy diskettes from the mainframe to a midi-computer and thence to an IBM AT personal computer with a 20 MB hard disk. This time it only took the resources of one small machine to index the full text of the Thresor and create an interactive concordance database.

    e) The World Wide Web. The Thresor was first put online in 1994 as an interactive database at the ARTFL Project of the University of Chicago, the ASCII data files being converted to run under the program PhiloLogic. In 2000 ARTFL's Dictionnaires d'autrefois were installed on a server at the INaLF in Nancy using the Stella interactive database program. In the same year the Thresor was put up as an interactive database on a Windows server at the University of Toronto running under TACTweb, the data files first of all being indexed by TACT on an IBM-compatible. The reference fields – entry headword, page and typeface – derive from the tags entered manually during the first stage of data capture in Nancy and Toronto.


    The most radical effect that the computer has had on lexicography – from dictionaries on hard disk or CD-ROM, through to dictionaries on the World Wide Web and to the Web-as-mega-dictionary – has been to supplement the limited number of paths for information retrieval determined in advance by author and publisher with the infinite number of paths chosen by the dictionary-user. It is now normal for the user to feel in charge of information retrieval, whether it be through access to the full text of a dictionary or the entire reachable resources of the Web. Headwords have been supplanted by keywords.



