Metalinguistic Keywords as a Structural Retrieval Tool for Early Dictionaries

T. Russon Wooldridge & Isabelle Leroy-Turcan

University of Toronto & Université de Lyon III

© 1996 R. Wooldridge & I. Leroy-Turcan

Abstract

Early dictionaries do not have a structure that is clear and recurrent enough to allow systematic tagging of information fields. The two concepts of "metalinguistic keywords" and "fuzzy searching" have the advantage of making it easier to query information fields without distorting the text.

Metalinguistic Keywords, Dictionary Information Fields, Fuzzy Structures, Lemmatization

    1. Headwords and sub-headwords
    2. Typeface and information fields
    3. Metalinguistic terms as keywords


Early dictionaries use a diversity of textual structuring systems, both in the macrostructure and the microstructure. In the field of French general dictionaries the most striking case in this regard is probably Jean Nicot's Thresor de la langue françoyse (1606), a dictionary combining monolingual, bilingual and multilingual descriptions of functional linguistic, etymological and encyclopedic information. In the more specialized field of French etymological dictionaries, the first major work, Gilles Ménage's Dictionnaire étymologique, ou Origines de la langue françoise (1694), also combines the typologies of etymological, general language and encyclopedic dictionary. The individual entries of these two lexicons use different content and formal models according to the particular descriptive or analytical objective. Since these models are often only partially realized, the result is a certain amount of structural fuzziness and, from the user's perspective, unpredictability (Wooldridge 1977; Leroy-Turcan 1994).

Thus the database of Nicot's Thresor (TLF), like that of Ménage's Dictionnaire étymologique (DEOLF), which is in progress, is tagged only for page-column, headword, typography and language. In order to give access to information fields -- part of speech, definition, field labels, example, quotation, source, etymology, etc. -- without distorting the text, indexed lists of lemmatized metalinguistic keywords are provided, allowing the retrieval of occurrences of information field markers. For example, the lemma MASCULIN is linked to the contexts in which the lexicographer has indicated -- by «masculin», «m.», «mas.», «masc.» or «mascul.» -- the gender of a masculine noun, adjective or participle.
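The linkage between a lemma and its variant markers can be illustrated by a minimal Python sketch. It is not the project's implementation: the table layout, the function names and the three sample strings are assumptions made for illustration (the sample strings are invented, not quotations from any dictionary). Only the lemma MASCULIN and its five variant forms come from the text.

```python
import re
from collections import defaultdict

# Lemma and variant forms as cited in the text; the table structure is assumed.
VARIANTS = {
    "MASCULIN": ["masculin", "m.", "mas.", "masc.", "mascul."],
}

def build_pattern(forms):
    """Compile one case-sensitive pattern matching any variant form."""
    # Longest forms first, so that "masc." is not read as "mas." or "m.".
    ordered = sorted(forms, key=len, reverse=True)
    body = "|".join(re.escape(form) for form in ordered)
    # No word character may adjoin the form (rules out the "m." in "nom.").
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)")

def index_contexts(entries, variants=VARIANTS):
    """Map each lemma to the (headword, matched form) pairs found."""
    index = defaultdict(list)
    for lemma, forms in variants.items():
        pattern = build_pattern(forms)
        for headword, text in entries:
            for match in pattern.finditer(text):
                index[lemma].append((headword, match.group(0)))
    return dict(index)

# Invented sample strings, for illustration only.
SAMPLE = [
    ("TIMBRE", "TIMBRE. s. m. Sorte de cloche."),
    ("TIMON", "TIMON. subst. masc. Piece de bois."),
    ("TIGE", "TIGE. s. f. Partie de la plante."),
]

index = index_contexts(SAMPLE)
```

Because the text itself is left untouched and only an index points into it, adding a further variant form is a change to the table, not to the dictionary data.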

The international project -- announced at the Institut de France in November 1994 (Wooldridge 1994; cf. Leroy-Turcan & Wooldridge 1995) -- to computerize the eight complete editions of the Dictionnaire de l'Académie française (1694-1935) also has to deal, though to a lesser degree, with structural approximation and variation. Here again, imposing a highly structured information-field metatext on the text would misrepresent it.

The aim of our paper is to give a measure, based on extensive computerized samples, of the efficiency of metalinguistic keyword indices, allied to microstructural position and typographical markers such as roman/italic/bold, upper/lower case and indenting, as a retrieval tool for the Dictionnaire de l'Académie ("Acad") and, by implication, early dictionaries in general. The sample database comprises, for each edition, the entries ÂME, DOUAIRE to DOUZIL, GAGNER, GRAS, GROS, LOIN to LOISIR, QUE, QUEUE, TIGE to TINTOUIN, VOLER. Reference may also be made to other entries to illustrate particular problems. Explicit tagging is done for edition, headword, paragraph, typeface and page-column. All of these except headwords are based on systematic, objective formal criteria. The paper sets out to show that, while tagging as a means of retrieving headwords is preferred to the keyword approach of typographical marking (in this case capitals and position), both typeface and metalinguistic keywords can be used for the location of information fields.

An important concept in this approach to data retrieval is that of "fuzzy searching". Simply put, it means that rather than expending an enormous effort on retrieving 100% of what is sought and nothing more, one can obtain practically the same results with considerably less effort by contenting oneself with a range of 95% to 105% of the theoretical total, the small number of irrelevant occurrences being easy to discard. Fuzzy searching is a particularly appropriate tool for structural approximation (see above).
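The 95%-105% criterion can be expressed as a simple ratio of occurrences retrieved to the theoretical total. The following Python sketch is an illustration under our own assumptions: the function names and the default bounds as parameters are hypothetical, and a single ratio cannot by itself distinguish missed items from irrelevant ones. The figures used are those reported in section 2 for roman capitals.

```python
def retrieval_ratio(retrieved, theoretical_total):
    """Occurrences retrieved, as a fraction of the theoretical total."""
    if theoretical_total <= 0:
        raise ValueError("theoretical_total must be positive")
    return retrieved / theoretical_total

def is_acceptable(retrieved, theoretical_total, low=0.95, high=1.05):
    """True when the ratio falls inside the tolerated fuzzy range."""
    return low <= retrieval_ratio(retrieved, theoretical_total) <= high

# Large roman capitals: 387 sequences for 374 (co-)headwords.
large_ok = is_acceptable(387, 374)

# Small roman capitals: 790 sequences for 726 sub-headwords and sub-addresses.
small_ok = is_acceptable(790, 726)
```

On these figures a search on large roman capitals stays within the range (about 103.5%), whereas a search on small roman capitals alone does not (about 108.8%), which is consistent with the need for additional criteria at the secondary macrostructural level.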

1. Headwords and sub-headwords

The ordering of lexical items in the first edition (1694) is different from that used in the others (1718-1935). In the first edition words are grouped by etymological families; the base words of the various families are given in alphabetical order in the primary macrostructure, whereas the other members of each family are arranged in some sort of logical derivational order. Thus for words in TIM- the primary macrostructure comprises TIMBALE, TIMBRE, TIMIDE, TIMON and TIMPAN, while under TIMIDE are given TIMIDE, TIMIDITÉ, INTIMIDER, TIMIDEMENT and TIMORÉ. (Omitted from consideration in this short paper is the minor question of participial forms, e.g. INTIMIDÉ.) From the second edition on, a strict one-level alphabetical ordering is used, thereby separating INTIMIDER, TIMIDE-TIMIDEMENT-TIMIDITÉ and TIMORÉ.

The formal properties of the items of the alphabetical macrostructure are: large capitals and initial position in a paragraph. To these are added, in some editions, bolding (Acad6-8) and increased preceding line spacing (Acad8). Although large capitals are also used for cross-references in the first edition («TIMPAN. Voy TYMPAN.») and the initial position of a paragraph may be occupied by a variety of objects, the two together are in general sufficient to determine all and only alphabetical headwords. The very occasional exceptions to this rule can be considered as accidents (e.g., QUI VIVE, in the paragraph «QUI VIVE. Voy VIVRE.» placed within the article QUI, functions, like QUICONQUE or QUIDAM, as a sub-headword of QUI).

The secondary macrostructural level, that of sub-headwords, is more problematic. In order to be able to compare the contents of the macrostructure of the first edition with those of the others, it is necessary to tag as headwords in Acad1 those types of subsidiary items that appear in the primary macrostructure in Acad2-8 (e.g. TIMIDITÉ, INTIMIDER, TIMIDEMENT and TIMORÉ in the example given above). These sub-headwords have two formal properties: small capitals and initial position in a paragraph. However, these two properties are frequently shared by what in the dictionary as a whole (Acad1-8), as in the general lexicographical tradition, have to be considered as sub-addresses functioning within the microstructure. The difficulties involved in distinguishing sub-headwords from sub-addresses in Acad1 are many; we shall restrict ourselves here to indicating the problem by considering the first few capitalized paragraph beginnings in the article LONG in Acad1: «LONG, LONGUE. [...] SE FORLONGER. [...] LOIN. [...] AU LOIN. [...] LOIN A LOIN, DE LOIN A LOIN. [...] LOINTAIN, AINE. [...] ELOIGNER. [...]». Applying the rule that the end of the first unit of a capitalized paragraph beginning is marked by the first comma unless it is preceded by a period, we are left with LONG, SE FORLONGER, LOIN, AU LOIN, LOIN A LOIN, LOINTAIN and ELOIGNER. A comparison with Acad2-8 and with general dictionary tradition tells us that LONG, FORLONGER (SE), LOIN, LOINTAIN and ELOIGNER are (sub-)headwords, and that AU LOIN and LOIN A LOIN are sub-addresses of LOIN.
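The segmentation rule and the subsequent interpretative step can be sketched in Python as follows. This is an illustration, not the procedure actually used: the function name is hypothetical, and the hand-made set of sub-addresses merely stands in for the manual comparison with Acad2-8. The paragraph beginnings are those quoted above from the article LONG.

```python
import re

def first_unit(paragraph_beginning):
    """Return the text up to the first comma or period, whichever comes first."""
    match = re.search(r"[,.]", paragraph_beginning)
    end = match.start() if match else len(paragraph_beginning)
    return paragraph_beginning[:end].strip()

# Capitalized paragraph beginnings quoted from the article LONG in Acad1.
BEGINNINGS = [
    "LONG, LONGUE.",
    "SE FORLONGER.",
    "LOIN.",
    "AU LOIN.",
    "LOIN A LOIN, DE LOIN A LOIN.",
    "LOINTAIN, AINE.",
    "ELOIGNER.",
]

# Automatic step: every capitalized paragraph beginning is a candidate.
candidates = [first_unit(beginning) for beginning in BEGINNINGS]

# Manual step, represented here by a hand-made set: items judged to be
# sub-addresses of LOIN lose their headword status.
SUB_ADDRESSES = {"AU LOIN", "LOIN A LOIN"}
headwords = [unit for unit in candidates if unit not in SUB_ADDRESSES]
```

The automatic step yields the seven units listed above; the five (sub-)headwords remain only after the set of sub-addresses has been supplied by hand, which is precisely the part that formal criteria cannot provide.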

To sum up, formal, objective criteria are insufficient for automatic headword and sub-headword tagging of Acad1. A reasonable way to proceed is to tag capitalized paragraph beginnings automatically and then, in a manual, interpretative post-editing pass, to remove the headword tags from sub-addresses.

2. Typeface and information fields

The two main typefaces used in the Dictionnaire de l'Académie are roman and italic. These have different semiotic functions, as also do capitals and lower case. The bolding added to headwords in Acad6-8 increases the consultability of the text but is semiotically redundant. The semiotic system is essentially lower-case roman (unmarked typeface) for the basic textual level of the lexicographer's metalinguistic discourse containing part of speech, usage mark, semantic filiation, definition and the copulas articulating the various linguistic and metalinguistic units within the microstructure; and roman capitals, italics and bold (marked typefaces) for autonyms -- i.e. units of the described object, the language: words, idioms, collocates, examples, synonyms, etc. In this section we shall examine the degree to which typeface, allied to position (the absolute or relative position of an item in the microstructure), can be used to retrieve information fields.

As mentioned above, lower-case roman is used for several information fields: part of speech is usually given immediately after the headword («DOUBLE. adj. de tout genre.»), occasionally elsewhere («Il est aussi subst.»); usage and semantic marks tend to be non-initial in the paragraphs of the discursive texts of the earlier editions («On dit fig. et fam. [...] une cervelle, une teste bien timbrée, mal timbrée» Acad2-5 s.v. TIMBRER), initial in the later ones («Fig. et fam., Une cervelle, une tête timbrée» Acad6-8). Italics are used systematically for examples, irregularly for collocates and synonyms: «GAGNER, se joint quelquefois avec la préposition Sur» (Acad2; Acad1 «sur») vs. «SANS DOUTE, [...] se joint quelquefois avec Que» (id.; Acad1 «que»); «DOUBLON. [...] On dit aussi, Pistole» (Acad6; Acad5 «[...] que nous appelons Pistole») vs. «Ne... que peut, dans certains cas, être considéré comme entièrement synonyme de l'adverbe Seulement» (id. s.v. QUE = Acad7-8).

Bold typeface can be used on its own to retrieve headwords and co-headwords in Acad6-8 (325 sequences in the sample database = 100% of the (co-)headwords). Large roman capitals (387) are used for headwords (Acad1-5) and co-headwords (Acad2-5) in 374 cases (96.64%) and for cross-references in 13 (Acad1, 3.36%). Small roman capitals (790) are highly polysemous: they are used regularly for co-headwords in Acad1 (5 occurrences = 0.63%), sub-headwords and sub-addresses in Acad1-8 (726 = 91.90%), and cross-references in Acad4-8 (55 = 6.96%); their status as a marked typeface explains four idiosyncratic, or irregular, occurrences: a synonym («On dit aussi DUPLICATA» in Acad8 s.v. DOUBLE), a collocate («Il se joint quelquefois avec la préposition SUR» in Acad8 s.v. GAGNER) and an example element («âme rachetée par le sang de JÉSUS-CHRIST» in Acad6-7 s.v. ÂME; cf. Acad5 «[...] JÉSUS-CHRIST», Acad2-4 «[...] Jésus-Christ»).
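The percentages just given can be recomputed from the raw counts. The short Python sketch below does only that arithmetic; the dictionary labels and the function name are ours, and the share of the four irregular occurrences (not stated above) is derived from the same counts.

```python
def shares(counts):
    """Percentage of the total for each category, rounded to two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

# Counts reported for the sample database.
LARGE_CAPITALS = {
    "headwords and co-headwords": 374,
    "cross-references": 13,
}
SMALL_CAPITALS = {
    "co-headwords": 5,
    "sub-headwords and sub-addresses": 726,
    "cross-references": 55,
    "irregular occurrences": 4,
}

large_shares = shares(LARGE_CAPITALS)
small_shares = shares(SMALL_CAPITALS)
```

The four categories of small roman capitals add up to the 790 sequences reported, the irregular occurrences accounting for roughly half of one percent.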

According to the general logic of the dictionary, definitions (metalanguage) are printed in roman; collocates, synonyms and antonyms (language) in italics. When, as is often the case for adjectives and adverbs, the definition is a single word rather than a periphrasis, the distinction between definition and synonym is unclear, with the result that the use of different typefaces can become muddled: «Il signifie aussi, Espais, et est opposé à delié, delicat» (Acad1-7 s.v. GROS), instead of «Il signifie aussi, Espais, et est opposé à delié, delicat» (cf. «DOUBLE [...] Il est opposé à simple» Acad1).

It becomes necessary then to turn to metalinguistic keywords, such as SIGNIFIE, SE JOINT AVEC, ON DIT AUSSI, OPPOSÉ À, etc., for help with retrieving definitions, collocates, synonyms and antonyms.
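A minimal Python sketch of such keyword-based retrieval follows. It is offered as an illustration under stated assumptions, not as the project's indexing method: the regular expressions, the field labels and the function name are ours, and the patterns cover only the four keywords named above. The contexts are quotations given earlier in this paper.

```python
import re

# One assumed pattern per information field, keyed to the keywords above.
FIELD_PATTERNS = {
    "definition": re.compile(r"\bsignifie\b", re.IGNORECASE),
    "collocate": re.compile(r"\bse joint\b[^.]*?\bavec\b", re.IGNORECASE),
    "synonym": re.compile(r"\bon dit aussi\b", re.IGNORECASE),
    "antonym": re.compile(r"\bopposée?s?\s+à\b", re.IGNORECASE),
}

def fields_marked(text):
    """List the information fields whose keyword occurs in the text."""
    return [field for field, pattern in FIELD_PATTERNS.items()
            if pattern.search(text)]

# Contexts quoted in this paper (headword, text).
CONTEXTS = [
    ("GROS", "Il signifie aussi, Espais, et est opposé à delié, delicat"),
    ("GAGNER", "GAGNER, se joint quelquefois avec la préposition Sur"),
    ("DOUBLON", "On dit aussi, Pistole"),
    ("DOUBLE", "Il est opposé à simple"),
]

results = {headword: fields_marked(text) for headword, text in CONTEXTS}
```

In the GROS context the keywords locate both a definition and an antonym without any reliance on typeface, which is the case where the typographical criterion was found to fail.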

