The World Wide Web as a linguistic corpus

Russon Wooldridge

University of Toronto

October 2003
(Text prepared for a poster presentation at the CaSTA Symposium, U. of Victoria, November 2003.)
Also published in CHWP, May 2005.




The World Wide Web, the largest electronic text database in existence, makes it possible, for the first time and without costly set-up or the expensive development of project-specific tools, to observe current usage in a number of different languages. As a database, the Web also has an enormous advantage over other language corpora in that it behaves like natural language: it is dynamic, unceasingly renewing itself and thus offering snapshots of the present state of the language, with its proportion of new, established and aging usage.

This paper discusses observations of French made since 1998, covering derivation, polysemy, homophony, syntagmatic variation and the genesis of new concepts and their linguistic naming. The language corpora observed are those of micro-systems – particular words, word families and lexical constructions – and not the macro-system of the language as a whole. The tools used in the observation are keywords and search engines, in particular Google; the study takes advantage of the search engine's highlighted display of keywords in result summaries, in cache copies of online HTML documents and in HTML copies of online documents in PDF or DOC format.

1. The Norman word douet

On the verges of the winding lanes in the Pays d'Auge in Normandy can be seen signs announcing that one is following the "Route des Douets". Before going to the library to consult linguistic atlases or regional glossaries one can search for the word douet in the French-language pages of the Web :

Thus a few seconds of querying yield both linguistic and pragmatic information, the latter more difficult to find in atlases or glossaries.

2. Word Families and derivation

2.1. Enfirouaper and its derivatives

The Canadian verb enfirouaper receives the following treatment in the Dictionnaire québécois d'aujourd'hui (DicoRobert, 1993) :

This is a semantic and onomasiological portrait with just one example of usage. What, however, is the real usage of the word ? In 1998 and 1999 the offerings of the Web included the following :

The word is onomatopoeic, used in verse and prose, uttered by human and animal alike, political, a fighting word, found in a variety of language registers.

The search engine AltaVista, unlike Google, allows one to discover lexical forms through the use of an end-of-word wild card (asterisk). The query enfirou* found, among other forms, the derivatives enfirouapage and enfirouapeux :

One notes the high degree of semantic (shades of meaning, registers) and morphological (derivatives) productivity of the verb enfirouaper.
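The end-of-word wild card can be imitated on any local collection of texts. The following is a minimal sketch, not the search engine's own mechanism: the function name `wildcard_forms` and the three sample sentences are invented for illustration and are not taken from the Web results discussed here.

```python
import re
from collections import Counter

def wildcard_forms(pattern, texts):
    """Count the word forms that match a pattern ending in a '*' wild card."""
    prefix = pattern.rstrip("*")
    # \w* after the escaped prefix plays the role of the trailing asterisk
    regex = re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)
    counts = Counter()
    for text in texts:
        for form in regex.findall(text):
            counts[form.lower()] += 1
    return counts

# Invented sample sentences, for illustration only
sample = [
    "Il s'est fait enfirouaper par un vendeur.",
    "Quel enfirouapage ! Cet enfirouapeux nous a bien eus.",
    "Ne te laisse pas enfirouaper.",
]

forms = wildcard_forms("enfirou*", sample)
for form, n in sorted(forms.items()):
    print(form, n)
```

Grouping the matches by form rather than simply counting hits is what brings derivatives to light: each distinct string found after the common prefix is a candidate member of the word family.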

2.2. The family of the verb scribouiller

We shall take as our starting point a well-known sentence uttered by General de Gaulle :

The Trésor de la langue française (the word is absent from the majority of less extensive dictionaries) has this to say :

Of interest is the second meaning of scribouiller, "écrire sans soin ou sans talent". To this can be added the definitions given by the TLF for scribouilleur "Celui qui scribouille; auteur, écrivain sans talent" and scribouillage "Action de scribouiller; résultat de cette action".

The Web contains de Gaulle's sentence, of course, but especially (hundreds of occurrences) the verb scribouiller used in a colloquial, attenuative register in online chats. For example :

We shall look now at the noun scribouilleur, or rather scribouilleur, scribouilleuse, and scribouilleux. In June 2003, the Web contained – i.e. Google found – 264 occurrences of scribouilleur(s), 987 of scribouilleuse(s) and 7 of scribouilleux.
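The relative weight of the three forms follows directly from these figures. As a small worked check, assuming that scribouilleur(s) and scribouilleux are counted together as the masculine forms (the variable names are ours):

```python
# Occurrence counts reported above for June 2003 (Google)
counts = {"scribouilleur(s)": 264, "scribouilleuse(s)": 987, "scribouilleux": 7}

# Assumption: the two masculine forms are scribouilleur(s) and scribouilleux
masculine = counts["scribouilleur(s)"] + counts["scribouilleux"]
feminine = counts["scribouilleuse(s)"]
ratio = feminine / masculine

print(f"masculine forms combined: {masculine}")   # 271
print(f"feminine / masculine: {ratio:.2f}")       # 3.64
```

The feminine is thus roughly 3.6 times as frequent as the two masculine forms taken together.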

A scribouilleur is, as the TLF states, "celui qui scribouille; auteur, écrivain sans talent" :

On the other hand, a scribouilleux is rather someone who scribouille according to the norms of online chat :

As for la scribouilleuse (the feminine is over three times more frequent than the two forms of the masculine combined), she appears to owe her success to the name of an online journal :

The home page of the Journal de la Scribouilleuse ("Le journal de bord de la Scribouilleuse") contains, among other items, the following "dishes" (the various sections are presented in the form of a menu) :

Other words to be added to the list of members of the lexical family of scribouiller : a) scrib, n. f. Abbreviation of scribouilleuse ; b) scribouille, n. f. Nickname used in interactive narratives :

As for scribouillage (67 occurrences found in June 2003), the majority of occurrences of the singular have the pejorative connotation of the dictionary :

including an echo from the past :

The plural indicates clearly that it is a question of the result of the action of scribouiller rather than the action itself :

The word family of scribouiller thus shows two faces on the Web : on the one hand, in the background, the reflection of the dictionary scribouiller, and on the other, in the foreground, a form which, by its onomatopoeic expressivity portraying a contemporary content, admirably suits the "underground" world – non-academic and especially non-"literary" – of online interactive writing : the world of scribouilleuses, scribouillards, scribouilleux, scribouilles and scribs who scribouillent their scribouillages in scribouillons, scribouillards or scribouill'arts.

3. Syntagmas and paradigms

3.1. Esprit de corps, esprit d'équipe

These expressions are semantically closely related (parasynonyms). The Petit Robert privileges the former, implicitly suggesting that it is more lexicalised, or more frequent, than the latter :

The Dictionnaire des expressions et locutions figurées (Robert, 1979) gives more details :

Esprit d'équipe is not dated, but the dictionary suggests implicitly that it is more recent than esprit de corps. The Web shows clearly the predominance of esprit d'équipe in a present-day world demanding more competitiveness (esprit d'équipe imposed from above) than solidarity (esprit de corps between equals) :

One need look no further than the first results offered by Google to see that esprit d'équipe is a key term/concept :

3.2. Madame la ministre / madame le ministre

A few years ago it was possible to note from a media corpus that contemporary French accepts more and more titles such as Madame la ministre. Without the need to undertake the preparatory work of collecting a corpus of newspaper articles, one can observe at a glance the following distribution in French-language documents on the Web (end of September 2003) :

These raw figures indicate : on the one hand, that the feminine (at least in the context Mme/Madame la ministre) is the more frequent in French-speaking countries in general and in four francophone countries in particular ; and on the other hand, that it has an above-average frequency in Canada, Belgium and Switzerland (almost exclusive in Canada), but below average in France.

3.3. Se moquer de qch/qn comme de sa première chemise

This sort of expression allows variable formulation (Se moquer de X comme de sa/son première/premier Y) : a verb belonging to the semantic field of se moquer de, a noun belonging to that of the first object possessed by the subject of the verb. The queries "comme de * premier" and "comme de * première" produce, among others and with little noise (Google shows them in its display of results – September and December 2002) :

The results obtained on the Web show the high degree of productivity of the expression.
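The two wild-card queries can be approximated over local text with a single regular expression. This is only a sketch under stated assumptions: the pattern, the function name `first_object_variants` and the sample sentences are ours, invented for illustration, and do not reproduce the Google results.

```python
import re

# The single-word wild card is rendered as (\w+); premi(?:er|ère) covers both
# genders; the final group captures the noun Y in "comme de sa/son premier(ère) Y".
PATTERN = re.compile(r"comme de (\w+) (premi(?:er|ère)) (\w+)", re.IGNORECASE)

def first_object_variants(texts):
    """Return (possessive, ordinal, noun) triples found in the texts."""
    found = []
    for text in texts:
        found.extend(PATTERN.findall(text))
    return found

# Invented sample sentences, for illustration only
sample = [
    "Il s'en moque comme de sa première chemise.",
    "Elle s'en fiche comme de son premier biberon.",
    "Rien à signaler ici.",
]

variants = first_object_variants(sample)
for possessive, ordinal, noun in variants:
    print(possessive, ordinal, noun)
```

Listing the distinct nouns captured in the last group gives a direct measure of how many different "first objects" fill the Y slot.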

4. Homophonic, paronymic, morphological confusions

This section is not without interest to the French teacher trying to understand learners' mistakes.

4.1. Sous les meilleurs auspices / sous les meilleurs hospices

The confusion is so common that it is not surprising the Web site of the television station France 2 put up the second version (until a reader pointed out the mistake!). Homophonic confusion is not limited to the last word of the expression :

Not to mention hauspices or ospices.

4.2. Conjecture / conjoncture

The paronymic mistake dans la conjecture actuelle, for dans la conjoncture actuelle, appears to be far less frequent than the homophonic mistake just mentioned :

On the other hand the confusion se perdre en conjectures / conjonctures is less uncommon :

The occurrences of the incorrect form correspond almost exclusively to a real confusion ; there is only one exception in Google :

4.3. Plurals in -x

This is a complex zone, often grey, of French grammar. There are first of all accepted variants :

Google's French-language pages (April 2003) show the following figures :

Then there are the well-known -oux :

These figures, like all raw figures, need to be refined, of course. For example, the surprising figure of 16,400 occurrences of pous includes a certain number of proper nouns and typos (pous instead of pour). We shall add the occurrences of the plural of the adjective chou :

On the Web (Google, French-language pages), we found in April 2003 :

Finally, we shall look at what normative grammar would consider mistakes :

5. Canadian and French usage

5.1. Courriel vs. email

The semantic field of courriel / email is fairly difficult to cover statistically. We shall approach it from two Web sites, one Canadian, the other French :

There is one distinction and several confusions. The Office makes a distinction between the virtual system – for which it recommends courrier électronique and gives two synonyms, including courriel – and its concrete realization – for which it gives, without making a recommendation, courriel, the synonyms courrier électronique and message électronique and the proscription of e-mail. Hence the polysemy of courriel and courrier électronique. In real usage, it is the meanings of "message électronique" and "adresse électronique" that predominate and invoke the simple forms, as can be seen on an address page of l'Ecole Doctorale Chimie et Sciences du Vivant de l'Université Joseph Fourier de Grenoble which contains six distinct forms :

These two sites – the prescriptive Office (langue) and practical Grenoble (discours) – would seem to suggest that the simple form courriel is Canadian, whereas the variants e-mail, email, mail, mel, mél and mèl are French. How can one confirm this impression on the Web ? We shall put aside the form mel, too often a proper noun (Mel Gibson, Mel'cuk, etc.), and take into account the fact that for Google the query e-mail corresponds indifferently to e-mail and email, and that mél corresponds indifferently to mél, mel or mèl. Our observation is limited to explicitly Canadian (.ca) addresses on the one hand, and French (.fr) on the other, in the French-language pages of Google (April 2003).

In short, courriel is much more frequent on Canadian sites, e-mail, email, mél and mèl on French ones. The forms e-mail and email should doubtless be looked at in more detail since they are often encountered in English sequences contained within what Google considers French-language documents.
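The restriction to .ca and .fr addresses amounts to grouping hits by the ending of the host name. A minimal sketch of such a tally follows, assuming one already holds a list of (URL, form) pairs; the function name `tally_by_domain` and the example.* URLs are hypothetical and invented for illustration, not actual results.

```python
from collections import Counter
from urllib.parse import urlparse

def tally_by_domain(hits, domains=(".ca", ".fr")):
    """Count (domain, form) pairs for hits whose host ends in a listed domain."""
    counts = Counter()
    for url, form in hits:
        host = urlparse(url).hostname or ""
        for domain in domains:
            if host.endswith(domain):
                counts[(domain, form)] += 1
    return counts

# Hypothetical hits (URL, form found), invented for illustration only
hits = [
    ("http://www.example.ca/contact.html", "courriel"),
    ("http://www.example.ca/nous-joindre.html", "courriel"),
    ("http://www.example.fr/contact.html", "e-mail"),
    ("http://www.example.fr/annuaire.html", "mél"),
    ("http://www.example.com/contact.html", "courriel"),  # neither .ca nor .fr: ignored
]

counts = tally_by_domain(hits)
for (domain, form), n in sorted(counts.items()):
    print(domain, form, n)
```

Hits on other domains are simply left out of the tally, which mirrors the limitation stated above: only explicitly Canadian or French addresses are counted, so sites hosted under .com, .org and the like escape observation.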

5.2. Pourriel vs. spam

After courriel, pourriel. The latter word, which l'Office québécois de la langue française insists on distinguishing (pointlessly, in our opinion) from polluriel (pourriel = Eng. junk e-mail ; polluriel = Eng. spam), is, like courriel, of Canadian origin. Among the many references in online Quebec media, we shall quote André Forgues :

Forgues had talked about pourriel in 1999 in Cyberpresse ; l'Office québécois de la langue française had already made proposals for the naming of the phenomenon in 1997. The Jargon français notes the origin of the term in 1999 :

What does the Web say in April 2003 ?

In short : a) the plural pourriels is more frequent than the singular pourriel in the global French-language Web – it concerns concrete objects –, whereas on national servers the singular, the general phenomenon, predominates ; b) pourriel(s) is more frequent on the Canadian Web than on the French Web ; c) the anglicism spam is by far the most frequent word in all cases.

6. Genesis of the concept "une autre mondialisation" and its lexical denomination by altermondialisation

With the Larzac gathering preceding the WTO conference in Cancun, the French press of August 2003 made a higher than normal use of the terms altermondialisation, altermondialiste(s) and altermondialisme. For example :

Altermondialisation is the phenomenon ; altermondialisme is the ideology ; the altermondialistes are the followers of the altermondialiste movement. In September 2003, Google's French-language pages gave the following statistics :

To what extent does the Web allow for the dating of the concept and the lexical term ? We shall first of all quote from a Wikipédia article :

The idea of a different world, of a different form of globalization would seem to date back to 1999. For the years 1999 and 2000, several attestations can be found, including :

In September 2003, the following frequencies could be observed for certain key expressions (French-language pages, Google) :

We shall now look at online attestations of altermondialisation, altermondialiste and altermondialisme.

These results, obtained in a few minutes of querying, show that the first attestations of this word family can easily be dated back to the beginning of the year 2002, or even earlier. According to these first data, the word altermondialisme seems to have appeared after the other two, which is logical : altermondialisation, having replaced antimondialisation in order to present a more positive opposition to mondialisation, later became an ideology, l'altermondialisme.

7. Dictionary selectivity and Web extensivity : doudou

To illustrate this point (we have already done so in passing in several of the preceding sections), we shall take as our starting point Richard Desjardins' song Caroline :

What then is a doudou ? We first of all look in a few dictionaries :

Although there is a certain amount of confusion, we can say that, overall, doudou, a feminine noun, is a colloquial, affectionate Caribbean term designating a woman. Is Desjardins, a Quebecer, then telling the young Caroline to leave her dear (Caribbean) woman ? It hardly seems likely.

Let us then look elsewhere, that is to say on the Web. The Swiss singer Henri Dès says on the subject of his song Mon doudou :

We can now propose the hypothesis that one says une doudou in Canada, un doudou in Europe.

To turn to raw statistics (Google, French-language pages, May 2003) :

For the feminine, there are occurrences of both meanings, that of the dictionaries and that of Desjardins and Dès. The second is the one which interests us at present. We shall quote two pages :

Un or une doudou is then a stuffed animal or a piece of cloth chosen by the infant as the inseparable friend who brings comfort. Children's language – a large part of which disappears when the infant reaches school age, to return with parenthood – is far more present on the Web than in dictionaries.


Macro-lexicography is today based on the analysis of large corpora that are costly and essentially static, and thus obsolescent with respect to actual usage, that of the moment at which the dictionary is consulted. The observation of the Web as a corpus of linguistic usages, situated as it is at the level of micro-lexicography (traditionally the domain of lexicology or word studies), has the advantage of being able to renew itself constantly by being based on dynamic corpora.

More detailed analyses of the phenomena presented in this paper can be found in the Net des Études françaises at <> or <> (mirror site).

References (dictionaries only)

Dictionnaire des expressions et locutions figurées, Paris : Robert, 1979.

Dictionnaire québécois d'aujourd'hui, Saint-Laurent : DicoRobert, 1993.

Petit Robert, Paris : Dictionnaires Le Robert, 1993.

Trésor de la langue française informatisé, <>.