Early Books, RET Encoding Guidelines, and the Trouble with SGML

Ian Lancashire
University of Toronto

November 11, 1995

1. Introduction

Standard Generalized Markup Language (SGML) encodes medieval and Renaissance manuscripts and printed books only with difficulty. This computer language is an ISO standard, but one honoured more in the breach than in the observance. In this paper I argue that the humanities should follow the originators of the World Wide Web, who made HTML (Hypertext Markup Language), an encoding standard using SGML syntax but serving purposes alien to the intentions of SGML's creators. The Text Encoding Initiative (TEI) SGML document-type definition is unusable for my kind of scholarly editing. However, the TEI Guidelines are an excellent discussion of tagging principles and practice, and their system of over 400 tags is the starting point for anyone interested in text encoding.

I attended the founding meeting of TEI at Vassar College in 1987 and served on two of its committees (Literary Studies and the Advisory Committee, co-representing the Modern Language Association). The TEI editors knew the views I express below before they issued the current version of the TEI Guidelines. My objections helped me shape the encoding guidelines of Renaissance Electronic Texts (RET). These are published on the World Wide Web at

These give practical advice and examples for applying two encoding schemes, SGML and COCOA, to many textual situations. Text-Analysis Computing Tools (TACT) adopts the second scheme, COCOA, from the Oxford Concordance Program. I use SGML to encode Representative Poetry, a textbook edited by members of the Department of English at Toronto and published by the University of Toronto Press from 1912 to 1967, and two volumes of RET, the 1623 edition of the Elizabethan homilies and the 1609 quarto of Shakespeare's sonnets. I use COCOA to encode Representative Poetry and many other English literature texts forthcoming in the Modern Language Association's TACT manual. These texts run from Beowulf to H. G. Wells' The Time Machine.

2. The Trouble with SGML 1: Users Looking for Tools

Many routine difficulties arise in using SGML (Standard Generalized Markup Language) today. The manuals and books that explain its syntax are generally written for technical experts. Michael Sperberg-McQueen, co-editor of the TEI Guidelines, describes SGML as "a formal computer language for representing text in electronic form, defined by International Standard ISO 8879" (Sperberg-McQueen 1995: 248). Every SGML document must be accompanied by a document-type definition (DTD), a structured data-file identifying tags and their relationships, but SGML itself has no tags: users must invent all of them. The TEI Guidelines create a wonderful tagset but run to well over 1000 pages; the CD-ROM edition is essential for navigating them. TEI also requires users to adopt the TEI2.DTD document-type definition (p. 49) or else modify it, a technically interesting task, comparable to revising someone else's software. Although SGML is widely advertised as software-independent, every SGML document must be verified by an SGML parser, a program that is not always easy to use. James K. Tauber's article, "Abandon all hope, ye who enter," describes his troubles with SGML and TEI along the road to a modestly successful conclusion.
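The point that SGML supplies no tags of its own can be seen in miniature. A document as small as the following needs a user-written DTD before any parser will accept it (the element names and content models here are invented for illustration, not drawn from TEI2.DTD):

```sgml
<!-- A minimal, hypothetical DTD: SGML declares how tags relate,
     but every tag must first be invented by the user. -->
<!DOCTYPE poem [
  <!ELEMENT poem   - - (title, stanza+)>
  <!ELEMENT title  - - (#PCDATA)>
  <!ELEMENT stanza - - (line+)>
  <!ELEMENT line   - O (#PCDATA)>
]>
<poem>
<title>Example</title>
<stanza>
<line>First line of verse
<line>Second line of verse
</stanza>
</poem>
```

Even this toy document must pass through a validating parser before it counts as SGML at all.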

3. The Trouble with SGML 2: Unusable Character Sets

SGML and TEI use entity references for characters not in ASCII. Such entities derive from official ISO character sets, but those sets lack many characters found in pre-modern texts. For example, consider Figure 2, my sample edited text of the start of the late medieval interlude Lucidus and Dubius. Four of the seven special characters used on this page have no ISO entity references. Further, it is frustrating to coin entity references for these characters, because no SGML software will be able to display them. Thus if we use undisplayable entity references, no one will be able to read the texts, and the entities interfere with word-recognition. For this reason, RET encloses non-ASCII characters within braces and special marks of abbreviation (brevigraphs) within vertical bars. The RET guidelines offer a table of special characters and their codes. Neither braces nor vertical bars interfere seriously with reading the edited texts. See Figure 3 (b-c). We need specially-created software for this purpose, but it is unlikely that the ISO will create a special character set for medieval texts, and thus that SGML software will develop techniques to display obsolete characters.
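The contrast can be sketched as follows. The RET codes shown are illustrative stand-ins for the published table (Figure 3 gives the actual codes); only the thorn entity is a genuine ISO Latin-1 entity:

```sgml
<!-- An ISO entity reference (thorn is in the ISO Latin 1 set): -->
&thorn;e kyng wyth his men

<!-- The same text in RET style: the non-ASCII character in braces,
     a brevigraph (here, an abbreviation for "with") in vertical
     bars. Codes are illustrative. -->
{th}e kyng w|t| his men
```

A reader can still recognize the words in the second line even when no software renders the special characters; the first line depends entirely on a display the parser cannot provide.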

4. The Trouble with SGML 3: Where did the Carrier Materials Go?

I believe that SGML and TEI make anachronistic assumptions about text that fly in the face of the cumulative scholarship of the humanities over many centuries.

What are the principal purposes of SGML? First, it creates general textual markup that enables publishers to translate electronic texts quickly into products. Rather than having to maintain separate programs to convert WordPerfect, Word, and other proprietary word-processing schemes into local typesetting software and house style, a publisher who uses only SGML-encoded texts can devise one translation scheme. Second, SGML serves commercial text-retrieval software searching large databases and the World Wide Web. If everyone uses SGML (or HTML), everyone can search everyone else's texts. Standards are tools for the interchange of texts. For interchange purposes, SGML is essential.

However, the TEI Guidelines are inspired by a greater ambition than to do what HTML already does very well, that is, online typesetting, which translates rendition in a text onto the screen. Encoding, the TEI Guidelines say, is "any means of making explicit an interpretation of a text." It is "a process of making explicit what is conjectural or implicit." Consequently, the TEI tagset embodies interpretation of text on hundreds of points of logical structure. It erects "content models" on the dubious assumption "that there is a common core of textual features shared by virtually all texts and virtually all serious work on texts" (7). TEI did not heed the advice of its own Literary Studies subcommittee, chaired by Paul Fortier and with a membership of scholars including myself. In October 1990 Fortier wrote in his critique of TEI P1:

   My perspective is that coding (inputting or converting text) is
   not the same as interpreting. Descriptive coding as presented in
   the Guidelines is squarely in the domain of interpretation.
   Scholars do not want interpreted texts; they expect to do that
   job themselves. When possible scholars hire assistants to input
   texts, and do not expect these assistants to do the
   interpretation. This whole aspect needs to be brought into
   conformity with scholarly practice, otherwise the TEI standards
   will not be respected.

SGML was devised for technical writers to encode texts with their own interpretation of its parts. The author of a text interprets readily, but editors of other people's texts cannot. The interpretation of old texts, in particular, asks for a skeptical mind and a careful critical vocabulary. Thus a hidden assumption of SGML and TEI is that the encoder has the authoritative knowledge of the author. The TEI's own Literary Studies committee -- and a survey it made of humanities scholars then on-line -- disagreed, but our objections ran counter to a principle, that tags should not describe displayable textual features.

SGML and TEI assume that textual features of physical layout, specific typeface, material font, script, and the like need not be encoded in themselves; their interpretation, however, must be encoded. For example, italics should be tagged according to the purpose the encoder believes they serve. The TEI Guidelines thus do not encode the most immediate textual elements that an editor of pre-modern electronic texts faces:

   We repeat the advice given at the beginning of this chapter,
   that these recommendations are not intended to meet every
   transcriptional circumstance ever likely to be faced by any
   scholar. They are intended rather as a base to enable encoding
   of the most common phenomena found in the course of scholarly
   transcription of primary source materials. These guidelines
   particularly do not address the encoding of physical description
   of textual witnesses: the materials of the carrier, the medium
   of the inscribing implement, the layout of the inscription upon
   the material, the organisation of the carrier materials
   themselves (as quiring, collation, etc.), authorial instructions
   or scribal markup, etc. (p. 557)

Some "carrier materials" were recognized, but only in ways that limit their usefulness in the study of early texts. I will give two examples.

The five basic attributes that every TEI element must have (pp. 45-47) are id (a unique identifier), n (number or label), lang (a code from ISO 639, identifying the language of the text governed by the tag), rend (the way in which the text governed by the tag is rendered or presented), and TEIform (the standard TEI name for the tag). Making font or script an attribute of every other tag -- rather than an element in itself -- causes problems for editors of early texts. Change in font or script almost always is significant in itself. It may indicate emphasis, a proper name, a quoted phrase, an authoritative text of some sort, a language, or combinations of these things. The use of spaced letters in a word (called monumental or lapidary style) also is substantive. For example, the word "GOD" in the Elizabethan homilies, which I edited as Renaissance Electronic Texts 1, always appears in lapidary form. However, determining which significance each element of rendition has asks for too much speculation.
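The difficulty can be illustrated with TEI's own global rend attribute (the alternative element names in the second half of the sketch are hypothetical, not TEI or RET tags):

```sgml
<!-- TEI practice: rendition is an attribute, and the element name
     commits the encoder to an interpretation of it -->
<hi rend="italic">Deus</hi>          <!-- emphasis? quotation? Latin? -->
<name rend="lapidary">GOD</name>     <!-- spaced capitals, as in the
                                          Elizabethan homilies -->

<!-- The alternative argued here: record the rendition itself as an
     element (names hypothetical) and leave its significance open -->
<italic>Deus</italic>
<lapidary>GOD</lapidary>
```

In the first pair the encoder must decide, word by word, why the compositor changed font; in the second the encoder records only what is visibly on the page.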

The second example arises from the TEI "formework" (<fw>) tag, which is used to describe textual features for which the scribe and the printer are responsible. The <fw> tag marks many features of physical books (catchwords, signatures, running titles, foliation) but does so outside TEI structural tagging: <fw> tags are incidental and floating. TEI ignores bibliographical structures such as those W. W. Greg, Fredson Bowers, and Thomas Tanselle have catalogued. All features of a bibliographical nature fall inside the TEI textual structure of <front>, <body> and <back> tags, although bibliographical structure subsumes all textual structures. Diplomatic, old-spelling editions of early works thus cannot be TEI-conformant documents unless they throw out basic TEI divisional tags and the TEI philosophy of text. Renaissance Electronic Texts accordingly proposes a twofold parallel set of structures, one textual (content-oriented), the other bibliographical (oriented to carrier materials). See Figure 3(d) for the tagset.
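Figure 3(d) gives the full tagset; the sketch below, with illustrative attribute values and nesting, shows the idea of running the two hierarchies side by side rather than forcing one inside the other:

```sgml
<!-- Bibliographical structure: the carrier -->
<bkdv1 n="B">        <!-- gathering -->
<bkdv2 n="B1r">      <!-- page -->
<bkdv3 n="1">        <!-- printed or written line -->

<!-- Textual structure, encoded alongside it -->
<plydv3 n="1">       <!-- the verse line as the editor orders it -->
```

Neither hierarchy is subordinated to the other, so a string can be located by its place on the page as well as by its place in the work.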

At present, TEI treats sections of a text belonging to the printer or publisher as if they belonged to the author. The title-page contains the name and address of those responsible for making the book into formes and for putting signatures at the bottom of pages. It labels the physical book. The table of contents correlates the author's logical structure with widely separated page numbers, which are the printer's responsibility. The errata page is also bibliographical, not textual. It is a list of instructions from the printer: bibliographical conventions (page and line numbers, often) matched with changes in text. Scholarly editions recognize that even a single-manuscript or early single-edition text results from the efforts of at least two people, a scribe/printer and an author who may have worked on the text at different times. The textual structure of any book or manuscript is nested within a bibliographical one; and the latter has a pervasive impact on the former. Consequently, a tagging system must be able to distinguish the work of these different agents.

Multiple responsibility and the attendant uncertainties in ascertaining its boundaries thus complicate the content-model textual structures proposed by TEI for a DTD. What happens if the two systems, bibliographical and textual, conflict, as where two compositors set different parts of the same work and use different spelling systems? The two men who set Shake-speares Sonnets (1609), the forthcoming RET volume edited by Hardy Cook and myself, each interfered with the manuscript spelling -- conceivably Shakespeare's -- but did so differently. The printing house divided pages arbitrarily between the two compositors, with the result that changes of responsibility often occur in the middle of poems. Because poems are a textual division in the TEI model, bibliographical structure cuts across the textual structure at hard-to-establish points and affects the nature of the text.

SGML does not allow one to retrieve a given string according to its place in more than one structure. The so-called SGML "concur" function allows editors to encode two structures in a single text but not to retrieve strings under the two simultaneously. If the text itself has two structures (e.g., act-scene-line and classical scene-speech) in addition to the bibliographical structure, the editor cannot even represent the three structures in an SGML edition. The "concur" function only allows two structures at once. SGML has a built-in limitation.
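A sketch of what CONCUR looks like in practice may make the limitation concrete. The document-type names and text below are illustrative; each tag is prefixed with the name of the structure it belongs to:

```sgml
<!DOCTYPE text SYSTEM "textual.dtd">
<!DOCTYPE book SYSTEM "biblio.dtd">

<(text)sonnet n="1">
<(book)page sig="B1r">
<(text)l>From fairest creatures we desire increase,</(text)l>
</(book)page>
</(text)sonnet>
```

Two structures can thus coexist in one file, but an editor who also needs act-scene-line and classical scene-speech divisions has nowhere to put the third.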

5. The Trouble with TEI 1: Where did all the Scholars Go?

SGML elements and their attributes belong to a metalanguage, like the markup in any tagset. They are words about words, that is, metawords, signs pointing to signs, and thus depend on a theory of the meaning of text. The theory accepted by TEI is anachronistic when applied to early texts. For example, SGML structural tags like <front>, <body>, and <back> (Charles F. Goldfarb uses these in his definitive manual to SGML) impose a visual, anthropomorphic metaphor. It is like a person viewed from the front, from inside, and from the back. Yet does any medievalist believe texts necessarily or generally have this structure? Do texts have three parts, comparable to the printed book's title-page and preliminaries, the things pointed to by the table of contents, and the closing pages, which may contain an advertisement or a postscript?

TEI could have acted to remedy this fault by drawing on humanities scholarship. Then TEI would have recognized that the front-body-back metaphor obscures the basic structure endowed by multiple author-agents, the printer/scribe and the author. The bibliography at the end of the TEI Guidelines does not refer to any of these scholars or to any of numerous literary theorists who had spent a lifetime analyzing textual structures.

This neglect of humanities scholarship is pervasive in the Guidelines.

For example, the TEI definition of the core tagset includes basic tags for poems. It asserts that "The fundamental unit of a verse text is the verse line rather than the paragraph ..." and uses the following hierarchy to represent poetic structure: <l> (verse line), nested beneath <lg> (line group), which in turn is nested under divisional <div>, <div0>, etc. TEI cites no authorities for this model. Neither does it qualify its assertion. Some literary critics (Roche 1988: 3) say that the line is the basic unit of verse (the word "verse" under some circumstances means "line"). However, most believe that the metrical foot is the basic unit (Barnet et al. 1960: 90). Metrical feet nest within the line sometimes, but arguably within the rhythmic unit as often; and a rhythmic unit may be part of a line, or may cross over lines in so-called "run-on lines." It is true that early poetry, especially, is highly formal, but the basis for its form is not the line. Many manuscripts, from Beowulf onward, do not use lineation to mark poetic form. Verse lines are run together seamlessly as if they were prose. The two TEI tags, and their rationale, do not do justice to the reality of poems. Beginning in the 19th century, when fixed metrical form itself begins to disappear, the TEI model collapses altogether. As Bernard Dupriez says, "Each poem nowadays possesses its own structure" (1991: 346). If so, developing generalizable content models is impossible. A unique divisional structure must be assigned to many modern poems.
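The hierarchy in question looks like this in practice (the text is illustrative):

```sgml
<div1 type="book">
 <lg type="stanza">
  <l>Whan that Aprill with his shoures soote</l>
  <l>The droghte of March hath perced to the roote</l>
 </lg>
</div1>
<!-- Nothing is declared below <l>: there is no element for the
     metrical foot, nor for a rhythmic unit that runs over the
     line-end. -->
```

The model stops at the line, precisely where most prosodic analysis begins.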

See Figure 2 for another illustration of the problem. What is the correct verse lineation of Lucidus and Dubius? It is not the manuscript's lineation, encoded with <bkdv3>. I have added a <plydv3> tag to order text lines differently from the manuscript/book lineation. Braces determine where this different verse order breaks from the manuscript lineation: they are encoded with <xref> and <target> tags. These braces link rhyming lines. Thus meter is the fundamental unit of verse structure, not the kind of lineation represented by the TEI <l> tag, which in fact misleads the reader in this case.

TEI also treats the relations of the core elements for verse to other tags oddly. Figure 4 lists the elements that the verse line tag <l> may include, and that may include it. Why does the <l> tag occur within cast-lists, descriptions of dramatic settings, and line-groups but not within stanzas or refrains? A TEI-er would say that a playwright might choose to put verse in a cast-list and that stanzas are types of textual divisions and so could be attributes of tags like <div1> and <lg>. Would a literary historian, however, concede that stanzas or refrains should invariably be called divisions or line-groups? Metre does far more than division to make poetic structures what they are. What about the poem with only one stanza, like a limerick? Can a poem with only one thing at the top level be said to be divided in terms of stanzas? Does it make any sense to say that an entire poem constitutes a line-group or that one division comprises everything? Sir Gawain and the Green Knight has fits, The Faerie Queene has books and cantos, but what does Beowulf have? Alan Brodeur argued it had digressions. Are they divisions?

Now consider the things that a TEI verse line tag may contain. Can it in any world of text familiar to us hold <biblFull>, which "contains a fully-structured bibliographic citation, in which all components of the TEI file description are present" (p. 874)? or <camera>, which "describes a particular camera angle or viewpoint in a screen play" (p. 882)? or <oRef>, "a reference to the orthographic form(s) of the headword" (of a dictionary)? or <table>, which is exactly what it seems to be, "text displayed in tabular form, in rows and columns" (p. 1175)? In what textual universe does it make sense to define a verse line as potentially containing a one-row table? Now look at what verse-lines do not contain. There is no tag for metrical foot. The only encoding that makes it possible to encode meter is a general tag to be placed in the TEI header to a document. Surely it should be clear that the TEI Guidelines, developed by three organizations, two of which are devoted to computing in the humanities, misinterpret verse structure.

6. The Trouble with TEI 2: A Forgotten Mandate

Section 1.3 of the Guidelines describes the historical background of the TEI. Nine principles were agreed on by thirty people who met to plan the TEI in November 1987, myself among them. The authors readily admit that TEI did not achieve a number of the important ends of that conference. Principle 3, in particular, was not achieved:

   The guidelines should define a recommended syntax for the
   format, define a metalanguage for the description of
   text-encoding schemes, describe the new format and
   representative existing schemes both in that metalanguage and
   in prose. (p. 10)

TEI did not do these things. "The only metalanguage used ... is that of SGML, and no formal definitions are given of other common encoding schemes" (p. 11). TEI took over what ISO had already developed for technical writing and publishing and tried to accommodate scholarly needs within its framework. TEI also ignored existing encoding schemes, although they had been developed from the ground up by humanities researchers since the late 1960s (e.g., the encoding of the TLG, ARTFL, ADMYTE, COCOA, the Helsinki and London-Lund corpora, shareware like TACT, and commercial software such as the Oxford Concordance Program and WordCruncher). TEI was administered by computational linguists, computer scientists, and computer professionals. Their goal appeared to shift from developing an interchange format for humanities texts, based on their features, to imposing an existing encoding format, SGML, on the field.

Principle 4 also states that "The guidelines should propose sets of encoding conventions suited for various applications" (p. 10). TEI admits it does not do so, "since consensus on suitable conventions for different applications proved elusive; this remains a goal for future work" (p. 11).

7. Where do we Go from here?

Editors should begin by resolving to re-assess the tagging of electronic texts from first principles, without necessarily being constrained by any known system, whether SGML, TEI, or RET. The different tasks of tags should be affirmed. They certainly can instantiate the editorial apparatus of a new medium (and hence are interpretative), but they also can describe literal and visible features of text. The scholarly community should assert its traditional academic freedom in the tagging of texts. No one editorial standard has emerged after 3,000 years of work, and I am skeptical that one will emerge. A scholar should have the right to publish texts encoded in the way he or she believes best suits the purpose of the analysis.

Scholars must revisit the purposes of creating encoded electronic texts: are they just for text retrieval? If so, I think we are wasting much of our time, because although researchers need tools to find references, they are not themselves working on text-retrieval systems. They are trying to discover new things about texts. And so the question has to be put, what new things can we do with electronic texts, and then what kind of tagging do we need to accomplish those things? Analysis, not retrieval, is the important issue.

Many at this conference have already spent years thinking about these issues. For these people, more practical issues must dominate.

First, we do not yet have an inventory of codes for early characters. Only about a third of the non-ASCII characters RET editions need so far have ISO entity names. Ideally we should have the choice of not looking at either codes or standard displays of these characters -- we might want to see images of the originals -- but we will still need some way of classifying what we see. We need to define logically the early character set, or we are truly building electronic texts on sand. We have to be able to name characters so that they can be retrieved and discussed. By developing and insisting upon our own specialized character set, preferably keyed to images of the characters we are describing, we can exert pressure on publishers to deal with the issue of text representation.

Second, we need a basic tagging grammar and tagset for representing features of what the TEI Guidelines call the carrier materials. The World Wide Web uses a DTD to handle basic on-line typesetting. We need an extension of HTML to name all the literal textual phenomena we see in early books and then to describe the relations of those parts. After all, what else do we have to build an interpretation of early texts on but an accurate representation of the features of the early manuscript and print culture? Defining even apparently simple things, such as the hierarchy of pages within sheets within gatherings, is not easy. As we know, pages from the same side of the sheet (or from the same forme) alternate when the sheet is folded in text order. How does one tag the forme so that the appropriate pages all nest within the proper forme? Try it.
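The difficulty can be made concrete with hypothetical tags. In a quarto gathering, the outer forme carries pages 1r, 2v, 3r, and 4v, and the inner forme carries 1v, 2r, 3v, and 4r; folding interleaves them:

```sgml
<!-- Tag names and attributes are hypothetical -->
<sheet n="B">
  <forme type="outer">
    <page sig="B1r"> <page sig="B2v"> <page sig="B3r"> <page sig="B4v">
  </forme>
  <forme type="inner">
    <page sig="B1v"> <page sig="B2r"> <page sig="B3v"> <page sig="B4r">
  </forme>
</sheet>
```

This nests the pages neatly within their formes, but now B1r and B1v, adjacent in reading order, sit in different subtrees, and the text that flows across them cannot be recovered from either hierarchy alone.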

SGML is a text-translation tool. It exists to turn texts encoded by scholars for their own research purposes into an interchange format suitable for publication, whether in printed books or during on-line retrieval. This format must be able to handle all character sets and all literal, uninterpreted textual phenomena. TEI did not create such a format, but it did point us in the right direction. The success of HTML shows that SGML document-type definitions can be made that defy the original intentions of SGML itself in respect of procedural tagging. I hope that this conference will take up again the spirit of the 1987 Vassar College meeting to create a metalanguage and encoding method faithful to humanities texts so that, in due course, an interchange format that meets those scholarly objectives can be devised.