November 11, 1995
I attended the founding meeting of TEI at Vassar College in 1987 and served on two of its committees (Literary Studies and the Advisory Committee, co-representing the Modern Language Association). The TEI editors knew the views I express below before they issued the current version of the TEI Guidelines. My objections helped me shape the encoding guidelines of Renaissance Electronic Texts (RET). These are published on the World Wide Web at
What are the principal purposes of SGML? First, it creates general textual markup that enables publishers to translate electronic texts quickly into products. Rather than having to maintain separate programs to convert WordPerfect, Word, and other proprietary word-processing schemes into local typesetting software and house style, a publisher who uses only SGML-encoded texts can devise one translation scheme. Second, SGML serves commercial text-retrieval software searching large databases and the World Wide Web. If everyone uses SGML (or HTML), everyone can search everyone else's texts. Standards are tools for the interchange of texts. For interchange purposes, SGML is essential.
However, the TEI Guidelines are inspired by a greater ambition
than to do what HTML already does very well, that is, online
typesetting, which translates rendition in a text onto the
screen. Encoding, the TEI Guidelines say, is "any means of
making explicit an interpretation of a text." It is "a process
of making explicit what is conjectural or implicit."
Consequently, the TEI tagset embodies interpretation of text on
hundreds of points of logical structure. It erects "content
models" on the dubious assumption "that there is a common core
of textual features shared by virtually all texts and virtually
all serious work on texts" (p. 7). TEI did not heed the advice of
its own Literary Studies subcommittee, chaired by Paul Fortier
and with a membership of scholars including myself. In October
1990 Fortier wrote in his critique of TEI P1:
SGML and TEI assume that all textual features of physical
layout, specific typeface, material font, and script, etc.,
need not be encoded in themselves. Their interpretation,
however, must be encoded. For example, italics should be
tagged according to the purpose the encoder believes that it
has. The TEI Guidelines thus do not encode the most immediate
textual elements that an editor of pre-modern electronic texts
faces:
The five basic attributes that every TEI element must have (pp.
45-47) are id (a unique identifier), n (number or label), lang
(a code from ISO 639, identifying the language of the text
governed by the tag), rend (the way in which the text governed
by the tag is rendered or presented), and TEIform (the standard
TEI name for the tag). Making font or script an attribute of
every other tag -- rather than an element in itself -- causes
problems for editors of early texts. Change in font or script
almost always is significant in itself. It may indicate
emphasis, a proper name, a quoted phrase, an authoritative text
of some sort, a language, or combinations of these things. The
use of spaced letters in a word (called monumental or lapidary
style) also is substantive. For example, the word "GOD" in the
Elizabethan homilies, which I edited as Renaissance Electronic
Texts 1, always appears in lapidary form. However, determining
which significance each element of rendition has asks for too
much speculation.
The second example arises from the TEI "formework" (<fw>) tag, which
is used to describe textual features for which the scribe and
the printer are responsible. The <fw> tag marks many features
of physical books (catchwords, signatures, running titles,
foliation) but does so outside TEI structural tagging. <fw>
tags are incidental and floating. TEI ignores bibliographical
structures such as W. W. Greg, Fredson Bowers, and Thomas
Tanselle have catalogued. All features of a bibliographical
nature fall inside the TEI textual structure of <front>, <body>
and <back> tags, although bibliographical structure subsumes
all textual structures. Diplomatic, old-spelling editions of
early works thus cannot be TEI-conformant documents unless they
throw out basic TEI divisional tags and the TEI philosophy of
text. Renaissance Electronic Texts accordingly proposes a two-
fold parallel set of structures, one textual (content-
oriented), the other bibliographical (oriented to carrier
materials). See Figure 3(d) for the tagset.
At present, TEI treats sections of a text belonging to the
printer or publisher as if they belonged to the author. The
title-page contains the name and address of those responsible
for making the book into formes and for putting signatures at
the bottom of pages. It labels the physical book. The table
of contents correlates the author's logical structure with
widely separated page numbers, which are the printer's
responsibility. The errata page is also bibliographical, not
textual. It is a list of instructions from the printer:
bibliographical conventions (page and line numbers, often)
matched with changes in text. Scholarly editions recognize that
even a single-manuscript or early single-edition text results
from the efforts of at least two people, a scribe/printer and
an author who may have worked on the text at different times.
The textual structure of any book or manuscript is nested
within a bibliographical one; and the latter has a pervasive
impact on the former. Consequently, a tagging system must be
able to distinguish the work of these different agents.
Multiple responsibility and the attendant uncertainties in
ascertaining its boundaries thus complicate the content-model
textual structures proposed by TEI for a DTD. What happens if
the two systems, bibliographical and textual, conflict, as
where two compositors set different parts of the same work and
use different spelling systems? The two men who set Shake-
speares Sonnets (1609), the forthcoming RET volume edited by
Hardy Cook and myself, each interfered with the manuscript
spelling -- conceivably Shakespeare's -- but did so
differently. The printing house divided pages arbitrarily
between the two compositors, with the result that changes of
responsibility often occur in the middle of poems. Because
poems are a textual division in the TEI model, bibliographical
structure cuts across the textual structure at hard-to-
establish points and affects the nature of the text.
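The conflict just described can be illustrated with a minimal
sketch in Python. It is not part of the RET encoding guidelines.
The names Span, crosses, and crossing_pairs are hypothetical, and
the line ranges are invented for illustration; they are not the
actual compositor stints of the 1609 quarto. The sketch assumes
that each structure is recorded as a range of line numbers over
the same text, and it reports every poem that one compositor's
stint cuts in two.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A labelled stretch of text, measured in line numbers (hypothetical)."""
    label: str
    start: int  # first line covered, inclusive
    end: int    # last line covered, inclusive


def crosses(a: Span, b: Span) -> bool:
    """True when a and b overlap but neither one contains the other."""
    overlap = a.start <= b.end and b.start <= a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


def crossing_pairs(textual: List[Span],
                   bibliographical: List[Span]) -> List[Tuple[str, str]]:
    """List every (textual, bibliographical) pair that cannot nest."""
    return [(t.label, b.label)
            for t in textual
            for b in bibliographical
            if crosses(t, b)]


# Invented data: three fourteen-line poems and two compositor stints.
poems = [Span("poem 1", 1, 14), Span("poem 2", 15, 28), Span("poem 3", 29, 42)]
stints = [Span("compositor A", 1, 20), Span("compositor B", 21, 42)]

conflicts = crossing_pairs(poems, stints)
print(conflicts)
# [('poem 2', 'compositor A'), ('poem 2', 'compositor B')]
```

Any pair the function reports is a place where a single nested
hierarchy would have to split either the poem or the stint.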
SGML does not allow one to retrieve a given string according to
its place in more than one structure. The so-called SGML
"concur" function allows editors to encode two structures in a
single text but not to retrieve strings under the two
simultaneously. If the text itself has two structures (e.g.,
act-scene-line and classical scene-speech) in addition to the
bibliographical structure, the editor cannot even represent the
three structures in an SGML edition. The "concur" function
only allows two structures at once. SGML has a built-in
limitation.
TEI could have acted to remedy this fault by drawing on
humanities scholarship. Then TEI would have recognized that
the front-body-back metaphor obscures the basic structure
endowed by multiple author-agents, the printer/scribe and the
author. The bibliography at the end of the TEI Guidelines does
not refer to any of these scholars or to any of numerous
literary theorists who had spent a lifetime analyzing textual
structures.
This neglect of humanities scholarship is pervasive in the
Guidelines.
For example, the TEI definition of the core tagset includes
basic tags for poems. It asserts that "The fundamental unit of
a verse text is the verse line rather than the paragraph ..."
and uses the following hierarchy to represent poetic structure:
<l> (verse line), nested beneath <lg> (line group), which in
turn is nested under divisional <div>, <div0>, etc. TEI cites
no authorities for this model. Neither does it qualify its
assertion. Some literary critics (Roche 1988: 3) say that the
line is the basic unit of verse (the word "verse" under some
circumstances means "line"). However, most believe that the
metrical foot is the basic unit (Barnet et al. 1960: 90).
Metrical feet nest within the line sometimes, but arguably
within the rhythmic unit as often; and a rhythmic unit may be
part of a line, or may cross over lines in so-called "run-on
lines." It is true that early poetry, especially, is highly
formal, but the basis for its form is not the line. Many
manuscripts, from Beowulf onward, do not use lineation to mark
poetic form. Verse lines are run together seamlessly as if
they were prose. The two TEI tags, and their rationale, do not
do justice to the reality of poems. Beginning in the 19th
century, when fixed metrical form itself begins to disappear as
a part of the model, the TEI model totally collapses. As
Bernard Dupriez says, "Each poem nowadays possesses its own
structure" (1991: 346). If so, developing generalizable
content models is impossible. A unique divisional structure
must be assigned to many modern poems.
See Figure 2 for another illustration of the problem. What is
the correct verse lineation of Lucidus and Dubius? It is not
the manuscript's lineation, encoded with <bkdv3>. I have added
a <plydv3> tag to order text lines differently from the
manuscript/book lineation. Braces determine where this
different verse order breaks from the manuscript lineation:
they are encoded with <xref> and <target> tags. These braces
link rhyming lines. Thus meter is the fundamental unit of
verse structure, not the kind of lineation represented by the
TEI <l> tag, which in fact misleads the reader in this case.
TEI also treats the relations of the core elements for verse to
other tags oddly. Figure 4 lists the elements that the verse
line tag <l> may include, and that may include it. Why does
the <l> tag occur within cast-lists, descriptions of dramatic
settings, and line-groups but not within stanzas or refrains?
A TEI-er would say that a playwright might choose to put verse
in a cast-list and that stanzas are types of textual divisions
and so could be attributes of tags like <div1> and <lg>. Would
a literary historian, however, concede that stanzas or refrains
should invariably be called divisions or line-groups? Metre is
a far more important defining division that makes poetic
structures what they are. What about the poem with only one
stanza, like a limerick? Can a poem with only one thing at the
top level be said to be divided in terms of stanzas? Does it
make any sense to say that an entire poem constitutes a line-
group or that one division comprises everything? Sir Gawain
and the Green Knight has fits, The Faerie Queene has books
and cantos, but what does Beowulf have? Alan Brodeur argued it
had digressions. Are they divisions?
Now consider the things that a TEI verse line tag may contain.
Can it in any world of text familiar to us hold <biblFull>,
which "contains a fully-structured bibliographic citation, in
which all components of the TEI file description are present"
(p. 874)? or <camera>, which "describes a particular camera
angle or viewpoint in a screen play" (p. 882)? or <oRef>, "a
reference to the orthographic form(s) of the headword" (of a
dictionary)? or
Principle 4 also states that "The guidelines should propose
sets of encoding conventions suited for various applications"
(p. 10). TEI admits it does not do so, "since consensus on
suitable conventions for different applications proved elusive;
this remains a goal for future work" (p. 11).
Scholars must revisit the purposes of creating encoded
electronic texts: are they just for text retrieval? If so, I
think we are wasting much of our time, because although
researchers need tools to find references, they are not
themselves working on text-retrieval systems. They are trying to
discover new things about texts. And so the question has to be
put, what new things can we do with electronic texts, and then
what kind of tagging do we need to accomplish those things?
Analysis, not retrieval, is the important issue.
Many at this conference have already spent years thinking about
these issues. For these people, more practical issues must
dominate.
First, we do not yet have an inventory of codes for early
characters. Only about a third of the non-ASCII characters RET
editions need so far have ISO entity names. Ideally we should
have the choice of not looking at either codes or standard
displays of these characters -- we might want to see images of
the originals -- but we will still need some way of classifying
what we see. We need to define logically the early character
set, or we are truly building electronic texts on sand. We have
to be able to name characters so that they can be retrieved and
discussed. By developing and insisting upon our own
specialized character set, preferably keyed to images of the
characters we are describing, we can exert pressure on
publishers to deal with the issue of text representation.
Second, we need a basic tagging grammar and tagset for
representing features of what the TEI Guidelines call the
carrier materials. The World Wide Web uses a DTD to handle
basic on-line typesetting. We need an extension of HTML to
name all the literal textual phenomena we see in early books
and then to describe the relations of those parts. After all,
what else do we have to build an interpretation of early texts
on but an accurate representation of the features of the early
manuscript and print culture? Defining even apparently simple
things, such as the hierarchy of pages within sheets within
gatherings, is not easy. As we know, pages from the same side
of the sheet (or from the same form) alternate when the sheet
is folded in text order. How does one tag the form so that the
appropriate pages all nest within the proper form? Try it.
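A minimal sketch in Python shows why the nesting is awkward. It
assumes a gathering made from one sheet in regular folio, quarto,
or octavo imposition, with pages numbered from 1 within the
gathering; other impositions would need a different rule. The
function names are hypothetical.

```python
from typing import Dict, List


def forme_of(page: int) -> str:
    """Name the forme that prints a page, under the stated assumptions."""
    if page < 1:
        raise ValueError("pages are numbered from 1 within the gathering")
    # The recto of a leaf and its verso lie on opposite sides of the sheet,
    # so the pattern outer, inner, inner, outer repeats every four pages.
    return "outer" if page % 4 in (0, 1) else "inner"


def pages_by_forme(leaves: int) -> Dict[str, List[int]]:
    """Group the pages of a gathering of the given number of leaves by forme."""
    formes: Dict[str, List[int]] = {"outer": [], "inner": []}
    for page in range(1, 2 * leaves + 1):
        formes[forme_of(page)].append(page)
    return formes


def is_contiguous(pages: List[int]) -> bool:
    """True when the pages form one unbroken run in reading order."""
    return bool(pages) and pages == list(range(pages[0], pages[-1] + 1))


quarto = pages_by_forme(4)  # a quarto gathering has four leaves, eight pages
print(quarto)
# {'outer': [1, 4, 5, 8], 'inner': [2, 3, 6, 7]}
print(is_contiguous(quarto["outer"]))
# False
```

Because neither forme holds an unbroken run of pages, a tag that
opens at the first page of a forme and closes at its last would
also enclose pages printed from the other forme.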
SGML is a text-translation tool. It exists to turn texts
encoded by scholars for their own research purposes into an
interchange format suitable for publication, whether in printed
books or during on-line retrieval. This format must be able to
handle all character sets and all literal, uninterpreted
textual phenomena. TEI did not create such a format, but it
did point us in the right direction. The success of HTML shows
that SGML document-type definitions can be made that defy the
original intentions of SGML itself in respect of procedural
tagging. I hope that this conference will take up again the
spirit of the 1987 Vassar College meeting to create a
metalanguage and encoding method faithful to humanities texts
so that, in due course, an interchange format that meets those
scholarly objectives can be devised.
My perspective is that coding (inputting or
converting text) is not the same as interpreting.
Descriptive coding as presented in the Guidelines
is squarely in the domain of interpretation.
Scholars do not want interpreted texts; they expect
to do that job themselves. When possible scholars
hire assistants to input texts, and do not expect
these assistants to do the interpretation. This
whole aspect needs to be brought into conformity
with scholarly practice, otherwise the TEI
standards will not be respected.
SGML was devised for technical writers to encode their texts
with their own interpretation of the parts. The author of a text
interprets readily, but editors of other people's texts cannot.
The interpretation of old texts, in particular, asks for a
skeptical mind and a careful critical vocabulary. Thus a
hidden assumption of SGML and TEI is that the encoder has the
authoritative knowledge of the author. The TEI's own Literary
Studies committee -- and a survey of humanities scholars then
on-line used by it -- disagreed, but our objections ran counter
to a principle, that tags should not describe displayable
textual features.
We repeat the advice given at the beginning of this
chapter, that these recommendations are not
intended to meet every transcriptional circumstance
ever likely to be faced by any scholar. They are
intended rather as a base to enable encoding of the
most common phenomena found in the course of
scholarly transcription of primary source
materials. These guidelines particularly do not
address the encoding of physical description of
textual witnesses: the materials of the carrier,
the medium of the inscribing implement, the layout
of the inscription upon the material, the
organisation of the carrier materials themselves
(as quiring, collation, etc.), authorial
instructions or scribal markup, etc. (p. 557)
Some "carrier materials" were recognized, but only in ways that
limit their usefulness in the study of early texts. I will
give two examples.
5. The Trouble with TEI 1: Where did all the Scholars Go?
SGML elements and their attributes belong to a metalanguage,
like the markup in any tagset. They are words about words,
that is, metawords, signs pointing to signs, and thus depend on
a theory of the meaning of text. The theory accepted by TEI is
anachronistic when applied to early texts. For example, SGML
structural tags like <front>, <body>, and <back> (Charles F.
Goldfarb uses these in his definitive manual to SGML) impose a
visual, anthropomorphic metaphor. It is like a person viewed
from the front, from inside, and from the back. Yet does any
medievalist believe texts necessarily or generally have this
structure? Do texts have three parts, comparable to the
printed book's title-page and preliminaries, the things pointed
to by the table of contents, and the closing pages, which may
contain an advertisement or a postscript?
, which is exactly what it seems to be,
"text displayed in tabular form, in rows and columns" (p.
1175)? In what textual universe does it make sense to define a
verse line as potentially containing a one-row table? Now look
at what verse-lines do not contain. There is no tag for
metrical foot. The only provision that makes it possible to
encode meter is a general tag to be placed in the TEI header of
a document. Surely it should be clear that the TEI Guidelines,
developed by three organizations, two of which were about
computing in the humanities, misinterpret verse structure.
6. The Trouble with TEI 2: A Forgotten Mandate
Section 1.3 of the Guidelines describes the historical
background of the TEI. Nine principles were agreed on by
thirty people who met to plan the TEI in November 1987, myself
among them. The authors admit readily that TEI did not achieve
a number of the important ends of that conference. Principle
3, in particular, was not achieved. It was that
The guidelines should define a recommended syntax for the
format, define a metalanguage for the description of
text-encoding schemes, describe the new format and
representative existing schemes both in that metalanguage
and in prose. (p. 10)
TEI did not do these things. "The only metalanguage used ...
is that of SGML, and no formal definitions are given of other
common encoding schemes" (p. 11). TEI took over what ISO had
already developed for technical writing and publishing and
tried to accommodate scholarly needs within its framework. TEI
also ignored existing encoding schemes, although they had been
developed ground-up by humanities researchers since the late
1960s (e.g., the encoding of the TLG, ARTFL, ADMYTE, COCOA, the
Helsinki and London-Lund corpora, shareware like TACT, and
commercial software such as Oxford Concordance Program and
WordCruncher). TEI was administered by computational
linguists, computer scientists, and computer professionals.
Their goal appeared to shift from developing an interchange
format for humanities texts, based on their features, to
imposing an existing encoding format, SGML, on the field.
7. Where do we Go from here?
Editors should begin by resolving to re-assess the tagging of
electronic texts from first principles, without necessarily
being constrained by any known system, SGML, or TEI, or RET.
The different tasks of tags should be affirmed. They certainly
can instantiate the editorial apparatus of a new medium (and
hence are interpretative), but they also can describe literal
and visible features of text. The scholarly community should
assert its traditional academic freedom in the tagging of
texts. No one editorial standard has emerged after 3,000 years
of work, and I am skeptical that one will emerge. A scholar
should have the right to publish texts encoded in the way he or
she believes best suits the purpose of the analysis.