Helsinki Corpus TEI XML Edition Documentation

Ville Marttila

For more information on the original version of the Helsinki Corpus, from which this XML version is derived, please refer to its entry in the Corpus Resource Database (CoRD) at http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/index.html.

The manual for the original version of the Helsinki Corpus is available online at http://khnt.hit.uib.no/icame/manuals/HC/INDEX.HTM.

References

Based on a customized TEI schema and its documentation generated with Roma 3.12 by Ville Marttila.

TEI Consortium (ed.). 2011. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Ed. 1.9.1. Available at http://www.tei-c.org/Guidelines/P5/index.xml. Accessed 23.8.2011.

Kytö, Merja (ed.). 1996. Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts. Helsinki: Department of English, University of Helsinki.

diPaolo Healey, Antonette with John Price Wilkin and Xin Xiang (eds.). 2009. Dictionary of Old English Corpus. Dictionary of Old English Project. Available on CD-ROM and at http://tapor.library.utoronto.ca/doecorpus/.

Introduction

The conversion of the Helsinki Corpus of English Texts (also known simply as the Helsinki Corpus) into TEI-compliant XML was undertaken in late 2010 at the Research Unit for Variation, Contacts and Change in English (VARIENG), the intellectual heir to the original research team that compiled the Helsinki Corpus. The objective was to bring this seminal resource of English historical corpus linguistics into the 21st century and to ensure that it remained usable and could even be improved upon. The conversion project was timed to coincide with the 20th anniversary of the original Helsinki Corpus, published in 1991. The first version of the TEI XML Edition was unveiled at the Helsinki Corpus Festival: The Past, Present, and Future of English Historical Corpora. Organized from the 28th of September to the 1st of October 2011 in Helsinki, the conference celebrated the Helsinki Corpus and the advances made over the last two decades in English historical corpus linguistics.

The TEI XML Edition of the Helsinki Corpus is intended to serve as a prototype and model for the XML conversion of other historical corpora compiled at VARIENG over the last two decades. The present version is not envisioned to be final or definitive in any sense. While this initial version merely replicates the features of the original corpus in a new encoding format, the conversion process has inspired many ideas about possible new features, the addition and implementation of which would enhance the usability and usefulness of the corpus. Given that the format of the original Helsinki Corpus has also influenced various other corpora produced over the last twenty years, the scripts and other tools developed for this conversion could to a significant degree be repurposed for the conversion of several other corpora as well. Consequently, should the required material and human resources become available in the future, the addition of further enhancements to the Helsinki Corpus and the production of new TEI XML editions of other corpora have been made easier and quicker to implement.

1. Structure of the corpus

The TEI XML Edition of the Helsinki Corpus is intended to reproduce the structure of the original Helsinki Corpus in the form of a structured XML document. On a basic ontological level, this means representing what was a series of text files making up a flat 'stream of characters' with an implicit hierarchical structure as an explicitly annotated 'ordered hierarchy of content objects' (OHCO). Accomplishing this is rarely trivial, and in the case of the Helsinki Corpus it turned out to be even less so. Because the collection of texts in question represents a wide variety of different time periods and genres, and the internal structure of the corpus exhibits considerable complexity and variability, the work required considerable attention to detail and sensitivity both to the philological heritage of the corpus and to current TEI guidelines. While the multi-leveled hierarchical structure of the corpus is well-suited for representation as an OHCO, its sequential presentation and the occasional overlap between hierarchical levels posed unexpected challenges.

1.1. The structure of the original Helsinki Corpus

The topmost structural division of the Helsinki Corpus is its chronological division into three main and eleven subsidiary parts, indicated both by file name prefixes (the main parts) and by the COCOA header. In the multi-file version of the corpus, which was used as the basis for the conversion, the most obvious level of hierarchy is the division of the corpus into 242 files. The individual file could not, however, be taken as a basic structural unit, because many of the files contain "groups of comparable texts" (Kytö 1996: 40). A text in the Helsinki Corpus is most usefully defined on the basis of the "text identifier" or 'Q-line' (the value of the code <Q in the COCOA header), which means that each text can be identified by a header carrying a new text identifier.

While many of the files contain only a single text, and thus a single header at the beginning of the file, there are also files containing several texts, each preceded by its own header. These have all been treated as individual texts in the TEI XML Edition. In addition to headers indicating the beginning of a new text, the original Helsinki Corpus also contains mid-file headers that do not contain a Q-line and thus do not indicate a new text, but rather a part within a text which was considered to require different metadata parameters. Sections of this kind form another layer of structural division, annotated in the TEI XML Edition as subdivisions of a text.

In addition to these internal headers, texts are also divided by 'sample codes' (e.g. <S SAMPLE 1>) into text samples. These samples form a tessellated structure, covering the contents of the texts in their entirety. While in themselves unproblematic, the samples can interact with the subdivisions in varying ways, being either sub- or superordinate to them, meaning that a sample can both contain several subdivisions and form a part of a subdivision consisting of several samples. There are also some cases of structurally problematic overlap in the corpus, where a subdivision consists of several whole samples and a part of one, necessitating the creation of two subdivisions, one containing the several whole samples and another containing the relevant part of the remaining sample.

While all the divisions discussed above are preserved in the TEI XML Edition of the corpus, its structural organization is based on the text and the sample, with the other divisions accommodated around them.

1.2. Annotating the corpus structure in TEI XML

The XML version of the corpus is realized as a single XML document, having a teiCorpus element as its root element, within which the whole corpus is contained. As per the TEI Guidelines, the contents of the teiCorpus element are made up of a single teiHeader element (the corpus header) which contains all of the metadata pertaining to the whole corpus, and a series of 432 TEI elements, each representing a single corpus text.

These TEI elements in turn each consist of a single teiHeader element, this time containing all of the metadata pertaining to that individual text, including the information encoded in the original COCOA header and the bibliographical information, and a text element, containing the textual content of that corpus text. Internally, each text element is divided into consecutive divisions (div), representing individual samples. Subdivisions in the texts are also represented using div elements, distinguished from samples using the type attribute.

The division of the entire corpus into chronological parts is indicated using the attribute n for each TEI element, containing as its value the 'Part of Corpus' (<C) code of the text. The specific file that contained the text in the original version of the corpus is indicated in the teiHeader of each text, mainly for reference purposes.

Thus, the internal structure of a text containing several samples and subdivisions is represented by an XML structure of the following kind:

<TEI xml:id="textID" n="partOfCorpus">
  <teiHeader>
    <!--Text-specific metadata.-->
  </teiHeader>
  <text>
    <div type="sample" n="sample1">
      <!--Textual content of the first sample.-->
    </div>
    <div type="subdivision">
      <div type="sample" n="sample2">
        <!--Textual content of the second sample.-->
      </div>
      <div type="sample" n="sample3">
        <!--Textual content of the third sample.-->
      </div>
    </div>
  </text>
</TEI>

2. The TEI Header

All of the metadata for both the corpus itself as a whole and for each individual text is contained within teiHeader elements included in the root element of the corpus (teiCorpus) and in each of the TEI elements.

2.1. The Corpus Header

The teiHeader for the entire corpus contains three parts, documenting the different aspects of the corpus and its creation process: fileDesc (file description), containing the bibliographic information for the corpus, encodingDesc (encoding description), containing information about the technical aspects of the corpus, and revisionDesc (revision description) containing information on the production process of the TEI XML Edition of the corpus.

2.1.1. File description

The file description contains the information that not only identifies the corpus, but also documents the various people and institutions involved in its production and their roles in a structured way. The first component of the file description is the titleStmt (title statement) which contains the title of the corpus, followed by a series of respStmt (responsibility statement) elements, each documenting an aspect of the corpus creation and the names of the people responsible for it.

This is followed by the publicationStmt (publication statement), documenting the publication and availability of the corpus, the notesStmt, containing various notes pertaining to the document, and the sourceDesc (source description), containing the bibliographic details of the source for the document, in this case, the original Helsinki Corpus.

2.1.2. Encoding description

The encoding description contains information related to the construction process of the corpus and the way various things are encoded and annotated. Its first component is a brief project description (projectDesc), followed by descriptions of the sampling practices and editorial principles (samplingDecl and editorialDecl), which in the case of this corpus amount to references to the manual of the original Helsinki Corpus. These are followed by a list of the XML elements used in the corpus (tagsDecl) and finally, a formal representation of all the taxonomies used in the corpus for classifying texts. These taxonomies represent the reference code values used in the COCOA header of the original version of the Helsinki Corpus in a structured format (see Kytö (1996: 43-56)) and are referred to by the classification elements in the TEI header of each text.

2.1.3. Revision description

The revision description (revisionDesc) consists of a series of change elements which document changes made to the corpus document since its creation and identify the dates of the changes and the persons responsible for them.
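As a rough orientation, the three-part corpus header described in section 2.1 might be outlined as follows. This is only a structural sketch assembled from the description above: the comments stand in for actual content, and the exact order and number of child elements in the real corpus header are assumptions.

```xml
<teiCorpus>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title><!--Title of the corpus.--></title>
        <respStmt>
          <resp><!--One aspect of the corpus creation.--></resp>
          <name><!--Person responsible for it.--></name>
        </respStmt>
        <!--Further respStmt elements.-->
      </titleStmt>
      <publicationStmt>
        <!--Publication and availability of the corpus.-->
      </publicationStmt>
      <notesStmt>
        <!--Notes pertaining to the document.-->
      </notesStmt>
      <sourceDesc>
        <!--Bibliographic details of the original Helsinki Corpus.-->
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <projectDesc><!--Brief project description.--></projectDesc>
      <samplingDecl><!--Reference to the original manual.--></samplingDecl>
      <editorialDecl><!--Reference to the original manual.--></editorialDecl>
      <tagsDecl><!--List of the XML elements used.--></tagsDecl>
      <classDecl><!--Taxonomies used for classifying texts.--></classDecl>
    </encodingDesc>
    <revisionDesc>
      <change><!--One change, with its date and responsible person.--></change>
    </revisionDesc>
  </teiHeader>
  <!--The TEI elements for the individual corpus texts follow here.-->
</teiCorpus>
```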

2.2. Text Headers

The TEI headers of individual corpus texts are slightly different in structure from the main corpus header. Such a header consists of a file description, which contains all of the bibliographic information for the text, drawn both from the original COCOA header and from the bibliographic notes contained in the original corpus files, and a profile description, which contains the classificatory data of the original COCOA header.

2.2.1. File description

The file description (fileDesc) contains a bibliographic description of the corpus text. As in the corpus header, its first element is the titleStmt, which contains the title of the text in expanded form, along with the name of the author and a reference to the original corpus compilers responsible for the text in question. The title element also contains the values of the <Q ('text identifier') and <N ('name of text') lines of the COCOA header in the original format as the values of the attributes key and n, respectively. The title statements of texts that have separately titled subdivisions (see above) can have several title elements, each with its own key and n attributes. The original value of the <A ('author') line of the COCOA header is encoded as the value of the key attribute on the author element.

The title statement is followed by the extent element, which indicates, first, the word count of the text, counted simply as the number of whitespace-separated words in the text (excluding the contents of note and sic elements), and second, the name of the file that contained the text in the original version of the Helsinki Corpus (the value of the <B line in the original COCOA header). The publication statement (publicationStmt) of an individual corpus text, which is a mandatory part of every teiHeader, consists simply of a reference to the whole corpus document, of which the text is a part.

The source description (sourceDesc) for each corpus text contains one or more bibliographical entries (bibl or biblStruct). A bibl element contains a freeform description of a bibliographical source, and is used in the Old English part of the corpus to refer to the Toronto Corpus of Old English used as the immediate source for the corpus text. A biblStruct element contains a fully structured entry of the bibliographical data found in the note following the COCOA header in the original version of the corpus, including not only the title, editor and publication data for the source edition, but also information on the page ranges included in the corpus, all annotated with the appropriate TEI XML elements.
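The file description of a single text, as described in this section, could be sketched roughly as follows. The attribute values and element contents are placeholders, not taken from the corpus, and the internal layout of the extent element in particular is an assumption.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <!--key = value of the <Q line, n = value of the <N line.-->
      <title key="Q-LINE VALUE" n="N-LINE VALUE">Title of the text in expanded form</title>
      <!--key = value of the <A line.-->
      <author key="A-LINE VALUE">Name of the author</author>
      <!--Reference to the original compilers responsible for the text.-->
    </titleStmt>
    <extent>
      <!--Word count and original file name (value of the <B line).-->
    </extent>
    <publicationStmt>
      <!--Reference to the whole corpus document.-->
    </publicationStmt>
    <sourceDesc>
      <biblStruct>
        <!--Title, editor, publication data and page ranges of the source edition.-->
      </biblStruct>
    </sourceDesc>
  </fileDesc>
  <!--The profile description follows here.-->
</teiHeader>
```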

2.2.2. Profile description

The profileDesc (profile description) element contains all of the classificatory data contained in the COCOA header of the original version. The bulk of the profile description is taken up by a textClass element, containing a single catRef (category reference) element for each of the classificatory lines of the original header (<O through <Z). Each of these empty elements contains the original line identifier letter as the value of the n attribute, along with a reference to one of the taxonomies defined in the corpus header as the value of the scheme attribute, and a reference to a value within that taxonomy as the value of the target attribute. The values of the scheme and target attributes identify a single category element within the classDecl part of the corpus header, containing a prose description of the value. The original value, as found in the COCOA header, is also encoded in the n attribute of each category element for reference purposes.
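The link between a category reference in a text header and its definition in the corpus header might look something like the fragment below. The identifiers taxonomyO and valueO1 are hypothetical names chosen for illustration; the actual identifiers used in the corpus are not documented here.

```xml
<!--In the header of an individual text.-->
<profileDesc>
  <textClass>
    <!--One catRef per classificatory COCOA line; n holds the line letter.-->
    <catRef n="O" scheme="#taxonomyO" target="#valueO1"/>
  </textClass>
</profileDesc>

<!--In the classDecl of the corpus header.-->
<classDecl>
  <taxonomy xml:id="taxonomyO">
    <!--n holds the value as found in the original COCOA header.-->
    <category xml:id="valueO1" n="ORIGINAL COCOA VALUE">
      <catDesc>Prose description of the value.</catDesc>
    </category>
  </taxonomy>
</classDecl>
```

Resolving the scheme and target pointers of a catRef therefore leads to exactly one category element, from which both the prose description and the original code value can be read.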

In addition to the original COCOA lines being represented as category references, the datings of the manuscript and of the original text are also encoded in 'TEI-native' format using two date elements (with type values original and manuscript) within a creation element. Some of the texts in the Middle English part were also localized on the basis of the Linguistic Atlas of Early Middle English (LAEME), and a placeName element has been added inside the creation element, indicating their likely location of origin.

The principal language and dialect used in the text (based on the value of the <D line of the COCOA header) has been encoded using a langUsage element containing a language element, which provides not only a prose description of the language and dialect that the text represents as its content, but also a language identifier constructed according to BCP 47 as the value of the ident attribute.
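The dating, localization and language information described above could be sketched as follows. The contents are placeholders, and the identifier enm (the registered language subtag for Middle English) is used only as an example; the identifiers in the corpus may be more specific, for instance to distinguish dialects.

```xml
<profileDesc>
  <creation>
    <date type="original"><!--Dating of the original text.--></date>
    <date type="manuscript"><!--Dating of the manuscript.--></date>
    <!--Only for Middle English texts localized on the basis of LAEME.-->
    <placeName><!--Likely location of origin.--></placeName>
  </creation>
  <langUsage>
    <!--ident holds a language identifier constructed according to BCP 47.-->
    <language ident="enm">Prose description of the language and dialect.</language>
  </langUsage>
  <!--The textClass element with its catRef elements also belongs here.-->
</profileDesc>
```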

3. Annotation

This section describes the annotation used in the corpus and how it replicates the original annotation.

3.1. Textual structure

The basic textual structure, i.e. paragraph, line and word division, along with headings, follows the original version of the Helsinki Corpus, paragraph and line division being annotated using the relevant TEI elements (p and lb) and word division being implicitly indicated by spaces as in the original.

3.1.1. Paragraph division

According to the manual of the original version (Kytö 1996: 24), the beginning of a new paragraph in the original Helsinki Corpus is indicated by an indentation of three spaces. However, examination of the corpus reveals that empty lines were also used for the same purpose. In the TEI XML Edition, paragraphs (not only their beginnings) are indicated by enclosing them within a p element. In addition to explicitly indicated paragraphs, all segments of text preceded by a header are also considered to constitute paragraphs and are annotated as such. The entire content of a text containing no text-medial headings or any explicit indications of paragraphs is annotated as a single paragraph.

3.1.2. Line division

Line division in the original corpus is indicated by the lineation of the text file, apart from long lines which are divided onto two lines, the first (incomplete) one being marked by a line-final hash (#). In the TEI XML Edition, these divided lines are combined onto a single line (as line length is no longer an issue), and all lines of text are explicitly marked using the empty lb element at the beginning of the line. Lines containing only structural annotation, such as page breaks, in the original have not been annotated with this element, since they have no textual content. Empty lines at the ends of texts and major document divisions, as well as those preceding page breaks, have also been ignored where they do not represent significant empty space in the original.
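The paragraph and line conventions described in sections 3.1.1 and 3.1.2 can be illustrated with a schematic example. The text is invented, and the treatment of whitespace around the line-final hash is an assumption.

```xml
<!--Original (schematic): a three-space indent opens the paragraph,
    and a line-final hash marks a long line split in two.
   first line of the paragraph, too long for one line and #
therefore divided
second line of the paragraph
-->
<p>
<lb/>first line of the paragraph, too long for one line and therefore divided
<lb/>second line of the paragraph
</p>
```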

3.1.3. Headings

The annotation of headings in the text follows that of the original, and text contained within the 'Heading' codes [}...}] in the original is contained within a head element in the TEI XML Edition.

3.1.4. Page breaks and other milestones

The annotation of page breaks follows that of the original with the original 'page' codes <P being replaced by the empty pb element. The value of the original page code, which could be a page number, a folio number or a combination of these with a volume number, is encoded as the value of the n attribute on the pb element.

Original page codes indicating both the beginning of a new page and the beginning of the first column of that page have been replaced by a pb element and a cb (column change) element, while page codes indicating the beginning of a subsequent column on the same page have been replaced by a cb element. The cb elements use the n attribute to indicate the running number of the column within the page.

In the original Helsinki Corpus, the page code was also used in Bible texts to indicate chapter and verse divisions, by inserting one every 20 verses. These page codes have been replaced by milestone elements with a type of scriptural. The milestone element, with a type of Toronto, is also used to represent the Toronto corpus 'record' code (<R) in texts belonging to the Old English part of the corpus.
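For page and column breaks, the encoding described above might look like the following fragment. The page number 12 is an invented value; milestone elements are left out because the attributes they carry beyond type are not specified here.

```xml
<!--A page code that starts a new page and its first column.-->
<pb n="12"/><cb n="1"/>
<!--Text of the first column.-->

<!--A page code that starts the next column on the same page.-->
<cb n="2"/>
<!--Text of the second column.-->
```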

3.1.5. Additional structural annotation

In addition to structural annotation based on the original version of the Helsinki Corpus, certain types of texts, namely verse texts, dramatic texts and letters, have also received additional structural annotation. Verse texts (as indicated by the <V code in the original header), except those belonging to the Old English part of the corpus, have been annotated for verse lines using the l (verse line) element, with the lg (line group) element used for indicating stanzas and other equivalent groupings of lines. OE verse texts have been left unannotated because their format in the original Helsinki Corpus did not make it possible to add the required l tags automatically, and manual annotation would have been too time-consuming at this stage.
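A minimal sketch of the verse annotation, with comments standing in for the text, is given below. How the l elements combine with the lb elements of the basic line division is not shown and should not be inferred from this fragment.

```xml
<lg>
  <l><!--First verse line of the stanza.--></l>
  <l><!--Second verse line of the stanza.--></l>
</lg>
<lg>
  <l><!--First verse line of the next stanza.--></l>
</lg>
```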

Dialogic texts, including texts categorized as drama, trial reports containing dialogue, and those texts categorized as interactive that contain explicitly indicated dialogue, have also been annotated for speech. Individual speech acts have been annotated using the sp (speech) element, containing either paragraphs (prose) or lines (verse) of text. Paragraphs or lines that span several speeches have been broken at the speech boundary and annotated with the part attribute (using values I (initial), M (medial) and F (final)) to indicate that they form a part of a paragraph or line. Speaker labels have been annotated using the speaker element.
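The use of the part attribute for a verse line shared between two speakers can be sketched as follows. The speaker labels and wording are invented for illustration.

```xml
<sp>
  <speaker>First speaker</speaker>
  <!--The verse line begins in this speech.-->
  <l part="I">opening words of the line</l>
</sp>
<sp>
  <speaker>Second speaker</speaker>
  <!--The same verse line is completed by the next speaker.-->
  <l part="F">closing words of the line</l>
  <!--A line that falls wholly within one speech needs no part attribute.-->
  <l>a complete line</l>
</sp>
```

A line shared between three or more speakers would additionally have one or more fragments marked with the value M.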

Letters have been annotated for openers, closers and postscripts where such features have been typographically marked in the editions; the opener, closer and postscript tags have been used for this purpose. Endorsements in the form of an address have also been annotated with a note element having a type of address.
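The letter annotation might be outlined as in the fragment below. The comments stand in for the text, and the position of the address note relative to the other elements is an assumption.

```xml
<opener><!--Opening formula marked typographically in the edition.--></opener>
<p><!--Body of the letter.--></p>
<closer><!--Closing formula and signature.--></closer>
<postscript>
  <p><!--Text of the postscript.--></p>
</postscript>
<note type="address"><!--Endorsement in the form of an address.--></note>
```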

Prose texts have not been annotated for chapters or equivalent structural units in this version of the corpus due to the amount of manual labour required for the work.

3.2. Textual and paratextual features

The annotation of textual and paratextual features in the TEI XML Edition of the Helsinki Corpus follows that of the original version. No additional annotation has been included, and the level of detail in existing annotation has been limited by the original.

3.2.1. Special characters, accents and punctuation

Non-ASCII characters that were represented by various character codes involving the + symbol in the original version have been replaced by the appropriate Unicode character. The ampersand symbol (&), which is a reserved character in XML, is represented by the &amp; entity. Similarly, the quotation mark (") and the apostrophe (') have been represented using the entities &quot; and &apos;, respectively. The following table summarizes the original and TEI XML representations of special characters:

Original Version   TEI XML Edition   Description
&                  &amp;             ampersand
"                  &quot;            double quotation mark
'                  &apos;            apostrophe
+a                 æ                 lower case ash
+A                 Æ                 upper case ash
+d                 ð                 lower case eth
+D                 Ð                 upper case eth
+g                 ȝ                 lower case yogh
+G                 Ȝ                 upper case yogh
+t                 þ                 lower case thorn
+T                 Þ                 upper case thorn
+tt                                  lower case crossed thorn
+TT / +Tt                            upper case crossed thorn
+e                 ę                 e caudata
+L                 £                 pound sign

Accents over characters, the different types of which were represented in the original using the same symbol (`) following the accented character, are represented by annotating accented characters with a hi element with a rend attribute value of accented.

The representation of punctuation in the TEI XML Edition follows that of the original version.
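A short schematic example of these conversions follows. The original strings are constructed for illustration from the codes listed in the table, not quoted from the corpus.

```xml
<!--Original: +t+at & +d+at   (character codes with +, bare ampersand)-->
þæt &amp; ðæt

<!--Original: e`   (accent symbol following the accented character)-->
<hi rend="accented">e</hi>
```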

3.2.2. Abbreviations and superscripts

All abbreviation symbols found in the source texts were represented by a single symbol, a tilde (~), in the original version. Since the TEI XML version is derived from this version, no distinctions between different kinds of abbreviation can be made. Instead, they are all represented by an empty am (abbreviation marker) element, located at the site of the tilde.

Superscript letters, annotated in the original version by surrounding them with two 'equals' symbols, =...=, are annotated by enclosing them within a hi element with a rend attribute value of sup.
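Schematically, the two conversions described in this section could look like this. The letter sequences are invented examples.

```xml
<!--Original: co~   (tilde standing for an abbreviation symbol)-->
co<am/>

<!--Original: y=t=   (superscript letter between equals signs)-->
y<hi rend="sup">t</hi>
```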

3.2.3. Type changes and runes

Changes of typeface, annotated by the code (^...^) in the original version, are indicated by enclosing words or phrases printed in a different typeface within a hi element with a rend attribute value of type. Since the specific typeface used for these passages is not encoded in the original version, the distinction is not made in the TEI XML Edition either.

In the Old English part of the corpus, words and phrases written in runic characters are distinguished by enclosing them within a hi element with a rend attribute value of rune, replacing the code (}...}) in the original.

3.2.4. Foreign language

Words and phrases in languages other than English were annotated in the original version by surrounding them with the code (\...\). In the TEI XML Edition, this code is replaced by the foreign element. Since the specific languages used were not annotated in the original, the element does not have an xml:lang attribute.
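The codes for type changes, runes and foreign-language passages described in sections 3.2.3 and 3.2.4 map onto TEI elements roughly as follows. The enclosed text is a placeholder.

```xml
<!--Original: (^words in another typeface^)-->
<hi rend="type">words in another typeface</hi>

<!--Original: (}runic text}), Old English part only-->
<hi rend="rune">runic text</hi>

<!--Original: (\words in another language\)-->
<foreign>words in another language</foreign>
```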

3.3. Emendation and notes

The original version of the corpus contains several kinds of annotation for editorial and compilatorial intervention, including notes by both the corpus compilers and the editor, and a record of selected editorial emendation.

3.3.1. Editorial emendation

Editorial emendations (corrections, text supplied from other manuscripts, expansions of abbreviations, etc.) made in the source editions were marked up in the original version using the code [{...{]. Although the category of "editorial emendation" covers quite a large variety of editorial changes to the text (see Kytö (1996: 31-32) for more details), their exact nature was not annotated in the original and thus cannot be indicated in the TEI XML Edition. For this reason, all of these emendations are annotated using the supplied element, which indicates that they do not originate in the original manuscript or printed text but are supplied by the editor. The editor is identified with the resp (responsibility) attribute, whose value is an XPath expression that points to the name(s) of the editor(s) in the bibliographic part of the header.
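An emendation might thus be encoded along the following lines. The word is invented, and the XPath expression in the resp attribute is a hypothetical example of the kind of pointer described, not the expression actually used in the corpus.

```xml
<!--Original: wor[{d{]   (the editor has supplied the final letter)-->
wor<supplied resp="ancestor::TEI/teiHeader//sourceDesc//editor">d</supplied>
```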

3.3.2. Comments

The original version of the Helsinki Corpus contains two types of comments added to the text: editor's comments and compilers' comments. Both the editor's comments, annotated with the code [\...\] in the original, and the compilers' comments, annotated with the code [^...^], are annotated using the note element in the TEI XML Edition. The source of the comment is indicated using the resp attribute, which points to an element representing the responsible party in the header (either the editor(s) of the source edition or some part of the corpus team). All of the comments, both by editors and by the compilers, are in their original form, i.e. capitalized.

3.3.3. Errata corrections

In addition to notes and editorial emendation contained in the original version of the corpus, the creation of the TEI XML Edition also involved the implementation of a large number of errata corrections collected over the years by the original corpus team. These corrections have been inserted into the corpus using the sic element, containing the original form, and the corr element, containing the corrected form, both contained by the choice element, indicating that these two forms are mutually exclusive alternatives. The resp attribute is used on both the sic and corr elements to indicate the parties responsible for their content. In addition to the corrections themselves, the errata correction process has also resulted in some new note elements in the corpus, identified by a resp value pointing to the errata correction responsibility statement in the header.
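An errata correction can be sketched as follows. The word forms and the two resp pointers are hypothetical and serve only to show the shape of the annotation.

```xml
<choice>
  <!--Form found in the original version of the corpus.-->
  <sic resp="#originalCompilers">teh</sic>
  <!--Corrected form from the errata list.-->
  <corr resp="#errataCorrection">the</corr>
</choice>
```

Because the word count given in the extent element of each text header excludes the contents of sic elements, a corrected word is counted only once.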

In addition to errata corrections made to the texts, the original version of one text in the Old English part of the corpus, The Durham Ritual (xml:id durham), was deemed to contain so many transcriptional errors that a decision was made to re-source it from a TEI-compliant file of the current version of the Dictionary of Old English Corpus (diPaolo Healey et al. 2009), kindly provided by the Dictionary of Old English Project. Due to the different conventions observed in the current version of the DOEC and the original, the lineation of the current version differs from that of the original.

Appendix: Descriptions of all elements used for annotating the Helsinki Corpus TEI XML Edition

This appendix is a descriptive list of all TEI XML elements used in the TEI XML Edition of the Helsinki Corpus, along with their attributes and their values. The descriptions are derived from the TEI Guidelines, but they are not intended to reflect the general use of the elements described there; they are specific to this corpus and reflect the ways in which the elements have been used in it. For more information on the more general meaning and usage of each element, refer to the TEI Guidelines Element list.

TEI (TEI document)

contains a single TEI-conformant document, comprising a TEI header and a text, containing a single corpus text.

Attributes

xml:id (identifier)

provides a unique identifier for the corpus text.

Value description: The identifier for each text is the abbreviated title of the original Helsinki Corpus text, i.e. the last component of the <Q> line of the original COCOA header, converted to lowercase.

n (number)

indicates the chronological part of the corpus to which the text belongs.

Value description: The value of the <C> line of the original COCOA header, which is also the first component of the <Q> line.

Remarks

A corpus text for the purposes of the organization of the XML version of the Helsinki Corpus has been defined as a part of the original corpus which has its own unique <Q> line (text identifier). This means that a single file of the multi-file version of the Helsinki Corpus can in some cases contain several texts.

am (abbreviation marker)

(empty element) indicates the presence of an abbreviation marker in the source, either above or to the right of the preceding character.

Remarks

Since different types of abbreviation markers are not distinguished in the original version of the Helsinki Corpus, being represented by a tilde (~) regardless of their visual appearance, the element has no content.

Used only within texts.

analytic (analytic level)

contains bibliographic elements describing an item (e.g. an article) published within a monograph or journal and not as an independent publication.

Remarks

The analytic element only occurs within a biblStruct.

Used only within the header.

att (attribute)

contains the name of an attribute appearing within running text.

Remarks

Used only in the revisionDesc element of the corpus header for documenting changes made to the corpus.

author

contains the name of the author of a text, either in the header of a corpus text or in a bibliographic entry.

Attributes

key

contains the value of the <A> field of the original COCOA header for reference purposes.

Value description: The author's name in the original Helsinki Corpus format.

Remarks

Used only for the author element contained in the titleStmt of a text.

ref (reference)

provides a valid XML pointer for the name of the author that can be used to refer to a biographic database or any other external resource.

Remarks

The value does not yet point to any existing XML element, but provides the basis for the future linking of a biographic database to the corpus.

Used only for the author element contained in the titleStmt of a text.

Remarks

This element is used both for annotating the names of the authors of the texts themselves in the header, and for annotating the names of authors of the source editions in the bibliographic entries of the sourceDesc.

The word Anonymous is used instead of a name for anonymous works.

Used only within the header.

authority (release authority)

supplies the name of the agency responsible for making the corpus available, other than a publisher or distributor.

Remarks

Used only within the header of the corpus itself.

availability

supplies information about the availability of the corpus, for example any restrictions on its use or distribution, its copyright status, etc.

Remarks

Used only within the header of the corpus itself.

bibl (bibliographic citation)

contains a loosely-structured bibliographic entry of which the sub-components may or may not be explicitly tagged.

Attributes

xml:id (identifier)

provides a unique identifier for the bibliographic entry so that it can be referred to.

Value description: any valid XML identifier.

Remarks

In the case of the same entry occurring several times, the xml:id occurs only on the first instance.

Remarks

Contains phrase-level elements, together with any combination of bibliographic elements.

Used only within the header.

biblScope (scope of citation)

defines the scope of a bibliographic reference, for example as a list of page numbers, or a named or numbered subdivision or item of a larger work.

Attributes

type

identifies the type of information conveyed by the element, when it is relevant for the rendering of the bibliographic entry.

Values:

  • vol (volume) : the element contains a volume number.
  • part : the element identifies a part of a book or collection.

n (number)

encodes the number (or other label) of an item covered by this scope.

Value description: the value is either a number or a string consisting of alphabetical and number characters, with original whitespace represented by underscores.

Remarks

These item numbers or identifiers are the ones provided in parentheses following the page numbers in the bibliographic entries of the original version (and preserved also in the XML version).

Remarks

Used only within bibliographic entries in the header.

biblStruct (structured bibliographic citation)

contains a structured bibliographic entry, in which only bibliographic sub-elements appear and in a specified order.

Attributes

xml:id (identifier)

provides a unique identifier for the bibliographic entry so that it can be referred to.

Value description: any valid XML identifier.

Remarks

In the case of the same entry occurring several times, the xml:id occurs only on the first instance.

Remarks

Used only within bibliographic entries in the header.

body (text body)

contains the body of a corpus text, excluding any front or back matter.

Remarks

Since the Helsinki Corpus does not contain any front or back matter of the texts, the body element covers the entire contents of the text element and is retained only for reasons of TEI conformance.

castItem (cast list item)

contains a single entry within a cast list, describing a single role.

Remarks

Used only within dramatic texts.

castList (cast list)

contains a single cast list or dramatis personae.

Remarks

Used only within dramatic texts.

catDesc (category description)

describes a category within a taxonomy in the form of a brief prose description.

Remarks

Used only within the header of the corpus itself.

catRef (category reference)

classifies the text in terms of a single typology by specifying a category within one of the taxonomies defined in the corpus header.

Attributes

n (number)

indicates the original Helsinki Corpus reference code corresponding to the particular classification typology.

Value description: The COCOA line identifier (i.e. letter) of the relevant Helsinki Corpus reference code.

scheme

identifies the classification scheme within which the set of categories concerned is defined.

Value description: An XML pointer referencing the xml:id identifier of the associated taxonomy element in the corpus header.

target

indicates the category in the specified taxonomy to which this text (or part of text) belongs.

Value description: An XML pointer referencing the xml:id identifier of the category element of the relevant taxonomy to which the text (or part of text) is associated.

Remarks

The catRef element is empty; all of the classificatory information is encoded in the attributes described above.

Used only within the headers of the corpus texts.
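
As a minimal sketch of how the scheme and target pointers can be followed, the Python snippet below resolves a catRef to the category it references. The XML fragment, its identifiers and code values, and the helper name resolve_catref are invented for illustration, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: identifiers, code values and descriptions are invented.
SAMPLE = """<root>
  <taxonomy xml:id="tax_example">
    <category xml:id="tax_example_a" n="A">
      <catDesc>Example category</catDesc>
    </category>
  </taxonomy>
  <catRef n="X" scheme="#tax_example" target="#tax_example_a"/>
</root>"""

def resolve_catref(root, catref):
    """Return (original code value, description) of the referenced category."""
    scheme_id = catref.get("scheme").lstrip("#")
    target_id = catref.get("target").lstrip("#")
    for taxonomy in root.iter("taxonomy"):
        if taxonomy.get(XML_ID) != scheme_id:
            continue
        for category in taxonomy.iter("category"):
            if category.get(XML_ID) == target_id:
                return category.get("n"), category.findtext("catDesc")
    return None  # pointer does not resolve

root = ET.fromstring(SAMPLE)
result = resolve_catref(root, root.find("catRef"))
```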

category

contains an individual descriptive category within a taxonomy.

Attributes

xml:id (identifier)

provides a unique identifier for the category so that it can be referred to in the headers of the texts.

Value description: a unique XML identifier derived from the original Helsinki Corpus COCOA value.

n (number)

indicates the original Helsinki Corpus reference code value corresponding to the particular category.

Value description: The reference code value corresponding to the category in the original Helsinki Corpus COCOA format.

Remarks

Contains a single catDesc element containing a description of the category.

Used only within the header of the corpus itself.

cb (column break)

(empty element) marks the beginning of a text column in a multicolumn text.

Attributes

n (number)

gives a number for the column beginning at this point.

Value description: the value is one or more space delimited integers representing the running numbers of the text columns on the page spanned by the text following the element.

Remarks

For a two-column page, 1 indicates the left column and 2 the right one, while a value of 1 2 indicates text spanning both columns, i.e. the whole page.

Remarks

By convention, the cb element is placed at the head of the column to which it refers. Columns are numbered within the page, each pb element resetting the column layout, i.e. the page indicated by the next pb element is considered to be a single-column one, unless a new cb element occurs after it.

Used only within texts.
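
The interplay of pb and cb can be sketched in Python as follows. The fragment is hypothetical and omits the TEI namespace; the function name column_layout and the use of None to stand for a single-column page are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: page 2 has two columns, pages 1 and 3 have one.
SAMPLE = """<div>
  <pb n="1"/><p>a</p>
  <pb n="2"/><cb n="1"/><p>b</p><cb n="2"/><p>c</p><cb n="1 2"/><p>d</p>
  <pb n="3"/><p>e</p>
</div>"""

def column_layout(div):
    """Return (page, columns, text) for each paragraph directly inside div."""
    page, columns = None, None  # columns is None on a single-column page
    rows = []
    for el in div:
        if el.tag == "pb":
            # Each pb resets the column layout to single-column.
            page, columns = el.get("n"), None
        elif el.tag == "cb":
            # The n attribute holds one or more space-delimited integers.
            columns = [int(v) for v in el.get("n").split()]
        elif el.tag == "p":
            rows.append((page, columns, el.text))
    return rows

rows = column_layout(ET.fromstring(SAMPLE))
```

Here the paragraph following the cb with the value 1 2 is reported as spanning both columns, while the paragraph on page 3 is treated as single-column because no cb follows the pb.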

change

summarizes a particular change or series of changes made to a particular version of the corpus.

Attributes

when

indicates the date of the change, in the yyyy-mm-dd format.

Value description: A normalized form of temporal expression conforming to the W3C XML Schema Part 2: Datatypes Second Edition.

who

indicates the person or persons responsible for the change.

Value description: An XML pointer referring to the xml:id of a name element in the responsibility statements.

n (number)

indicates the version number of the corpus which contains the change described by the element.

Value description: a decimal format version number (e.g. 1.03).

Remarks

Each change element contains a prose description of the changes made to the corpus.

Used only within the header of the corpus itself.

choice

groups together a transcription or annotation found to be erroneous in the original corpus and its correction.

Remarks

Contains a single error annotated by a sic element and its associated correction annotated by a corr element.

Used only within texts.

classDecl (classification declarations)

contains taxonomies defining the classificatory codes used to classify and describe the texts.

Remarks

Contains formal representations of the taxonomies used to describe the texts in the COCOA headers of the original Helsinki Corpus, connecting together the original reference code values, the corresponding XML identifiers used in the XML version of the corpus and their explanations.

Used only within the header of the corpus itself.

closer

groups together salutations, datelines, and similar phrases appearing as a final group at the end of a division, especially of a letter.

Remarks

Used only within texts.

corr (correction)

contains the correct transcription or annotation of a word or phrase found to be erroneous.

Attributes

resp (responsible party)

indicates the agency responsible for the correction.

Value description: A pointer to a responsibility statement in the corpus header.

Remarks

The most common value used for this attribute in the corpus is #HC_XML_errata_corrections, referring to the editors responsible for the errata corrections made to the XML version of the Helsinki Corpus.

Remarks

In the corpus, the corr element always occurs within a choice element, together with a sic element indicating the original, erroneous reading.

These corrections are based on errata files collected by the compilation team, most notably Matti Kilpiö, since the publication of the original version of the Helsinki Corpus.

Used only within texts.
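
Because each correction is paired with the original reading inside a choice element, either reading of a passage can be recovered. The Python sketch below illustrates this on an invented sentence; the helper name reading is hypothetical and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence; only the resp value is taken from the documentation.
SAMPLE = ('<p>He <choice><sic>seyd</sic>'
          '<corr resp="#HC_XML_errata_corrections">seyde</corr></choice> so.</p>')

def reading(elem, prefer="corr"):
    """Flatten elem to text, keeping only the preferred child of each choice."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            chosen = child.find(prefer)
            if chosen is not None:
                parts.append(reading(chosen, prefer))
        else:
            parts.append(reading(child, prefer))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)

p = ET.fromstring(SAMPLE)
corrected = reading(p)               # follows the corr elements
original = reading(p, prefer="sic")  # follows the sic elements
```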

creation

contains information about the creation date and place of a text.

Remarks

Contains two date elements, one of type original and another of type manuscript, indicating the date ranges for the creation of the original work and of the manuscript version used as a source, based on the original Helsinki Corpus COCOA header.

Middle English texts that have been localized on the basis of the Linguistic Atlas of Early Middle English (LAEME) also contain a placeName element indicating the likely place of the text's creation.

Used only within the headers of the corpus texts.

date

contains a date in any format.

Attributes

type

is used to distinguish between different types of dates for a single text or publication.

Values:

  • original : indicates that the date range refers to the original composition of the text
  • manuscript : indicates that the date range refers to the production of the manuscript copy of the text used as a source
  • first : (in bibliographic entries) indicates that the date refers to the original publication date of the first edition of the work

when

supplies the value of the date, to a year, in a standard form, i.e. yyyy.

Value description: A normalized form of temporal expression conforming to the W3C XML Schema Part 2: Datatypes Second Edition.

from

indicates the starting point of a time period, to a year, in a standard form, i.e. yyyy.

Value description: A normalized form of temporal expression conforming to the W3C XML Schema Part 2: Datatypes Second Edition.

to

indicates the ending point of a time period, to a year, in standard form, i.e. yyyy.

Value description: A normalized form of temporal expression conforming to the W3C XML Schema Part 2: Datatypes Second Edition.

Remarks

Used only within the header.
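
The following Python sketch shows one way of reading the dating information from a creation element. It assumes that the date ranges are carried by the from and to attributes (or by when for a single year); the years in the fragment are invented, the function names are hypothetical, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with invented years.
SAMPLE = """<creation>
  <date type="original" from="1350" to="1420"/>
  <date type="manuscript" from="1420" to="1500"/>
</creation>"""

def to_year(value):
    """Convert a yyyy string to an integer, or None if the bound is missing."""
    return int(value) if value else None

def date_ranges(creation):
    """Map each date type to a (start year, end year) pair."""
    ranges = {}
    for date in creation.iter("date"):
        when = date.get("when")
        if when is not None:
            # A single year is treated as a range of length zero.
            ranges[date.get("type")] = (to_year(when), to_year(when))
        else:
            ranges[date.get("type")] = (to_year(date.get("from")),
                                        to_year(date.get("to")))
    return ranges

ranges = date_ranges(ET.fromstring(SAMPLE))
```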

distributor

supplies the name of the agency responsible for the distribution of the corpus.

Remarks

Used only within the header of the corpus itself.

div (text division)

contains a subdivision of a corpus text, reflecting either an original document division in the source or a structural component of the original Helsinki Corpus.

Attributes

type

is used to indicate the type of the division.

Values:

  • sample : indicates that the division represents a single sample in the original version of the Helsinki Corpus, originally indicated by the code <S>
  • subdivision : indicates that the division represents a subdivision within a text
  • document : indicates that the division represents an individual administrative, legal or other type of document within a corpus text
  • letter : indicates that the division represents an individual letter within a corpus text
  • scene : indicates that the division represents a scene (complete or incomplete) in a dramatic corpus text

Remarks

Divisions of different types can nest inside each other, which means that there can be for example a sample division containing several document divisions and a subdivision, which again contains one or more document divisions.

All of the textual content of the corpus is contained by div elements of type sample, each text being divided into one or more samples.

A subdivision is distinguished from the rest of the text either by having a separate title (linked to the subdivision's xml:id by a ref attribute of the title element in the header) or a separate text classification (linked to the subdivision by a decls attribute pointing to the xml:id of the classification) associated with it. In the original corpus, these subdivisions had a separate COCOA header of their own in the middle of the corpus text, but no unique <Q> line. Subdivisions can also coincide with samples, in which case they are not separately annotated, the relevant attributes being added to the sample div instead.

Note that while div elements of types sample and subdivision represent structural features of the original Helsinki Corpus, the other types represent structural features of the source texts.

n (number)

contains the identifier of a sample.

Value description: The value of the <S> code in the original Helsinki Corpus.

Remarks

Used only on div elements of type sample.

xml:id (identifier)

provides a unique identifier for those divisions that have a different title from the rest of the text.

Value description: The identifier for these text divisions consists of the identifier of the text, a full stop (.), and the original Helsinki Corpus name for this part of the text (the value of the <N> line of its original COCOA header) in lowercase, with whitespace replaced by underscores.

Remarks

These divisions - which can be either whole samples, subdivisions of samples or subdivisions of the whole text containing several samples - were indicated in the original Helsinki Corpus by a COCOA header inside the corpus text, containing an <N> line whose value was different from that of the initial header.
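
The construction of these identifiers can be sketched in Python as follows. The function name division_id and the example values are hypothetical, and the sketch assumes that every whitespace character is replaced by one underscore.

```python
import re

def division_id(text_id: str, n_value: str) -> str:
    """Build a division identifier from a text identifier and an <N> value."""
    # Lowercase the original name and replace each whitespace character
    # with an underscore (an assumption about how runs are treated).
    name = re.sub(r"\s", "_", n_value.strip().lower())
    return f"{text_id}.{name}"

example = division_id("exampletext", "PART TWO")
```

With these invented inputs the result is exampletext.part_two.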

decls

identifies a textClass element within the header, which applies to the text division in question instead of the primary classification of the text.

Value description: A valid XML pointer referencing the xml:id of the relevant textClass element in the header.

Remarks

Divisions bearing this attribute - which can be either whole samples, subdivisions of samples or subdivisions of the whole text containing several samples - were indicated in the original Helsinki Corpus by a COCOA header inside the corpus text, containing differences on some of the lines apart from the <N> line but not containing a <Q> line.

Remarks

Scenes in dramatic texts have only been annotated where they are explicitly indicated by a heading. Chapters and other structural divisions in prose texts have not been annotated in the current version, since they were not explicitly indicated in the original version of the Helsinki Corpus and their manual or semi-automatic annotation is beyond the project's current resources.

Used only within texts.

edition (edition)

indicates the edition of a text used as a source in a bibliographic entry.

Remarks

Used only within the header.

editor

secondary statement of responsibility for a bibliographic item or a corpus text.

Attributes

role

used to specify further information about the editor in order to qualify his or her role in the production of the corpus or a bibliographic item.

Values:

  • compiler : indicates that the editors referred to by this element are responsible for the corpus compilation of the text in which the editor element occurs
  • general : indicates that the editor annotated by this element is the general editor of a source edition
  • facsimile : indicates that the editor annotated by this element is the facsimile editor of a source edition

ref (reference)

contains an XML pointer pointing to the responsibility statement indicating the corpus compilers responsible for this text.

Remarks

Used only for the editor element contained in the titleStmt of a text.

n (number)

used to number multiple editors of a bibliographic item.

Value description: the value is an integer, unique to the individual editor within that biblStruct.

Remarks

Used only for editor elements occurring within bibliographic items.

Remarks

The editor element can contain either the name of an individual editor or be empty, in which case the ref attribute is required.

Used only within the header.

editorialDecl (editorial practice declaration)

provides details of editorial principles and practices applied during the encoding of the corpus.

Remarks

Used only within the header of the corpus itself.

encodingDesc (encoding description)

specifies the methods and editorial principles which governed the transcription or encoding of the corpus, as well as the tags and classificatory taxonomies that are used to annotate and describe it.

Remarks

Used only within the header of the corpus itself.

extent

contains elements describing the size of a corpus text in terms of its word count and identifying the original Helsinki Corpus file in which it was stored.

Remarks

Used only within the header.

fileDesc (file description)

contains a full bibliographic description of the corpus text.

Remarks

For the entire corpus, provides all the bibliographic information, responsibility statements and the publication and source information for the XML version of the corpus.

For individual corpus texts, provides the title of the corpus text (both in the original HC format and in expanded form), along with its original author and a reference to the corpus compiler responsible for the particular text in the original corpus. Also contains a full bibliographic description for the source or sources from which the electronic text was derived inside a sourceDesc element.

Used only within the header.

foreign (foreign)

identifies a word or phrase as belonging to some language other than that of the surrounding text.

Remarks

The xml:lang attribute has not been used to indicate the language, as this information was not encoded in the original version of the Helsinki Corpus.

The principal language of the text, from which the foreign element indicates a divergence, is indicated by the value of the ident attribute of a language element in the langUsage section of the profileDesc, located in the TEI header of the text.

All text segments annotated using the (\...\) annotation in the original Helsinki Corpus have been annotated using this element.

Used only within texts.

forename

contains a forename, given or baptismal name of an author.

Remarks

The only names annotated for their component parts are those of the original authors of the corpus texts, contained in an author element in the titleStmt within the header of individual corpus texts.

Used only within text headers.

funder (funding body)

specifies the names of the institutions or organizations responsible for the funding of the corpus.

Remarks

Used only within the header of the corpus itself.

genName (generational name component)

contains a name component used to distinguish otherwise similar names on the basis of the relative ages or generations of the persons named (e.g. VIII or Jr).

Remarks

The only names annotated for their component parts are those of the original authors of the corpus texts, contained in an author element in the titleStmt within the header of individual corpus texts.

Used only within text headers.

gi (element name)

contains the name (generic identifier) of an element.

Remarks

Used only in the revisionDesc element of the corpus header for documenting changes made to the corpus.

head (heading)

contains any type of heading.

Remarks

Since the texts in the original version of the Helsinki Corpus were not annotated for chapter or section divisions, no div elements have been added to indicate chapters or sections in prose texts. As a result, headings can also occur in the middle of divisions, alternating with other paragraph-level elements. This has required a slight modification to the TEI schema, which otherwise allows the head element only at the beginning of a div element.

Used both within texts and their TEI headers.

hi (highlighted)

marks a character, word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made.

Attributes

rend (rendition)

indicates the way in which the content of the element was graphically distinct in the original.

Values:

  • rune : indicates that the character or a sequence of characters was written in a runic alphabet in the original (annotated by (}...}) in the original Helsinki Corpus)
  • sup : indicates that the character or a sequence of characters was written as superscript in the original (annotated by =...= in the original Helsinki Corpus)
  • type : indicates that the word or phrase was printed using a typeface different from the surrounding text in the original (annotated by (^...^) in the original Helsinki Corpus)
  • accented : indicates that the character had an accent symbol of some kind in the original (indicated by the generic accent marker ` following the accented character)

Remarks

The specific typeface or the kind of accent used is not indicated in the original version of the Helsinki Corpus.

Remarks

Used only within texts.
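
The correspondence between the original annotation codes and the rend values can be illustrated with a small Python sketch. This is not the conversion procedure used for the corpus: the function name to_hi and the sample string are invented, nested codes are not handled, and no XML escaping is performed.

```python
import re

# One alternative per original code; the group name is the rend value.
TOKEN = re.compile(
    r"\(\}(?P<rune>.*?)\}\)"      # (}...}) runic characters
    r"|\(\^(?P<type>.*?)\^\)"     # (^...^) different typeface
    r"|=(?P<sup>.*?)="            # =...=   superscript
    r"|(?P<accented>\S)`"         # x`      accented character
)

def to_hi(text: str) -> str:
    """Replace original highlighting codes with hi elements."""
    def repl(match):
        rend = match.lastgroup  # name of the alternative that matched
        return f'<hi rend="{rend}">{match.group(rend)}</hi>'
    return TOKEN.sub(repl, text)

converted = to_hi("w=t= (^Item^) e`")
```

A single combined pattern is used so that the equals signs in the generated rend attributes are never mistaken for superscript delimiters.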

idno (identifier)

contains the identifier of the file which contained the text in the original multi-file version of the Helsinki Corpus.

Attributes

type

categorizes the identifier.

Values:

  • file : indicates that the identifier is a file name

Remarks

Used only within the header.

imprint

groups information relating to the publication or distribution of a bibliographic item.

Remarks

Used only within bibliographic entries in the header.

l (verse line)

contains a single, possibly incomplete, line of verse.

Attributes

part

specifies whether or not the line is metrically complete.

Values:

  • I (initial) : the initial part of an incomplete line
  • M (medial) : a medial part of an incomplete line
  • F (final) : the final part of an incomplete line

Remarks

The l element is used to annotate just the metrical line and does not imply a physical line change; the latter are annotated using the lb element.

Used only within texts.

langUsage (language usage)

describes the principal language and dialect represented by the text.

Remarks

Contains a single language element encoding the principal language variant used in the text, based on the original Helsinki Corpus COCOA header.

Used only within the headers of the corpus texts.

language

characterizes a single language or sublanguage used within a text.

Attributes

ident (identifier)

supplies a language code constructed as defined in BCP 47 which is used to identify the language documented by this element.

Remarks

An informal prose characterization of the language and dialect in question is supplied as content for the element.

Used only within the headers of the corpus texts.

lb (line break)

(empty element) marks the start of a new (typographic) line in the source edition or version of the text.

Remarks

Following TEI convention, lb elements appear at the point in the text where a new line starts. This element is used for marking actual line breaks on a manuscript or printed page, at the point where they occur. Lines containing only annotation (and thus not representing a line in the source edition or manuscript) in the original version of the Helsinki Corpus have not been annotated using this element. Lines of more than 80 characters that were divided onto two lines using the # symbol at the end of the line have been consolidated into a single line. For this reason, the number of lines in the original version does not correspond to the number of annotated lines in the XML version.

Used only within texts.
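
The consolidation of lines divided with the # symbol can be sketched in Python as follows. The function name join_continued and the sample lines are hypothetical, and the sketch assumes that # is the final character of a divided line and that the two parts are joined without inserting any whitespace.

```python
def join_continued(lines):
    """Merge each line ending in '#' with the line that follows it."""
    joined, buffer = [], ""
    for line in lines:
        if line.endswith("#"):
            buffer += line[:-1]   # drop the continuation marker
        else:
            joined.append(buffer + line)
            buffer = ""
    if buffer:                    # trailing continuation with no following line
        joined.append(buffer)
    return joined

merged = join_continued(["abc #", "def", "ghi"])
```

Three input lines yield two consolidated lines here, which illustrates why the line counts of the two versions differ.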

lg (line group)

contains a group of verse lines functioning as a formal unit, e.g. a stanza, refrain, verse paragraph, etc.

Attributes

part

specifies whether or not the line group is fragmented by some other structural element, for example a speech which is divided between two or more line groups.

Default value: N

Values:

  • I (initial) : the initial part of an incomplete line group
  • M (medial) : a medial part of an incomplete line group
  • F (final) : the final part of an incomplete line group

Remarks

Line groups parallel paragraphs in prose texts, and are based on the paragraph encoding of the original Helsinki Corpus.

Used only within texts.

listBibl (citation list)

contains a list of bibliographic entries of any kind (either bibl or biblStruct).

Remarks

Used only within bibliographic entries in the header.

measure

(empty element) indicates the word count of an individual corpus text.

Attributes

quantity

specifies the number of words contained in the text.

Value description: A positive integer with no thousands separator.

unit

indicates the units used for the measurement.

Values:

  • words : indicates that the quantity refers to a number of word tokens

Remarks

The measure element is used exclusively within the extent element in text headers to indicate the number of word tokens contained in the text. A word token is defined here as a whitespace-separated string. This includes symbols separated on both sides by whitespace, such as an ampersand (standing in for and), but excludes punctuation that is not separated by whitespace from the preceding word.

Used only within the header.
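
Under this definition, counting word tokens amounts to splitting the text on whitespace, as in the following minimal Python sketch. The function name count_tokens and the sample sentence are invented for illustration.

```python
def count_tokens(text: str) -> int:
    """Count whitespace-separated strings, the word-token definition used here."""
    # str.split() with no argument splits on any run of whitespace.
    return len(text.split())

tokens = count_tokens("Tom & Ann came late.")
```

The ampersand counts as a token of its own because it is surrounded by whitespace, while the final full stop does not, since it is attached to the preceding word.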

milestone

(empty element) marks a boundary point separating any kind of section of a text, indicating a point at which some part of a standard reference system changes.

Attributes

type

is used to indicate the type of reference system.

Values:

  • Toronto : indicates the record numbering of the Toronto Dictionary of Old English Corpus (encoded by the <R> code in the original version of the corpus)
  • scriptural : indicates the scriptural convention of numbering by chapter and verse (encoded in the original using the <P> code, otherwise used for page numbering)

unit

provides a conventional name for the kind of section changing at this milestone.

Values:

  • record : Toronto Corpus records
  • chapter-verse : scriptural chapters and verses (annotated every 20 verses, i.e. 1, 20, 40...)

n (number)

gives a number (or other label) for the milestone.

Value description: the values contain letters, digits and punctuation characters.

Remarks

The value of the n attribute is the value contained by the <R> or <P> element in the original version of the Helsinki Corpus.

Remarks

Used only within texts.

monogr (monographic level)

contains bibliographic elements describing an item (e.g. a book) published as an independent item (i.e. as a separate physical object).

Remarks

The monogr element only occurs within a biblStruct.

Used only within the header.

name (name, proper noun)

contains a personal name.

Remarks

The name element is used exclusively in respStmt elements for identifying the various people who have participated in the compilation of the original Helsinki Corpus and the production of the XML version.

Used only within the header.

nameLink (name link)

contains a connecting phrase or link used within a name but not regarded as part of it, such as of.

Remarks

The only names annotated for their component parts are those of the original authors of the corpus texts, contained in an author element in the titleStmt within the header of individual corpus texts.

Used only within text headers.

namespace

supplies the formal name of the namespace to which the elements documented by its children belong.

Attributes

name

the full formal name of the namespace concerned.

Remarks

Used only within the header of the corpus itself.

note

contains a note or annotation.

Attributes

type

is used to indicate specific types of notes.

Values:

  • address : indicates an original note endorsed to a letter, containing an address or delivery instructions

resp (responsible party)

indicates the agency responsible for the note.

Value description: A pointer to an element in the text header or the corpus header.

Remarks

The same annotation is used for notes added by the editors of the editions used as the source of the corpus text and notes added by the corpus compilers, the resp attribute being used to distinguish between the two. Corpus compilers are referred to using the xml:id values of the relevant respStmt elements in the corpus header, while the editor of a source edition is referred to by an XPath expression pointing to the editor element(s) of the relevant bibliographic item, e.g. #xpath1(ancestor::TEI//sourceDesc//biblStruct[1]/monogr/editor).

The lack of a resp attribute on a note element means that the note is in the original source.

Remarks

Used both within texts and their TEI headers.
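
The way the resp attribute distinguishes between the three kinds of notes can be sketched in Python as follows. The fragment is hypothetical: the identifier #example_compilers, the note texts and the function name note_origin are invented, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the XPath pointer is taken from the documentation.
SAMPLE = """<div>
  <note>in the source</note>
  <note resp="#xpath1(ancestor::TEI//sourceDesc//biblStruct[1]/monogr/editor)">by the editor</note>
  <note resp="#example_compilers">by the compilers</note>
</div>"""

def note_origin(note):
    """Classify a note by the form of its resp attribute."""
    resp = note.get("resp")
    if resp is None:
        return "original source"        # no resp: the note is in the source
    if resp.startswith("#xpath1("):
        return "source edition editor"  # XPath pointer to editor element(s)
    return "corpus compiler"            # pointer to a respStmt xml:id

origins = [note_origin(n) for n in ET.fromstring(SAMPLE).iter("note")]
```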

notesStmt (notes statement)

collects together any notes providing information about the corpus additional to that recorded in other parts of the bibliographic description.

Remarks

Contains a series of note elements providing additional information about the XML version of the corpus, its documentation and its source, the original version of the Helsinki Corpus.

Used only within the header of the corpus itself.

opener

groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a letter.

Remarks

Used only within text divisions of the type letter.

p (paragraph)

marks paragraphs or paragraph-like divisions in prose.

Attributes

part

specifies whether or not the paragraph is fragmented by some other structural element, for example a speech which is divided between two or more paragraphs.

Default value: N

Values:

  • I (initial) : the initial part of an incomplete paragraph
  • M (medial) : a medial part of an incomplete paragraph
  • F (final) : the final part of an incomplete paragraph

Remarks

The TEI Guidelines (version 1.9) do not allow the part attribute on the p element, although they do allow it on the ab element. The TEI Guidelines have here been extended to allow paragraphs to be incomplete as well; this extension will also be proposed for inclusion in the Guidelines themselves.

Remarks

All paragraphs that were indicated as such in the original Helsinki Corpus, either by three whitespace characters at the head of a line or by a preceding empty line, have been annotated using this element. Sections of text delimited by headings but not explicitly annotated as paragraphs in the original have also been treated as paragraphs in this XML version.

Used both within texts and their TEI headers.

pb (page break)

(empty element) marks the boundary between one page of a text and the next in the source edition.

Attributes

n (number)

gives a number (or other label) for the page beginning at this point.

Value description: the values contain letters, digits and punctuation characters, indicating either a page number, a folio number (with recto and verso indicators) or a combination of a volume identifier and page number.

Remarks

The value of the n attribute is the value contained by the <P> element in the original version of the Helsinki Corpus, except in the case of multi-column texts, where the column denominator has been removed and replaced by a separate cb element.

Remarks

Following the TEI convention, pb elements appear at the start of the page to which they refer. The global n attribute indicates the number or other value associated with the page which follows.

Used only within texts.

placeName

contains a geographic place name defining a place or a region.

Remarks

This element is used only for annotating the place of origin for those Middle English texts for which the Linguistic Atlas of Early Middle English (LAEME) provides a localization.

Used only within text headers.

postscript

contains a postscript to a letter.

Remarks

Used only within text divisions of the type letter, following a closer.

profileDesc (text-profile description)

provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, and the participants and their setting.

Remarks

Contains a textClass element replicating the classificatory part of the Helsinki Corpus COCOA header, along with the dating and language information for the text encoded in a 'TEI-native' format using the creation and langUsage elements.

Used only within the headers of the corpus texts.

projectDesc (project description)

briefly describes the XML conversion project of the Helsinki Corpus and provides reference to information on the original Helsinki Corpus project.

Remarks

Used only within the header of the corpus itself.

pubPlace (publication place)

contains the name of the place where a bibliographic item was published.

Attributes

n (number)

used to number multiple publication places of a bibliographic item.

Value description: the value is an integer, unique to the individual publication place within that biblStruct.

Remarks

Used only within bibliographic entries in the header.

publicationStmt (publication statement)

groups information concerning the publication or distribution of the corpus.

Remarks

The corpus TEI header contains the full publication information of the corpus, while the headers of individual texts contain merely a reference to the whole corpus.

Used only within the header.

publisher

provides the name of the organization responsible for the publication or distribution of a bibliographic item.

Remarks

Used only within bibliographic entries in the header.

ref (reference)

defines a reference to another location, either within the document or outside it.

Attributes

type

is used to distinguish between different types of references.

Values:

  • hyperlink : indicates that the reference is to be treated as a hyperlink pointing outside the document itself
  • bibl : indicates that the reference is to be treated as a bibliographical reference to a bibliographic item defined somewhere in the corpus

target

specifies the destination of the reference by supplying one or more URI References.

Value description: A syntactically valid URI reference with no whitespace (any whitespace is escaped by encoding it as %20).

Remarks

Used only within the header.
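
The whitespace rule for target values can be sketched as follows. This is a minimal illustration, not the tooling used for the conversion: the function names and the example URIs are hypothetical, and the reading that several URI references in one target are separated by whitespace follows general TEI pointer practice rather than anything stated in this documentation.

```python
import re


def escape_target(uri: str) -> str:
    """Encode every whitespace character in a URI reference as %20."""
    return re.sub(r"\s", "%20", uri)


def build_target(uris: list[str]) -> str:
    """Join several URI references into one target value.

    Assumption: references are separated by single spaces, which is why
    whitespace inside an individual reference has to be escaped first.
    """
    return " ".join(escape_target(u) for u in uris)


if __name__ == "__main__":
    # Hypothetical URIs, used only to show the escaping.
    print(build_target(["http://example.org/my file.html", "#entry1"]))
```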

region

contains the name of an administrative unit such as a state, province, or county, larger than a settlement, but smaller than a country.

Remarks

This element is used only for annotating the place of origin for those Middle English texts for which the Linguistic Atlas of Early Middle English (LAEME) provides a localization.

Used only within text headers.

resp (responsibility)

contains a phrase describing an aspect of the production of the XML version of the corpus.

Remarks

Used only within the header.

respStmt (statement of responsibility)

supplies a statement of responsibility for some specific aspect of the production of the XML version of the corpus.

Remarks

Each respStmt contains a single resp element describing a specific responsibility and a list of name elements annotating the names of the people responsible for this aspect of the XML conversion.

Used only within the header.

revisionDesc (revision description)

summarizes the revision history for the corpus file.

Remarks

Contains a series of change elements reflecting various stages of the XML conversion process.

Used only within the header of the corpus itself.

roleName

contains a name component which indicates that the referent has a particular role or position in society, such as an official title or rank.

Remarks

The only names annotated for their component parts are those of the original authors of the corpus texts, contained in an author element in the titleStmt within the header of individual corpus texts.

Used only within text headers.

samplingDecl (sampling declaration)

provides information on the rationale and methods used in sampling texts in the creation of the corpus.

Remarks

Used only within the header of the corpus itself.

series (series information)

contains information about the series in which a book or other bibliographic item has appeared.

Remarks

The series element only occurs within a biblStruct.

Used only within the header.

settlement

contains the name of a settlement such as a city, town, or village.

Remarks

This element is used only for annotating the place of origin for those Middle English texts for which the Linguistic Atlas of Early Middle English (LAEME) provides a localization.

Used only within text headers.

sic (Latin for thus or so)

used to annotate a word or phrase that has been found to be incorrectly or inaccurately transcribed or annotated.

Attributes

resp (responsible party)

indicates the agency responsible for the erroneous transcription or annotation.

Value description: A pointer to an element in the text header or the corpus header.

Remarks

The most common values used for this attribute refer to the compilers of the original version of the corpus, enumerated in the header. In addition, values referring to the edition used as a source of the text also occur.

Remarks

In the corpus, the sic element always occurs within a choice element, together with a corr element indicating the corrected reading.

Used only within texts.
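
Because each sic is paired with a corr inside a choice, a reader of the corpus can select either reading. The sketch below assumes a simplified fragment without the TEI namespace; the sample words, the resp value and the function name reading are invented for illustration and are not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the words and the resp pointer are invented.
SAMPLE = (
    '<p>he <choice><sic resp="#compilers">wss</sic>'
    "<corr>was</corr></choice> there</p>"
)


def reading(elem, prefer="corr"):
    """Return the text of elem, skipping the non-preferred child of each choice."""
    drop = "sic" if prefer == "corr" else "corr"
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != drop:
            parts.append(reading(child, prefer))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(reading(root))          # corrected reading
    print(reading(root, "sic"))   # reading as originally transcribed
```

With real corpus files the element names would need to be qualified with the TEI namespace before a comparison like this works.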

sourceDesc (source description)

describes the source from which the electronic text was derived or generated in the form of a bibliographic description.

Remarks

Used only within the header of the corpus itself.

sp (speech)

An individual speech in a performance text, or a passage presented as such in a prose or verse text.

Remarks

If speaker labels are present, they are contained within the sp element, along with all the p or lg elements that make up the content of the speech.

Semi-automatically annotated for performance texts and other texts containing unambiguously indicated dialogue in the XML version, based on speaker labels and other cues in the original version, supplemented by manual editing of the annotation.

Used only within texts.
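
The structure described above, an optional speaker label followed by the p or lg elements of the speech, can be read back as in the following sketch. The sample markup, the speaker labels and the function name speeches are hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: labels and speech text are invented.
SAMPLE = """<div>
  <sp><speaker rend="type">A.</speaker><p>first speech</p></sp>
  <sp><speaker>B.</speaker><p>second speech</p><p>continued</p></sp>
  <sp><p>unlabelled speech</p></sp>
</div>"""


def speeches(root):
    """Return (speaker label or None, list of speech blocks) for each sp."""
    result = []
    for sp in root.iter("sp"):
        label = sp.find("speaker")
        name = "".join(label.itertext()).strip() if label is not None else None
        # Only p and lg children count as the content of the speech.
        blocks = [
            " ".join("".join(block.itertext()).split())
            for block in sp
            if block.tag in ("p", "lg")
        ]
        result.append((name, blocks))
    return result


if __name__ == "__main__":
    for name, blocks in speeches(ET.fromstring(SAMPLE)):
        print(name, blocks)
```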

speaker

A specialized form of heading or label, giving the name of one or more speakers in a dramatic text or fragment.

Attributes

rend (rendition)

indicates that the speaker label was graphically distinct in the original in some way.

Values:

  • type : indicates that the speaker label was printed using a typeface different from the surrounding text in the original (annotated by (^...^) in the original Helsinki Corpus)

Remarks

The specific typeface used is not indicated in the original version of the Helsinki Corpus.

Remarks

Semi-automatically annotated for performance texts in the XML version, based on type changes and other annotated cues in the original version, supplemented by manual editing of the annotation.

Used only within texts.

sponsor

specifies the name of a sponsoring organization, which provides its intellectual authority to the project.

Remarks

Used only within the header of the corpus itself.

stage (stage direction)

contains any kind of stage direction within a dramatic text.

Attributes

rend (rendition)

indicates that the stage direction was graphically distinct in the original in some way.

Values:

  • type : indicates that the stage direction was printed using a typeface different from the surrounding text in the original (annotated by (^...^) in the original Helsinki Corpus)

Remarks

The specific typeface used is not indicated in the original version of the Helsinki Corpus.

Remarks

Manually annotated for performance texts in the XML version.

Used only within texts.

supplied

signifies text supplied or emended by the editor of the source text, indicated by the Emendation code [{...{] in the original version of the Helsinki Corpus.

Attributes

resp (responsible party)

indicates the agency responsible for the emendation.

Value description: A pointer to an element in the text header or the corpus header.

Remarks

The editor of a source edition is referred to by an XPath expression pointing to the editor element(s) of the relevant bibliographic item, e.g. #xpath1(ancestor::TEI//sourceDesc//biblStruct[1]/monogr/editor).

Remarks

The supplied element replaces the 'Emendation' code [{...{] used in the original version of the Helsinki Corpus and thus inherits its wide scope and semantic ambiguities. As the manual of the original version states, the code is used for encoding a variety of things, including italicized expansions of abbreviations by the editor (unless they "occur repeatedly and frequently throughout the text, in which case they have been left uncoded"); emendations made to the text on the basis of other manuscript versions; and text supplied from other manuscript versions and corrections.

When emendations indicated by italics were encoded by the original compilers of the Helsinki Corpus, the emendation code was used to cover the whole word, a practice that is reflected in this XML version.

In the Old English part of the corpus, individual characters supplied or emended within a word were coded on the level of the whole word (following the practice of the Toronto Corpus), with the whole word enclosed within the brackets, while in the Middle English and Early Modern English parts, only the characters actually supplied or emended were enclosed within the brackets. These practices are also reflected in the use of the supplied element in the XML version of the corpus.

Used only within texts.

surname

contains a family (inherited) name of an author.

Remarks

The only names annotated for their component parts are those of the original authors of the corpus texts, contained in an author element in the titleStmt within the header of individual corpus texts.

Used only within text headers.

tagUsage

supplies information about the usage of a specific element within a text.

Attributes

gi (element name)

the name (generic identifier) of the element indicated by the tag.

Value description: the name of an element

occurs

specifies the number of occurrences of this element in the corpus.

Value description: an integer number greater than zero

Remarks

Used only within the header of the corpus itself.
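
How the occurs values in the corpus header were produced is not documented here. As a hedged sketch, counts of this kind could be derived by tallying element names and emitting one tagUsage element per name; the sample input and the function name tag_usage are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical input, standing in for the encoded corpus.
SAMPLE = "<text><p>one <pb n='2'/>two</p><p>three</p></text>"


def tag_usage(root):
    """Build one tagUsage element per element name found under root.

    The root element itself is included in the tally.
    """
    counts = Counter(el.tag for el in root.iter())
    return [
        ET.Element("tagUsage", gi=gi, occurs=str(n))
        for gi, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    for usage in tag_usage(ET.fromstring(SAMPLE)):
        print(usage.get("gi"), usage.get("occurs"))
```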

tagsDecl (tagging declaration)

provides detailed information about the tagging applied to a document.

Remarks

Used only within the header of the corpus itself.

taxonomy

defines a typology used to classify texts by a structured taxonomy.

Attributes

xml:id (identifier)

provides a unique identifier for the typology so that it can be referred to in the headers of the texts.

Value description: a valid XML identifier.

n (number)

indicates the original Helsinki Corpus reference code corresponding to the particular classification typology.

Value description: The COCOA line identifier (i.e. letter) of the relevant Helsinki Corpus reference code.

Remarks

Contains one category element for each class of the taxonomy, in addition to a brief description of the categorization.

Used only within the header of the corpus itself.

teiCorpus

the root element which contains the whole of the TEI encoded Helsinki Corpus, comprising a single corpus header and a series of TEI elements, each containing a single text with its header.

Attributes

xml:id (identifier)

provides a unique identifier for the entire corpus for reference purposes.

Values:

  • HC_XML

n (number)

indicates the version number of the corpus.

Value description: a decimal format version number (e.g. 1.03).

version

indicates the version of the TEI scheme used for encoding the corpus.

Value description: a TEI version number.

teiHeader (TEI Header)

supplies the descriptive and declarative information for the corpus and for each of its individual texts.

Remarks

Consists of a fileDesc containing the bibliographic information for the file and the text contained by it and a profileDesc containing all of the information describing the linguistic and contextual profile of the text.

The header for each text contains all of the information contained in the original COCOA header, along with the bibliographic information contained in the compiler's note following it, in a structured, TEI conformant format.

text

contains the textual content of a single corpus text.

Remarks

The text element contains the original Helsinki Corpus text, omitting the COCOA header and the following note containing the bibliographic information.

textClass (text classification)

groups information which describes the linguistic and contextual features of a text or a part of text in terms of a formal classification scheme.

Attributes

xml:id (identifier)

provides a unique identifier for the classification so that it can be referred to by a decls attribute in the text to link parts of the text to a specific classification.

Value description: a unique XML identifier consisting of the xml:id value of the text and the word classification, joined by an underscore. In the case of multiple text classifications for a single text, each classification after the first has a running number appended to the identifier with an underscore. In cases where corrections have been made to the values used in the original version of the Helsinki Corpus, the xml:id value of the textClass containing the original values is suffixed with _old.

default

indicates whether this classification is the default classification for the text.

Values:

  • true : This classification is the default one (or the only one) for the text, applying to all parts of the text which do not have a decls attribute referring to an alternative classification
  • false : This classification is not the default one for the text, and only applies to those parts of the text which have a decls attribute referring to its xml:id.

Remarks

The textClass element replicates the classificatory part of the original Helsinki Corpus COCOA header, containing a catRef element for each line of the COCOA header, referencing categories of the taxonomies defined in the corpus header.

In cases where errata corrections have been made to the parameter values of the original version, the default textClass element contains the corrected values, while the original values are preserved for reference in a separate textClass element identified as the 'old' version.

Used only within the headers of the corpus texts.
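
The interplay of the default attribute and the decls attribute can be sketched as a lookup: a part of the text with a decls pointer uses the classification it points to, and everything else falls back to the default. The identifiers in the sample (including the exact form of the running number), the function name classification_for, and the choice to inspect only the element's own decls attribute are assumptions made for illustration; the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Key under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical text with two classifications; the ids are invented.
SAMPLE = """<TEI>
  <teiHeader><profileDesc>
    <textClass xml:id="T1_classification" default="true"/>
    <textClass xml:id="T1_classification_2" default="false"/>
  </profileDesc></teiHeader>
  <text>
    <div n="a"/>
    <div n="b" decls="#T1_classification_2"/>
  </text>
</TEI>"""


def classification_for(tei, part):
    """Return the textClass element that applies to one part of a text."""
    classes = {tc.get(XML_ID): tc for tc in tei.iter("textClass")}
    decls = part.get("decls")
    if decls:
        # Use the first pointer and strip its leading '#'.
        return classes[decls.split()[0].lstrip("#")]
    # No pointer on this element: fall back to the default classification.
    return next(tc for tc in classes.values() if tc.get("default") == "true")


if __name__ == "__main__":
    tei = ET.fromstring(SAMPLE)
    for div in tei.iter("div"):
        print(div.get("n"), classification_for(tei, div).get(XML_ID))
```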

title

contains a title for any kind of work, including the corpus itself, individual corpus texts or sources.

Attributes

key

contains the original Helsinki Corpus identifier or Q-line of the text.

Value description: The value of the <Q> field of the original COCOA header for the text.

Remarks

Used only for the title element contained in the titleStmt of a text.

ref (reference)

provides a valid XML pointer for the title of the text that can be used to refer to a bibliographic database or any other external resource.

Value description: A valid XML pointer made up of the xml:id of the text and the value of the <N> field of the original COCOA header (with all characters lowercased and whitespace replaced by underscores), separated by a full stop.

Remarks

The value does not yet point to any existing XML element, but provides the basis for the future linking of a bibliographic database to the corpus.

Used only for the title element contained in the titleStmt of a text.
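
The construction of the ref value described above can be sketched as a small function. The function name title_ref and the sample inputs are hypothetical, and two details are assumptions not settled by the value description: runs of whitespace are collapsed into a single underscore, and no leading # is added.

```python
import re


def title_ref(text_id: str, n_field: str) -> str:
    """Combine a text's xml:id with its normalized <N> field value."""
    # Lowercase the name and replace whitespace with underscores.
    # Assumption: consecutive whitespace yields one underscore.
    name = re.sub(r"\s+", "_", n_field.strip().lower())
    # The two parts are separated by a full stop.
    return f"{text_id}.{name}"


if __name__ == "__main__":
    # Hypothetical identifier and <N> value.
    print(title_ref("T1", "SAMPLE TEXT NAME"))
```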

n (number)

contains the name of the text in original Helsinki Corpus format, i.e. the value of the <N> field of the original COCOA header.

Value description: The value of the <N> field of the original COCOA header for the text.

Remarks

Used only for the title element contained in the titleStmt of a text.

level

indicates the bibliographic level for a title, that is, whether it identifies an article, book, journal, or series.

Values:

  • a (analytic) : analytic title (article or other item published as part of a larger item)
  • m (monographic) : monographic title (book, collection, or other item published as a distinct item, including single volumes of multi-volume works).
  • j (journal) : journal title
  • s (series) : series title

Remarks

Used only for title elements occurring within bibliographic entries.

Remarks

The title element can also occur without any attributes, merely annotating a stretch of text as the title of a work for formatting purposes.

Used only within the header.

titleStmt (title statement)

groups information about the title of a text and those responsible both for its intellectual content and its inclusion in the corpus.

Remarks

For the entire corpus, provides the title and the responsibility statements for the XML version of the corpus.

For individual corpus texts, provides the title, the author and the compiler of the text.

Used only within the header.

val (value)

contains a single attribute value.

Remarks

Used only in the revisionDesc element of the corpus header for documenting changes made to the corpus.