Shakespeare's Textual Variations

New Insights from Information Theory

Raouf Hamzaoui

Scholars are unsure just what William Shakespeare wrote. We now know that plays published under his name contain contributions from other dramatists, and that he had a hand in others' plays. Moreover, half of Shakespeare's plays are known to us via multiple early versions whose differences might reflect revision of the play by Shakespeare and/or someone else, or censorship, or corruption of the text in scribal and print transmission.

This project is funded by the Academies Partnership in Supporting Excellence in Cross-Disciplinary Research (APEX) scheme, grant APX\R1\241032. Two researchers at De Montfort University, Professor Gabriel Egan (expert in Shakespeare) and Professor Raouf Hamzaoui (expert in Information Theory), will collaboratively explore the differences between the early editions of Shakespeare using new information-theoretic techniques that shed light on literary style, habits of revision, censorship, and textual corruption in ways not previously possible. This work is timely as the full set of plays (Shakespeare's and other writers') has only recently become available to investigators as large numbers of well-curated digital texts.

Research Context

With one minor exception, Shakespeare left us none of his dramatic manuscripts: we know his plays only via the printed editions made in his lifetime (1564-1616) and shortly after. For half of Shakespeare's plays we have multiple early editions that differ for poorly understood reasons. During Shakespeare's lifetime, his plays were printed in relatively cheap one-volume-per-play books we call quartos, named for how the printed sheets are folded. In 1623 there appeared a collected-plays edition known as the First Folio, again named for how the printed sheets are folded, containing 36 plays, the core of Shakespeare's canon.

Several plays, including Titus Andronicus and three plays about King Henry VI, appear to have been co-written with at least one other dramatist, adding further complexity. Possible explanations for textual variation between early versions include: corruption in transmission by handwriting and printing; censorship; and conscious artistic revision. Computational stylistics has given us style-of-writing profiles for Shakespeare and most of his contemporary dramatists and detected non-authorial revision by the resultant changes in style across a work. The latest Oxford University Press complete works finds Thomas Middleton's style present in the 'Fly Scene' found only in Folio Titus Andronicus and in the expansion of the gulling of Paroles in All's Well that Ends Well; both apparently are revisions made for revival of the plays after Shakespeare's death.

Information Theory techniques offer the possibility to take these distinctions further, in particular by distinguishing interventions such as censorship that focus on particular semantic categories (religious or sexual terms) and textual corruption. Play scripts had to be examined and licensed by the state censor before they could be performed in public, with rules tightening in 1606, and additional censorship applied before a play could be printed; we also know that certain printers applied their own rules about acceptable language. Some verbal substitutions are detectable by eye, such as 'by heaven' for 'by God', but a systematic search for even these simple cases has yet to be undertaken. The project will need to go beyond applying the currently known techniques of Information Theory: it will attempt to generate wholly new techniques specifically to engage with what literary studies tell us about the nature of linguistic creativity.
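A systematic search for simple substitutions of the 'by heaven' for 'by God' kind could begin by aligning two versions of a passage word by word and listing the places where one run of words replaces another. The following is a minimal sketch of that idea using Python's standard difflib module; it is not the project's method, the function name is hypothetical, and the two sample lines are invented for illustration rather than quoted from any edition.

```python
from difflib import SequenceMatcher

def substitutions(version_a, version_b):
    """Return (words_in_a, words_in_b) pairs where b replaces words of a."""
    a = version_a.lower().split()
    b = version_b.lower().split()
    # autojunk=False stops difflib from ignoring very frequent words.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs

# Invented lines for illustration, not quotations from any edition.
quarto_line = "by god i will not yield"
folio_line = "by heaven i will not yield"
print(substitutions(quarto_line, folio_line))
```

Run over whole plays, a listing like this would still need sorting by semantic category (religious terms, sexual terms, and so on) before it could say anything about censorship as opposed to corruption.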

Research Methods

The Nature of the Data

The first thing we need for this work is a set of digital texts of Shakespeare's plays. Many are freely available on the World Wide Web, but most are unsuitable for our work because they are based on printed editions made in the nineteenth century. There is nothing intrinsically wrong with nineteenth-century editions -- they give very readable versions for most purposes -- but they inevitably contain interventions in the writing that nineteenth-century editors felt they ought to make. For instance, they tend to invent stage directions that help the reader to imagine the location in which a scene is set, such as "Elsinore. The battlements late at night". Such locutions were almost never used by professional dramatists in Shakespeare's time. Nineteenth-century editions also tend to mix modernized spellings of certain words (so 'goe' and 'doe' from an early edition become 'go' and 'do' in the modern edition) with unmodernized forms of other words (so that 'murther' for 'murder' and 'apricocks' for 'apricots' are retained). Editions made today are more consistent and more attuned to the forms of early modern theatrical scripts than nineteenth-century ones.

It might seem that the obvious thing for us to do is use digital transcriptions of the early editions, those made in Shakespeare's lifetime and shortly after. There certainly are examples of these on the World Wide Web, the most complete and accurate being the exceptionally good transcriptions of the quartos and Folio provided by the project Internet Shakespeare Editions. But there are problems using the early editions directly for our kind of work. If we are counting the frequencies of particular words we do not want our computer to separate out the counts for the spellings 'door' and 'doore' (both of which appear in the early editions) merely because one has an 'e' on the end: these are just two equally acceptable spellings of the same word.

Furthermore, unless we take special steps, computers will just count strings of letters and not 'words' as linguists understand that term. Consider the letters 'r', 'o', and 'w' coming together to form the string 'row'. That string of three letters is the spelling for a number of different words in English: it is the spelling for the verb meaning to propel a boat, for the noun meaning the opposite of a column, for the verb meaning to argue, and for the noun meaning an argument. These words are known as 'homographs': different words with the same spelling. Ideally, we would want to count each occurrence of the string 'row' in Shakespeare according to which of those words it stands for at that point in each play.

What we really want for our work is a digital text of Shakespeare in which the forms and spellings of the early editions (those from Shakespeare's lifetime and just after) are present and, if they are wanted, so are the modernized and regularized forms of the same words. We want the best of both worlds. Fortunately this has now been created by Hugh Craig and his team at the Centre for Literary and Linguistic Computing (CLLC) at the University of Newcastle in Australia. Using eXtensible Markup Language (XML), Craig's team have created digital texts of hundreds of early modern plays -- including all of Shakespeare's in multiple early editions -- in which the additional 'markup' (or 'tagging') shows the differences between old and new forms. For instance, consider this fragment from the play Thomas Lord Cromwell:

take my <reg orig="afternoones">afternoon's</reg> nap

The pair of tags around the modernized form "afternoon's" record that in the 1602 edition of the play from which this transcription is taken the form is "afternoones". Consider this fragment from the same play:

my old master <seg type="homograph" subtype="verb">will</seg> be stirring

Here, the tagging shows that "will" is a homograph (that is, there are multiple different words that have this spelling) and that this particular occurrence at this point in this play is a verb. Where did this tagging come from? Most of it was created by automated processes using the software tool 'Variant Detector (VARD)' created by the linguists at the University of Lancaster, which regularizes spellings, and the software tool 'MorphAdorner' created by the Northwestern University Information Technology (NUIT) research group, which determines which part-of-speech (here, a verb) a particular word occurrence belongs to.

It is a trivial technical matter to extract from such an XML-encoded transcription either the modernized version of the words of a play (just take the strings between the <reg>...</reg> tags) or the original-spelling words of a play (just take the value of the @orig attribute given in the opening tag of the <reg> element). Likewise it is trivial to extract for each tagged word the part of speech it belongs to: just read the @type and @subtype attributes from the opening tag of the <seg> element.
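As an illustration of how little code this takes, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It works on the two fragments quoted above, each wrapped in a hypothetical <l> element so that it parses as a well-formed document; the function names are ours for illustration, and the real files may nest these elements in ways this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# The two fragments quoted above, each wrapped in a hypothetical <l>
# element (an assumption) so that it is well-formed XML on its own.
FRAGMENTS = [
    '<l>take my <reg orig="afternoones">afternoon\'s</reg> nap</l>',
    '<l>my old master <seg type="homograph" subtype="verb">will</seg>'
    ' be stirring</l>',
]

def text_of(element, original_spelling=False):
    """Flatten an element to text in modernized or original spelling."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "reg" and original_spelling:
            # Use the early edition's form recorded in @orig.
            parts.append(child.get("orig", ""))
        else:
            parts.append(text_of(child, original_spelling))
        # Text following the child's closing tag belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)

def homographs(element):
    """List (word, part of speech) pairs for every tagged homograph."""
    return [
        (seg.text, seg.get("subtype"))
        for seg in element.iter("seg")
        if seg.get("type") == "homograph"
    ]

for fragment in FRAGMENTS:
    line = ET.fromstring(fragment)
    print(text_of(line))
    print(text_of(line, original_spelling=True))
    print(homographs(line))
```

Counting (word, part of speech) pairs rather than bare strings is what lets the verb 'will' be tallied separately from the other words that share its spelling.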

Our Data You Can Download

We have only just begun our project and so far we have made tentative experiments on the quarto and Folio editions of Shakespeare's plays, using some simple information-theoretic measures to get a sense of how different or alike they seem by these measures. The files we have worked on so far are:
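To give a flavour of what a simple information-theoretic measure looks like in practice, the sketch below computes the Shannon entropy of a word-frequency distribution and the Jensen-Shannon divergence between two such distributions. The choice of these two measures is our assumption for illustration only, not a statement of which measures the project uses, and the word lists are invented rather than taken from any edition.

```python
from collections import Counter
from math import log2

def distribution(words):
    """Relative frequency of each word in a list of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def entropy(p):
    """Shannon entropy, in bits, of a distribution given as a dict."""
    return -sum(prob * log2(prob) for prob in p.values() if prob > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: 0 if identical, 1 if disjoint."""
    vocabulary = set(p) | set(q)
    # m is the midpoint distribution between p and q.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocabulary}
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

# Invented word lists for illustration, not taken from any edition.
quarto = "by god i will not yield".split()
folio = "by heaven i will not yield".split()
p, q = distribution(quarto), distribution(folio)
print(entropy(p), entropy(q), jensen_shannon(p, q))
```

A divergence near zero means the two versions use their vocabulary in nearly the same proportions; larger values indicate that the versions differ in which words they use or how often.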

Hugh-Craig-318-plays-in-XML.zip
This is a set of 318 early modern editions of 288 early modern plays by 77 authors kindly supplied to us by Hugh Craig. Where more than one author formed a team to write a play, we count each team once so that 'Dekker and Webster' counts as one author and 'Dekker' on his own counts as another author. By far the greatest number of editions is for plays by Shakespeare, either on his own or in a team, because for many of his plays this set provides multiple editions, being one or more quartos and the Folio edition. Thus this set contains 63 editions of 39 Shakespeare plays. The only other authors for whom this set contains multiple editions of one play are these:

AUTHOR: Ben Jonson
TITLE: Every Man in His Humour
EDITIONS: 1601 quarto; 1616 Folio

AUTHOR: Ben Jonson
TITLE: Poetaster
EDITIONS: 1602 quarto; 1616 Folio

AUTHOR: Thomas Kyd
TITLE: The Spanish Tragedy
EDITIONS: 1592 quarto; 1602 quarto

AUTHOR: Christopher Marlowe
TITLE: Doctor Faustus
EDITIONS: 1604 quarto; 1616 quarto

AUTHOR: Thomas Middleton
TITLE: A Game at Chess
EDITIONS: 1625 third quarto; Trinity Manuscript

AUTHOR: Uncertain
TITLE: Mucedorus
EDITIONS: 1598 quarto; 1610 quarto

Shakespeare-Folio-only-plays-in-XML-corrected-by-GE.zip
This is a subset of Hugh Craig's set of 318 early modern play editions containing Shakespeare's plays that were first published in the 1623 Folio. Gabriel Egan made some minor corrections to the XML from Craig's set.

Shakespeare-Quartos-and-their-Folio-counterparts-in-XML-corrected-by-GE.zip
This is the subset of Hugh Craig's set of 318 early modern play editions containing Shakespeare's plays that were first published in quarto form before the publication of the 1623 Folio. Gabriel Egan made some minor corrections to the XML from Craig's set. This subset does not include the 1634 quarto of The Two Noble Kinsmen.

Ours is a publicly funded project so everything we make is free for anyone else to download. Hugh Craig has kindly made his set of XML transcriptions of plays available under a CC BY 4.0 license.

Our Software You Can Download

For our first experiments we wanted just the modernized forms of the words in the Shakespeare play quarto and Folio editions, omitting everything else (so leaving out the stage directions and the speech prefixes). The following script does that for us:

strip-XML.py
This is a Python script written by Gabriel Egan that inputs any of the files from Hugh Craig's set of 318 files and outputs just the words of dialogue (so, not stage directions or speech prefixes) in their modernized forms.
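For readers who want a sense of the approach before downloading, the following is a minimal sketch of dialogue-only extraction. It is not strip-XML.py itself. The element names <sp>, <speaker>, <stage> and <l> are hypothetical, chosen here for illustration, and the files in Hugh Craig's set may mark speech prefixes and stage directions differently. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: hypothetical element names for speech prefixes and stage
# directions; the real files may use different names.
SKIPPED_TAGS = {"stage", "speaker"}

def dialogue_text(element):
    """Collect modernized dialogue text, omitting skipped elements."""
    parts = []
    if element.tag not in SKIPPED_TAGS:
        parts.append(element.text or "")
        for child in element:
            parts.append(dialogue_text(child))
            # Text after a child's closing tag belongs to this element,
            # so it is kept even when the child itself is skipped.
            parts.append(child.tail or "")
    return "".join(parts)

# Invented sample markup for illustration only.
SAMPLE = (
    "<sp><speaker>Servant</speaker> "
    '<l>take my <reg orig="afternoones">afternoon\'s</reg> nap</l> '
    "<stage>Exit</stage></sp>"
)
words = dialogue_text(ET.fromstring(SAMPLE)).split()
print(words)
```

Because the text inside each <reg> element is already the modernized form, simply ignoring the @orig attribute yields modernized dialogue.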

Our Funders

We gratefully acknowledge support from our employer De Montfort University and (via APEX award APX\R1\241032) the Royal Society, the British Academy, the Royal Academy of Engineering, and the Leverhulme Trust.