Wednesday, January 9, 2013

Whose Lexicon? The Impact of Reporters and Editors on the Old Bailey Proceedings

A few weeks ago I was discussing early modern vocabulary with Tim Hitchcock (as one does on a Wednesday evening). If I recall correctly, he felt that new words likely appeared at roughly the same rate as old words disappeared from the language. In essence, we're not getting a bigger vocabulary, we're just using an ever-shifting one. Ben Schmidt's blog posts on "Age Cohort and Vocabulary Use" and "Predicting Publication Date and Generational Vocabulary Shift" would tend to support this idea. The basis for Schmidt's article was an analysis of the age of authors and how their age impacted the way they used words in 19th century literature. Schmidt found that people learn to use language in a certain way in their youth and tend not to change those patterns very much as they age. This accounts for the slightly different languages whippersnappers and their grandparents speak, even to this day.

I decided to see what I could find out about vocabulary use by digging into the Old Bailey Online (OBO). As many of you undoubtedly know, the OBO is a wonderful corpus of electronic text for anyone interested in Early Modern London. The OBO is an electronic XML version of the Proceedings of the Old Bailey, an abridged transcription of what was said in court for each case held in the Old Bailey courtroom between 1674 and 1914. What we have is not an exact facsimile of every word spoken, but what Magnus Huber believes is “guided by” what was said in court, capturing the ideas, if not always the exact words of the speaker. Though not a perfect transcription of speech, Clive Emsley believes that we can put our faith in the events described in the Proceedings because “the Old Bailey Courthouse was a public place, with numerous spectators, and the reputation of the Proceedings would have quickly suffered if the accounts had been unreliable”. 

Originally intended as a profit-making venture, entertaining the masses with tales of woe in the courtroom, the Proceedings became the official record in 1778 and were required to present a “true, fair and perfect narrative”. Practical limits of course made this difficult. The Proceedings were created entirely without electronic recording devices by shorthand reporters. Many trials therefore appear in significantly condensed form, such as the six-hour trial of Charles Stokes and company from 1787 that is recorded in only 468 words in the published version. Unfortunately, I do not have Huber's annotated version of the OBO corpus. I do, however, have the full set of XML files, downloaded from the Old Bailey Online website (to learn how to do this check out the Programming Historian 2). I decided to focus on the records between the beginning in 1674 and the foundation of the Central Criminal Court in 1834. That gave me 161 years' worth of early modern transcripts for just over 100,000 trials and 51 million words.

What I found was that even with such a wonderful resource we cannot be sure what people said to one another and how closely those speech events relate to the written records we have left. None of us knows how to speak like an eighteenth-century Englishman, and based on my digging I don't think that the OBO can teach us how. That's because the vocabulary used in the OBO is the vocabulary of the people who recorded and edited the trial transcript and not of a wider communal lexicon. Instead of looking into vocabulary, I quickly realized I needed to be thinking about vocabularies. When it comes to the original assumption (we're not getting a bigger vocabulary, we're just using an ever-shifting one), the real question is: who is "we"?

Before coming to this conclusion, I looked at the rate of unique words entering the corpus over time. When I calculated this there were more than 120,000 unique words in the corpus, the introduction of which can be seen in Figure 1, showing a surprisingly linear increase. I should note that I'm defining a word as a unique string of characters. "word" and "words" are two different words for my purposes - that is, the corpus was not lemmatized.
Figure 1: The size of the lexicon in the Old Bailey Online
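
For readers who want to try a count like this themselves, here is a minimal sketch of how a running lexicon size could be computed. It is not the code behind Figure 1. The sample sessions are hypothetical, and lowercasing the text before counting is an assumption of the sketch; the only thing taken from the post is the definition of a word as a unique string of characters, with no lemmatization.

```python
import re

# Hypothetical sample data: (session label, transcript text) pairs in date order.
sessions = [
    ("session 1", "the prisoner was indicted for burghlary"),
    ("session 2", "the prisoner was indicted for theft"),
    ("session 3", "the souldiers gave their accompts"),
]

def tokenize(text):
    """Split text into lowercase runs of letters; no lemmatization is applied."""
    return re.findall(r"[a-z]+", text.lower())

def lexicon_growth(sessions):
    """Return (label, new word count, cumulative lexicon size) for each session."""
    seen = set()
    rows = []
    for label, text in sessions:
        # A word is "new" the first time its exact string appears in the corpus.
        new_words = set(tokenize(text)) - seen
        seen |= new_words
        rows.append((label, len(new_words), len(seen)))
    return rows

growth = lexicon_growth(sessions)
for row in growth:
    print(row)
```

Plotting the third value of each row against the session date would give a curve of the same kind as Figure 1, and the second value gives the number of new words per session.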
Since I work on immigration to London this got me wondering if what we're actually seeing is not necessarily a growth in the vocabulary, but a diversification of names present in the metropolis. Trial transcripts typically discuss people, so we do get a lot of names popping up. New names can come from new people arriving in the city with unique names that no one else had, or when someone pronounces their name with a thick enough accent that a phonetic variation appears in the corpus as a new word (Callaghan, Calaghan, Callagan, Colligan, Calahan, Callaham, Callahan, Calleghan). Early modern parents were still largely not very adventurous with their names so we don't see a lot of children named "Apple" or "Harper Seven" and so creativity is not likely going to be a big factor driving the growth of new names. The OBO XML tags make this relatively easy to test. The "persName" tags allowed me to identify and extract all words used to describe people. As it happened, there were roughly 60,000 such names - about half of the entire lexicon.
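
As a rough illustration of the persName step, the sketch below pulls every word found inside a persName element using Python's standard library. The XML fragment is a hypothetical, simplified stand-in for a real trial account, and the sketch assumes the files use no XML namespace; real OBO files are more heavily tagged.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, loosely modelled on a tagged trial account.
sample = """<div1 type="trialAccount">
  <p><persName>Mary Callaghan</persName> was indicted for stealing a watch from
  <persName>John Green</persName>.</p>
  <p><persName>Ann Calaghan</persName> deposed that she saw the prisoner.</p>
</div1>"""

def name_words(xml_text):
    """Collect every word that occurs inside a persName element."""
    root = ET.fromstring(xml_text)
    words = set()
    for element in root.iter("persName"):
        # itertext() also picks up text nested inside child elements.
        text = "".join(element.itertext())
        words.update(word.lower() for word in text.split())
    return words

names = name_words(sample)
print(sorted(names))
```

Note that "Callaghan" and "Calaghan" come out as two separate entries, which is exactly the phonetic-variation effect described above.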

I also decided to check if the new words represented use of English words, or if they were archaic spellings, acronyms, or otherwise misspelled words, so I ran the entire set through a series of four English-language dictionaries to see if we were dealing with recognizable words. These dictionaries included 90,000 unique "words", including lemmatized variations. Each word could therefore be a "dictionary word", a "name word", a combination of both, or neither.
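
The four-way split can be expressed as two set-membership checks. This is only a sketch: the tiny word lists are hypothetical placeholders for the merged dictionaries and the extracted name list, and the category labels are chosen for the example.

```python
def categorize(word, dictionary_words, name_words):
    """Place a word in one of four categories using two membership tests."""
    in_dictionary = word in dictionary_words
    in_names = word in name_words
    if in_dictionary and in_names:
        return "both"
    if in_dictionary:
        return "dictionary"
    if in_names:
        return "name"
    return "neither"

# Hypothetical stand-ins for the real lists.
dictionary_words = {"green", "account", "burglary", "soldier"}
name_words = {"green", "callaghan", "hodgson"}

lexicon = ["green", "burglary", "callaghan", "burghlary"]
categories = {word: categorize(word, dictionary_words, name_words) for word in lexicon}
print(categories)
```

Because the check is an exact string match against a modern word list, an older spelling such as "burghlary" lands in the "neither" category even though it expresses the same idea as "burglary".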

The results can be seen in Figure 2, which shows what category the new words fall into, graphed over time along the same scale on the y-axis.

Figure 2: The size of the lexicon at each trial session, broken down by the category of word
The results are, I think, interesting. I'll discuss the "Names List" and the "No List" entries in future posts. The bottom left graph is words that appear both as names and as words in the dictionary. This includes words such as "green", which can be both a person's name and a word describing the colour of something. I've decided not to disambiguate between the two on a word-by-word basis for time reasons.

For now, I'd like to focus on the dictionary terms. Dictionary words seem to be on the rise throughout the eighteenth century. The reason for this may in fact be a slight growth in the lexicon, but I think more likely is a tendency towards increased standardization in the spelling of English words. Samuel Johnson's Dictionary appears in print for the first time in 1755, so this is the age of standardization in lexicography. As anyone who has ever tried to read a seventeenth century text knows, people spelled (spelt?) words differently back then. Over time the "accompts" of criminal activity transform into "accounts". People stop committing "burghlary" and are instead charged with "burglary". This of course occurs shortly before the last "souldiers" are sent off to war.

My four "dictionaries" were built for modern use rather than seventeenth- or eighteenth-century vocabulary, which means the figures above are a better indicator of when standardized spelling of English words was adopted than they are measures of the lexicon. "Burghlary" should really be counted as a variation of "burglary" rather than as a unique word in the "no list" category. A linguist would tell me that I should have lemmatized my corpus.

But don't despair, not all is lost. I think we still can learn a thing or two about the lexicon as well as the OBO records themselves from this analysis. In Figure 3 you can see the number of new dictionary words per year introduced into the corpus.

Figure 3: The number of new words per year introduced into the OBO corpus.

The number of new terms is highest in the early years. This makes perfect sense as the corpus size starts at zero on the date of the first trial. Before a word appears in the corpus someone has to use it, and that takes time. The big peak in 1689 is an anomaly caused by a single account that was much longer than typically found at the time. Most big peaks can be traced to particularly long accounts, since these generally represent a reporter going into much more detail and therefore using a wider vocabulary. The dip in the early years of the 18th century represents a series of particularly short accounts, as well as some missing accounts. Where it all gets interesting for me is at the first arrow around 1715.

What we see at this first mark is a fairly high number of new dictionary words appearing each year until about the 1740s. The number of new words around 1715 may be artificially high, since we do seem to be missing entries in the previous decade and presumably some of those new words would have appeared earlier if we had the records. Nevertheless, there do appear to be more new words than normal in the following two decades. Perhaps surprisingly, the publication of Johnson's Dictionary at the second arrow marker is actually a low point for new word growth in the middle of the eighteenth century. This to me suggests that Johnson was in many respects responding to generally accepted norms of spelling and word use rather than driving the adoption of such uses.

The last arrow is for me the most interesting. The number of new words per year again increases significantly just after 1778, which as I mentioned above was the date that the Proceedings of the Old Bailey became a "true, fair and perfect narrative" - an official record of courtroom activity. Perhaps this shift from a popular to an official record meant a significant change in what went into a trial account. That does seem to be part of the answer. The length of the Proceedings does slowly start to increase, starting in 1783 when the graph jumps upwards.

The fact that there are spikes in new words every time a long transcript appears reinforces the fact that the Proceedings are using a specific vocabulary - one related to criminal justice - as opposed to a vocabulary that's representative of the entire active English lexicon. But the spikes in new words, as well as the number of words in a trial transcript, can tell us even more, if we look at who was writing those words down.

As mentioned previously, Magnus Huber believes the Proceedings are "guided by" what was said in court. Before a spoken word appeared in the Proceedings the original speech event was converted to shorthand by a courtroom reporter and was then converted back to prose by the workers in the print shop before being committed to paper. Words represent the attempt of one person to communicate an idea to another. As above, "burghlary" and "burglary" refer to the same idea. The difference between the two spellings is merely a choice in how to record the sounds using letters.

So what effect does changing the scribe have on the rate of new dictionary words entering the corpus? Apparently, quite a lot. I've located the names of the scribes from 1749 onwards in Huber's article, "The Old Bailey Proceedings, 1674-1834: Evaluating and annotating a corpus of 18th- and 19th-century spoken English". When we look at the number of new dictionary words each scribe introduces into the corpus (Figure 4), we see it's certainly not even across the board.

Figure 4: The number of new dictionary words per session added to the corpus, coloured by courtroom reporter.
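
A minimal sketch of the kind of tally behind a graph like this: for each session, count the dictionary words that have never appeared in the corpus before and credit them to whoever was reporting. The session contents and the small dictionary are hypothetical; only the two reporter names come from the post, and this is not the code that produced Figure 4.

```python
from collections import defaultdict

# Hypothetical sessions in date order: (session number, reporter, words used).
sessions = [
    (1, "Blanchard", {"watch", "stole", "prisoner"}),
    (2, "Blanchard", {"watch", "prisoner", "guilty"}),
    (3, "Hodgson", {"prisoner", "larceny", "verdict", "witness"}),
    (4, "Hodgson", {"witness", "counsel", "burghlary", "indictment"}),
]

# Hypothetical stand-in for the merged modern dictionaries.
dictionary_words = {
    "watch", "stole", "prisoner", "guilty", "larceny",
    "verdict", "witness", "counsel", "indictment",
}

def new_words_by_reporter(sessions, dictionary_words):
    """Map each reporter to a list of new-dictionary-word counts, one per session."""
    seen = set()
    per_reporter = defaultdict(list)
    for _number, reporter, words in sessions:
        # Only dictionary words not yet seen anywhere in the corpus count as new.
        new_words = (words & dictionary_words) - seen
        seen |= new_words
        per_reporter[reporter].append(len(new_words))
    return dict(per_reporter)

counts = new_words_by_reporter(sessions, dictionary_words)
averages = {name: sum(values) / len(values) for name, values in counts.items()}
print(counts)
print(averages)
```

One caveat with any tally of this sort is that the counts depend on order: a word can only be new once, so later reporters have a smaller pool of unused words to draw on than earlier ones.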

I recognize there's a lot going on there, so I'll break this down into chunks that are easier to see. What I'd like to draw your attention to is the fact that some scribes increase the size of the corpus significantly, and others do not.

Firstly, let's look at Hodgson, who was holding the transcriber's pen from 1782 to 1792 (Figure 5).

Figure 5: The number of new dictionary words in the Old Bailey corpus from 1781 to 1795, coloured by courtroom reporter.

Though W Blanchard was only on the job for a few months, it's quite clear that E Hodgson was on average adding more new words to the corpus each month than had his predecessor. The number of new words Hodgson adds starts slowly, but then picks up rather dramatically before tailing off towards the end of his tenure. Hodgson's "reign" so to speak also overlaps with the significant growth in the size of the Proceedings mentioned above. From the graph and the trendlines, we might make the following conclusions:
  • Hodgson was verbose and reported more than his predecessors
  • He had a larger than average vocabulary that he happily shared
  • After a few years he "used up" his vocabulary and ceased to find as many new words
Hodgson has therefore influenced both the vocabulary used in the Proceedings as well as the length of the resultant documents. We would be naive therefore to suggest that Hodgson was an impartial observer or that the Proceedings during this period are anything but the output of Hodgson himself. The records were not "created"; they were created by Hodgson. The way Hodgson created the records was distinct from how the others did so.

Moving forward slightly in time, let's consider the next three scribes who wrote between 1792 and 1801 (Figure 6).

Figure 6: The number of new dictionary words in the Old Bailey corpus from 1792 to 1801, coloured by courtroom reporter.

The effect of different writers here is perhaps more obvious. Silby doesn't seem to be one for new words, whereas Marson and Ramsay in the blue definitely are. The fact that Ramsay stays on afterwards and the growth of the vocabulary is stunted thereafter suggests that it was Marson driving this change. It's becoming clear that anyone working with these records should be wary of who was responsible for creating them in the first place.

The last section I'd like to highlight is the final one from 1816 to 1834 when a single scribe named H Buckler was on the job. However, Buckler worked under a series of editors as can be seen in Figure 7.

Figure 7: The number of new dictionary words in the Old Bailey corpus from 1816 to 1834, coloured by editor.
In this case, the scribe stays the same yet we still see patterns that seem to make more sense when we know there's a different editor publishing the Proceedings. Clearly when Stokes takes over in 1828 there's a change to the resultant lexicon that spikes up, presumably after he had become confident in his new role after a few months on the job. This set is particularly interesting because from what we can tell, the scribe H Buckler doesn't seem to be the one driving the adoption of new words. From 1816 to 1828 he's one of the less ambitious in this category, but that begins to change as new editors take over.


How does this all tie back with my original questions about vocabulary in the early modern era? Well, first of all, I failed to test what I set out to understand. Because I did not lemmatize my corpus I was not able to determine if we do in fact have a growing vocabulary or a shifting one. What I should have looked at was a moving window of word use, calculating how many words were used in a given ten-year period. I'll leave that for another day or another researcher to take on if they feel so inclined. My suspicion is that we have both a growing and shifting vocabulary. Words are falling into disuse or at least out of regular use. I imagine that Ben Schmidt's analysis explains most of that shift: young kids don't learn - or at least don't use - the same words as their parents or grandparents. Words die one funeral at a time.
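
For anyone inclined to take up the moving-window idea, here is one possible sketch. It is untested against the real corpus: the years and words are hypothetical, and the ten-year window and one-year step are simply the values suggested above.

```python
# Hypothetical input: year -> set of words attested in that year's trials.
words_by_year = {
    1700: {"accompts", "souldiers", "prisoner"},
    1705: {"prisoner", "accompts"},
    1712: {"prisoner", "accounts"},
    1718: {"prisoner", "accounts", "soldiers"},
}

def active_vocabulary(words_by_year, start, window=10):
    """Union of the words used in the years start .. start + window - 1."""
    active = set()
    for year in range(start, start + window):
        active |= words_by_year.get(year, set())
    return active

def window_sizes(words_by_year, window=10, step=1):
    """Size of the active vocabulary for every full window in the data."""
    first, last = min(words_by_year), max(words_by_year)
    return {
        start: len(active_vocabulary(words_by_year, start, window))
        for start in range(first, last - window + 2, step)
    }

sizes = window_sizes(words_by_year)
early = active_vocabulary(words_by_year, 1700)
late = active_vocabulary(words_by_year, 1709)
print(sizes[1700], sizes[1709])
print("dropped out:", sorted(early - late))
print("entered:", sorted(late - early))
```

A flat window size combined with non-empty "dropped out" and "entered" sets would point to a shifting vocabulary, while a rising window size would point to a growing one. As the toy data shows, spelling variants would need to be paired with their modern forms first, otherwise a change in spelling looks like a change in vocabulary.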
The reason I wasn't able to see a shifting vocabulary using the OBO corpus is that the OBO does not represent a single person's vocabulary. Instead, it roughly represents the combined vocabulary of people who appear one way or another in the Old Bailey courtroom to discuss matters of law and justice, filtered through a courtroom reporter and an editor. As I discovered, those last two have a much bigger impact on the corpus than we might have liked to imagine. For anyone who looks at the language of the court or indeed the format of the transcripts by using the OBO corpus, I'd urge you to keep the impact of those reporters and editors in mind. For anyone studying academic history, do note that what seems like a window into the past is in fact the product of a few individuals who made decisions they may not even have been aware of that impacted what was recorded, what was not, and what words they used to do so.


Ben Schmidt said...

Very interesting stuff. As an overall rule for the initial question, I'd have to say that vocabulary does expand over time even if individual vocabs stay the same, b/c of specialization—this is basically what the Culturomics paper in Science found when it looked at vocab size, and it makes sense intuitively ('lemmatized' and 'collocation' are some pretty goofy words to know).

I think you're exactly right to be looking at the influence of particular editorial regimes--one of the beautiful things about historians putting together corpuses in addition to the linguistics ones we already have is that they (we) tend to put together corpuses where it's really possible to use the language as a lens into a particular institution, rather than the other way around.

Very much looking forward to the names list post, I've been thinking a lot about names recently.

Two comments: I think the yearly resolution makes almost any pattern possible—would be curious if the intuition that the new editors add new words held up comparing all the paired results for first five years of a new editor/scribe to the last five years, say. And don't be too apologetic about not lemmatizing—the tools out there are imperfect (they won't handle early modern spelling, eg), and the debut of new forms is often indicative of new meanings (verbal forms of nouns, for example).

Adam Crymble said...

Thanks for the comment Ben. Now that I think about it I think you're right about the lemmatizing. It would be nice though to be able to pair early modern spellings like Burghlary with their modern equivalent, since those do represent the same idea.

Regarding the resolution and patterns, I like your suggestion of looking for statistical patterns. However, most scribes have very short tenures - many a year or less - so it's tough to compare them to each other beyond an impressionistic look at the graphed data. The Old Bailey records were printed roughly 8 times per year, so the number of datapoints per scribe is typically quite low. There are also surely confounding factors that I'd need to take into account such as new laws coming into force which brought with them new vocabularies. In those cases it's not the scribe, but the law that dictates what words were used to describe the event. It's something I'll look into though.
