Showing posts with label Old Bailey Online. Show all posts

Sunday, February 10, 2013

Identifying and Fixing Transcription Errors in Large Corpuses

"Underwood 11 Typewriter", by Alex Kerhead.
This is the third post in my series on the Old Bailey Online (OBO) corpus. In previous posts I looked at the impact of courtroom reporters and editors on the vocabulary used in the Old Bailey trial transcripts, and at ways of measuring the diversity of immigration in London between the 1680s and 1830s.

Since I'm dealing with a huge amount of text (51 million words, 100,000 trials), I thought I'd turn my attention to the accuracy of the transcription. For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of the two typists were then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process, known as “double rekeying”, helps to ensure the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

Well, as you may have read in the previous posts, I ran each unique string of characters in the corpus through a series of four English language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word that appears in neither of these lists has been put into a third list (which I've called the “unidentified list”). This unidentified list contains 43,000 unique “words” and is, I believe, the best place to look for transcription errors.
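
A minimal sketch of that filtering step might look like the following. It assumes the dictionaries and the surname list have already been loaded into Python sets; the function name `find_unidentified` and the toy word lists are hypothetical stand-ins, not the lists behind the figures in this post. Lower-casing every token is also an assumption.

```python
import re
from collections import Counter

def find_unidentified(text, dictionary_words, surnames):
    """Count every token that appears in neither the dictionary nor the surname list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    known = dictionary_words | surnames
    return Counter(t for t in tokens if t not in known)

# Toy stand-ins for the real lists (assumptions for illustration only).
dictionary_words = {"the", "prisoner", "stole", "a", "silver", "watch", "from"}
surnames = {"william", "smith"}

sample = "The prisoner Wlliam Smith stole a sivler watch from the watchhouse"
unidentified = find_unidentified(sample, dictionary_words, surnames)
print(unidentified.most_common())
```

Keeping the counts alongside the words makes it easy to separate frequent unidentified terms from the ones that occur only once or twice.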

Not all of the words on the unidentified list are in fact errors. Many are archaic verb conjugations or spellings (catched – 1,657 uses or forraign – 1 use), compound words (shopman – 4,036 or watchhouse – 2,661), London place names (Houndsditch – 877), uncommon names that had not been marked up as such during the XML tagging process (Woolnock – 1), Latin words (paena – 1), or abbreviations (knt – 1,921) – short for “knight”, a title used by many gentlemen in the eighteenth century.

On the other hand, many of these words are clearly errors. We see mistyped letters as in “insluence” instead of “influence” or “doughter” instead of “daughter”. We also see transposed letters as in “sivler” instead of “silver”. And there are missing letters: “Wlliam” instead of “William”. Finding the difference between the real words such as “watchhouse” and the errors such as “Wlliam” amongst the 43,000 terms on the unidentified list is the real challenge.

Checking manually is impractical as these terms appear nearly 200,000 times in the corpus. Correcting every single error might not be worth the effort. However, to get an idea of the types of errors that appear and in what proportions, I checked every entry on the unidentified list against the image of the original scanned record for a single session of the court: January 1800. The unidentified words fell into the categories seen in Figure 1.

Figure 1: January 1800 Old Bailey Online transcription errors and the type of error.
The most surprising category here for me is the purple section, which showcases three instances that I would have categorized as typos by the transcribers, but which were actually typos in the original source. This compounds the problem because it means we must acknowledge that in some instances the error is not with the OBO team but is in fact reflecting the content of the contemporary document. From the perspective of a person searching for a particular keyword in the database they may be frustrated by the original error. On the other hand, from the perspective of those who want to be true to the original, that mistake should be preserved. I won't weigh in on that particular issue here, but it is something anyone working to correct transcription errors should consider.

With this in mind we can begin to look at the other categories, and by the looks of things approximately 40% of entries can in theory be corrected if we can figure out the intended word. Admittedly, I only looked at a single session of the trials, and this may not be representative - particularly if we consider Early Modern English, which might lead us to believe earlier trials are more likely to have archaic non-standardized spellings. If however the session from 1800 is roughly representative of a typical session then we should expect to find somewhere in the neighbourhood of 15,000-20,000 errors.

What can we do about it?

How can we automatically find and correct those errors? Given the fail-safes put in place by the double rekeying process, it's already incredibly unlikely that we will find typing errors by the transcribers. That means when we do encounter such errors it's likely only going to happen once or twice, meaning most errors are probably words that appear only once or twice in the corpus and that do not appear on either the dictionary list or the surname list.

That's not to say, of course, that a word has been transcribed correctly just because it appears in the dictionary; however, at this stage it is much easier to identify those errors that are not recognized words. Unfortunately there are over 30,000 unique words on the unidentified list that appear only once, meaning this is still impractical to explore manually. Luckily, the double rekeying means that any remaining mistakes are more likely to come from the transcribers interpreting the marks on the page differently than we might have liked than from fat fingers hitting the wrong key.

The early modern “long S” is the perfect such example. In the early modern era, up to about 1820, it was entirely common to find the letter S represented as what we might think looks like a lower-case “f”. This is the “suck” vs “fuck” problem that the Google N-Grams viewer runs into, as a slew of esses are interpreted as efs. When viewing the result one might be tempted to conclude people had quite a potty mouth on them in the early nineteenth century, as can be seen in Figure 2. Though not necessarily an incorrect assumption, it wouldn't be wise to make the assumption on this particular evidence.

Figure 2: Google N-Gram results for "suck" and "fuck" in the early nineteenth century

When we look through many of the words on the unidentified list it becomes clear that the Long S is a substantial problem. We find examples of the following:
  • abufes
  • afcertained
  • assaffin
  • affaulting
  • affize 
Or, the other way around:
  • assair
  • assixed
  • assluent
  • asorethought
  • artisice 
By writing a Python program that changed the letter F to an S and vice versa, I was able to check if making such a change created a word that was in fact an English word. When I did this I was pointed to several thousand possible typos. As I inspected the list further I noticed there were other common errors probably caused by the very high contrast scans of the original documents. These original documents often included missing parts of letters, difficult to read words, or little bits of dirt or smudges that made interpreting the marks more challenging.
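
The F and S check can be sketched in a few lines. This is an illustration under simple assumptions rather than the program described above: it replaces every f with s (or every s with f) in one go, and `long_s_suggestions` and the toy dictionary are hypothetical names and data.

```python
def long_s_suggestions(word, dictionary_words):
    """Return dictionary words produced by replacing every f with s, or every s with f."""
    variants = {word.replace("f", "s"), word.replace("s", "f")}
    variants.discard(word)  # drop a variant identical to the input (no f or no s to replace)
    return sorted(v for v in variants if v in dictionary_words)

# Toy dictionary (an assumption for illustration only).
dictionary_words = {"abuses", "ascertained", "assassin", "affair", "affixed", "artifice"}

for word in ["abufes", "afcertained", "assaffin", "assair", "assixed", "artisice", "watchhouse"]:
    print(word, "->", long_s_suggestions(word, dictionary_words))
```

A whole-word replacement like this handles the examples listed above, but it would miss a word in which only some of the letters need to change, so a fuller version would have to try the letters individually.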

Some of the most obvious switches were:
  • F / S 
  • I / L 
  • U / N
  • C / E
  • A / O
  • S / Z
  • V / U 
Why these particular switches appeared again and again I'm not entirely sure. Some of them are easy to understand: the lower-case C and lower-case E are easy to mix up, especially when a fleck of dirt shows up in just the right spot on the scan. Others are a bit more difficult to explain, as with U and N, which we wouldn't expect an automated optical character recognition program to have trouble with, but which seem to have stumped the human transcribers repeatedly.

By running these seven sets of letters through the program and testing the results against the English dictionaries I was able to come up with 2,780 suggested corrections. If these are all correct, that simple switching would correct 9,503 typos in the OBO corpus. The results of these changes broken down by letter-pair can be seen in Figure 3.

Figure 3: The number of suggested corrections in the OBO corpus by switching letter pair combinations in misspelled words.
I say suggested corrections because in some cases the switch is actually wrong, or may be wrong. The English dictionaries missed "popery", a common term used to refer to Roman Catholics in the eighteenth century, and the program has instead suggested the unlikely "papery" as an alternative. In 86 cases the switching has come up with two possible suggestions, both of which are English words, at least one of which is obviously incorrect. The unidentified word "faucy" could be "saucy" or "fancy". Turns out it's saucy, referring to the behaviour of a Peter Dayley - that naughty boy.
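
To make the ambiguity concrete, here is a hedged sketch of the letter-pair switching that flags words with more than one plausible correction. It is an illustration, not the original program: `swap_candidates`, `suggest_corrections` and the toy dictionary are hypothetical, and trying every combination of flips within one pair at a time is an assumption about how the switching could work.

```python
from itertools import product

# The seven letter pairs listed above, lower-cased.
LETTER_PAIRS = [("f", "s"), ("i", "l"), ("u", "n"), ("c", "e"),
                ("a", "o"), ("s", "z"), ("v", "u")]

def swap_candidates(word, pair):
    """Yield every variant of `word` with at least one letter of `pair` flipped."""
    a, b = pair
    positions = [i for i, ch in enumerate(word) if ch in pair]
    for flips in product([False, True], repeat=len(positions)):
        if not any(flips):
            continue  # skip the unchanged word
        chars = list(word)
        for pos, flip in zip(positions, flips):
            if flip:
                chars[pos] = b if chars[pos] == a else a
        yield "".join(chars)

def suggest_corrections(word, dictionary_words):
    """Map each letter pair to the dictionary words it can turn `word` into."""
    suggestions = {}
    for pair in LETTER_PAIRS:
        hits = {c for c in swap_candidates(word, pair) if c in dictionary_words}
        if hits:
            suggestions[pair] = hits
    return suggestions

# Toy dictionary (an assumption); note that "popery" is deliberately missing.
dictionary_words = {"saucy", "fancy", "papery", "influence"}

for word in ["faucy", "popery", "insluence"]:
    found = suggest_corrections(word, dictionary_words)
    all_hits = set().union(*found.values()) if found else set()
    status = "single suggestion" if len(all_hits) == 1 else "needs review"
    print(word, found, status)
```

Even a single suggestion can be wrong, as "papery" shows, so the output is best treated as a list of places for a human to look rather than a list of automatic replacements.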

This switcheroo method will not solve all problems. It cannot fix transposed letters, as with sivler and silver; Levenshtein distance is likely needed for that. It does nothing for missing letters as in Wlliam. But it does take us well along the path to making some rather dramatic improvements with a very reasonable amount of effort, and I would argue, could be an economical way to improve the accuracy of projects which have already been transcribed but which suffer from accuracy issues. As with all great things in life this algorithm still requires a human's careful eye, but at least it has pointed that eye in the right direction. And when you're looking at 51 million words of text, that's nine-tenths of the battle.
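
For the transposed and missing letters, an edit-distance check could be sketched as below. Plain Levenshtein distance counts a swap of two neighbouring letters as two edits, so this sketch swaps in the closely related optimal string alignment distance (a restricted form of Damerau-Levenshtein), which counts it as one. The function names and the toy dictionary are hypothetical, and none of this has been run against the OBO corpus.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions, substitutions
    and swaps of two adjacent letters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def closest_words(word, dictionary_words, max_distance=1):
    """Return the dictionary words within `max_distance` edits of `word`."""
    return sorted(w for w in dictionary_words if osa_distance(word, w) <= max_distance)

# Toy dictionary (an assumption for illustration only).
dictionary_words = {"silver", "sister", "william", "daughter"}
for word in ["sivler", "wlliam", "doughter"]:
    print(word, "->", closest_words(word, dictionary_words))
```

Comparing every unidentified word against every dictionary word is slow on a list of 43,000 terms, so in practice the candidates would probably need to be narrowed first, for example by word length.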

If you're working on a project that could use some accuracy improvements, or have explored other ways of achieving similar results, I'd be very happy to hear from you.

Wednesday, January 9, 2013

Whose Lexicon? The Impact of Reporters and Editors on the Old Bailey Proceedings

A few weeks ago I was discussing early modern vocabulary with Tim Hitchcock (as one does on a Wednesday evening). If I recall correctly, he felt that new words likely appeared at roughly the same rate as old words disappeared from the language. In essence, we're not getting a bigger vocabulary, we're just using an ever-shifting one. Ben Schmidt's blog posts on "Age Cohort and Vocabulary Use" and "Predicting Publication Date and Generational Vocabulary Shift" would tend to support this idea. The basis for Schmidt's article was an analysis of the age of authors and how their age impacted the way they used words in 19th century literature. Schmidt found that people learn to use language in a certain way in their youth and tend not to change those patterns very much as they age. This accounts for the slightly different languages whippersnappers and their grandparents speak, even to this day.

I decided to see what I could find out about vocabulary use by digging into the Old Bailey Online (OBO). As many of you undoubtedly know, the OBO is a wonderful corpus of electronic text for anyone interested in Early Modern London. The OBO is an electronic XML version of the Proceedings of the Old Bailey, an abridged transcription of what was said in court for each case held in the Old Bailey courtroom between 1674 and 1914. What we have is not an exact facsimile of every word spoken, but what Magnus Huber believes is “guided by” what was said in court, capturing the ideas, if not always the exact words of the speaker. Though not a perfect transcription of speech, Clive Emsley believes that we can put our faith in the events described in the Proceedings because “the Old Bailey Courthouse was a public place, with numerous spectators, and the reputation of the Proceedings would have quickly suffered if the accounts had been unreliable”. 

Originally intended as a profit-making venture, entertaining the masses with tales of woe in the courtroom, the Proceedings became the official record in 1778 and were required to present a “true, fair and perfect narrative”. Practical limits of course made this difficult. The Proceedings were created entirely without electronic recording devices by shorthand reporters. Many trials therefore appear in significantly condensed form, such as the six-hour trial of Charles Stokes and company from 1787 that is recorded in only 468 words in the published version. Unfortunately, I do not have Huber's annotated version of the OBO corpus. I do, however, have the full set of XML files, downloaded from the Old Bailey Online website (to learn how to do this check out the Programming Historian 2). I decided to focus on the records between the beginning in 1674 and the foundation of the Central Criminal Court in 1834. That gave me 161 years' worth of early modern transcripts for just over 100,000 trials and 51 million words.

What I found was that even with such a wonderful resource we cannot be sure what people said to one another and how closely those speech events relate to the written records we have left. None of us knows how to speak like an eighteenth century Englishman, and based on my digging I don't think that the OBO can teach us how. That's because the vocabulary used in the OBO is the vocabulary of the people who recorded and edited the trial transcript and not of a wider communal lexicon. Instead of looking into vocabulary, I quickly realized I needed to be thinking about vocabularies. When it comes to the original assumption, we're not getting a bigger vocabulary, we're just using an ever-shifting one, the real question is: who is "we"?

Before coming to this conclusion, I looked at the rate of unique words entering the corpus over time. When I calculated this there were more than 120,000 unique words in the corpus, the introduction of which can be seen in Figure 1, showing a surprisingly linear increase. I should note that I'm defining a word as a unique string of characters. "word" and "words" are two different words for my purposes - that is, the corpus was not lemmatized.
Figure 1: The size of the lexicon in the Old Bailey Online
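
The running count of unique words can be sketched as follows, assuming each session is available as a (date, text) pair in chronological order. The function name, the lower-casing and the three toy sessions are illustrative assumptions, not the corpus itself.

```python
import re

def lexicon_growth(sessions):
    """Given (date, text) pairs in chronological order, return (date, lexicon size) pairs."""
    seen = set()
    growth = []
    for date, text in sessions:
        # A "word" is any unique string of letters; no lemmatization is applied.
        seen.update(re.findall(r"[a-z]+", text.lower()))
        growth.append((date, len(seen)))
    return growth

# Made-up sessions for illustration only.
sessions = [
    ("1674-04", "the prisoner stole a watch"),
    ("1674-07", "the prisoner stole two watches"),
    ("1674-09", "a watch was stolen"),
]
print(lexicon_growth(sessions))
```

Because "watch" and "watches" count separately, the total keeps climbing even when no new idea has entered the corpus.
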
Since I work on immigration to London this got me wondering if what we're actually seeing is not necessarily a growth in the vocabulary, but a diversification of names present in the metropolis. Trial transcripts typically discuss people, so we do get a lot of names popping up. New names can come from new people arriving in the city with unique names that no one else had, or when someone pronounces their name with a thick enough accent that a phonetic variation appears in the corpus as a new word (Callaghan, Calaghan, Callagan, Colligan, Calahan, Callaham, Callahan, Calleghan). Early modern parents were still largely not very adventurous with their names so we don't see a lot of children named "Apple" or "Harper Seven" and so creativity is not likely to be a big factor driving the growth of new names. The OBO XML tags make this relatively easy to test. The "persName" tags allowed me to identify and extract all words used to describe people. As it happened, there were roughly 60,000 of these names - about half of the entire lexicon.
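
Pulling the name words out of the XML might be sketched like this. The fragment below is a made-up, heavily simplified stand-in for an OBO file (the real files carry more attributes and nested elements), and `person_name_words` is a hypothetical helper. Only the "persName" element name comes from the corpus markup described above.

```python
import re
import xml.etree.ElementTree as ET

def person_name_words(xml_text):
    """Collect the lower-cased words found inside every persName element."""
    root = ET.fromstring(xml_text)
    words = set()
    for element in root.iter("persName"):
        # itertext() also gathers text from any elements nested inside the name.
        text = "".join(element.itertext())
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words

# Simplified, invented fragment for illustration only.
sample = """<div1>
  <p><persName>Peter Dayley</persName> was indicted for stealing a watch,
  the property of <persName>John Callaghan</persName>.</p>
</div1>"""
print(sorted(person_name_words(sample)))
```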

I also decided to check if the new words represented use of English words, or if they were archaic spellings, acronyms, or otherwise misspelled words, so I ran the entire set through a series of four English-language dictionaries to see if we were dealing with recognizable words. These dictionaries included 90,000 unique "words", including lemmatized variations. Each word could therefore be a "dictionary word", a "name word", a combination of both, or neither.
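
The four-way split described here can be sketched as a simple membership test against two sets. The category labels, the function name and the toy lists are assumptions made for the example.

```python
from collections import Counter

def categorize(word, dictionary_words, name_words):
    """Place a word in one of four categories based on which lists contain it."""
    in_dictionary = word in dictionary_words
    in_names = word in name_words
    if in_dictionary and in_names:
        return "both"
    if in_dictionary:
        return "dictionary"
    if in_names:
        return "name"
    return "neither"

# Toy lists for illustration only.
dictionary_words = {"green", "burglary", "watch"}
name_words = {"green", "callaghan"}

lexicon = ["green", "burglary", "callaghan", "burghlary", "watch"]
tally = Counter(categorize(w, dictionary_words, name_words) for w in lexicon)
print(tally)
```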

The results can be seen in Figure 2, which shows what category the new words fall into, graphed over time along the same scale on the y-axis.

Figure 2: The size of the lexicon at each trial session, broken down by the category of word
The results are, I think, interesting. I'll discuss the "Names List" and the "No List" entries in future posts. The bottom left graph is words that appear both as names and as words in the dictionary. This includes words such as "green", which can be both a person's name and a word describing the colour of something. I've decided not to disambiguate between the two on a word-by-word basis for time reasons.

For now, I'd like to focus on the dictionary terms. Dictionary words seem to be on the rise throughout the eighteenth century. The reason for this may in fact be a slight growth in the lexicon, but I think more likely is a tendency towards increased standardization in the spelling of English words. Samuel Johnson's Dictionary appears in print for the first time in 1755, so this is the age of standardization in lexicography. As anyone who has ever tried to read a seventeenth century text knows, people spelled (spelt?) words differently back then. Over time the "accompts" of criminal activity transform into "accounts". People stop committing "burghlary" and are instead charged with "burglary". This of course occurs shortly before the last "souldiers" are sent off to war.

My four "dictionaries" were built for modern use rather than seventeenth or eighteenth century vocabulary, which means the figures above are a better indicator of when standardized spellings of English words were adopted than they are measures of the lexicon. "Burghlary" should really be counted as a variation of "burglary" rather than as a unique word in the "no list" category. A linguist would tell me that I should have lemmatized my corpus.

But don't despair, not all is lost. I think we still can learn a thing or two about the lexicon as well as the OBO records themselves from this analysis. In Figure 3 you can see the number of new dictionary words per year introduced into the corpus.


Figure 3: The number of new words per year introduced into the OBO corpus.

The number of new terms is highest in the early years. This makes perfect sense as the corpus size starts at zero on the date of the first trial. Before a word appears in the corpus someone has to use it, and that takes time. The big peak in 1689 is an anomaly caused by a single account that was much longer than typically found at the time. Most big peaks can be traced to particularly long accounts, since these generally represent a reporter going into much more detail and therefore using a wider vocabulary. The dip in the early years of the 18th century represents a series of particularly short accounts, as well as some missing accounts. Where it all gets interesting for me is at the first arrow around 1715.

What we see at this first mark is a fairly high number of new dictionary words appearing each year until about the 1740s. The number of new words around 1715 may be artificially high, since we do seem to be missing entries in the previous decade and presumably some of those new words would have appeared earlier if we had the records. Nevertheless, there do appear to be more new words than normal in the following two decades. Perhaps surprisingly, the publication of Johnson's Dictionary, at the second arrow marker, is actually a low point for new word growth in the middle of the eighteenth century. This to me suggests that Johnson was in many respects responding to generally accepted norms of spelling and word use rather than driving the adoption of such uses.

The last arrow is for me the most interesting. The number of new words per year again increases significantly just after 1778, which as I mentioned above was the date that the Proceedings of the Old Bailey became a "true, fair and perfect narrative" - an official record of courtroom activity. Perhaps this shift from a popular to an official record meant a significant change in what went into a trial account. That does seem to be part of the answer. The length of the Proceedings does slowly start to increase, starting in 1783 when the graph jumps upwards.

The fact that there are spikes in new words every time a long transcript appears reinforces the fact that the Proceedings are using a specific vocabulary - one related to criminal justice - as opposed to a vocabulary that's representative of the entire active English lexicon. But the spikes in new words, as well as the number of words in a trial transcript, can tell us even more, if we look at who was writing those words down.

As mentioned previously, Magnus Huber believes the Proceedings are "guided by" what was said in court. Before a spoken word appeared in the Proceedings the original speech event was converted to shorthand by a courtroom reporter and then converted back to prose by the workers in the print shop before being committed to paper. Words represent the attempt of one person to communicate an idea to another. As above, "burghlary" and "burglary" refer to the same idea. The difference between the two spellings is merely a choice in how to record the sounds using letters.

So what effect does changing the scribe have on the rate of new dictionary words entering the corpus? Apparently, quite a lot. I've located the names of the scribes from 1749 onwards in Huber's article, "The Old Bailey Proceedings, 1674-1834: Evaluating and annotating a corpus of 18th- and 19th-century spoken English". When we look at the number of new dictionary words each scribe introduces into the corpus (Figure 4), we see it's certainly not even across the board.

Figure 4: The number of new dictionary words per session added to the corpus, coloured by courtroom reporter.

I recognize there's a lot going on there, so I'll break this down into chunks that are easier to see. What I'd like to draw your attention to is the fact that some scribes increase the size of the corpus significantly, and others do not.
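
One way to produce the per-reporter counts is sketched below, assuming each session has already been reduced to a date label, a reporter name and a list of words. The session labels and word lists are invented for the example and `new_words_by_reporter` is a hypothetical name; only the reporter names come from Huber's list.

```python
from collections import defaultdict

def new_words_by_reporter(sessions, dictionary_words):
    """sessions: chronological (date, reporter, words) tuples.
    Returns {reporter: [(date, number of dictionary words never seen before)]}."""
    seen = set()
    result = defaultdict(list)
    for date, reporter, words in sessions:
        new = {w for w in words if w in dictionary_words and w not in seen}
        seen |= new
        result[reporter].append((date, len(new)))
    return dict(result)

# Invented toy data for illustration only.
dictionary_words = {"the", "prisoner", "stole", "feloniously", "indicted"}
sessions = [
    ("session 1", "Blanchard", ["the", "prisoner", "stole"]),
    ("session 2", "Hodgson", ["the", "prisoner", "feloniously", "stole", "burghlary"]),
    ("session 3", "Hodgson", ["feloniously", "indicted"]),
]
print(new_words_by_reporter(sessions, dictionary_words))
```

A word only counts as new the first time it appears anywhere in the corpus, so a reporter who starts late has fewer easy words left to introduce. That is worth keeping in mind when comparing reporters across decades.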

Firstly, let's look at Hodgson, who was holding the transcriber's pen from 1782 to 1792 (Figure 5).

Figure 5: The number of new dictionary words in the Old Bailey corpus from 1781 to 1795, coloured by courtroom reporter.

Though W Blanchard was only on the job for a few months, it's quite clear that E Hodgson was on average adding more new words to the corpus each month than had his predecessor. The number of new words Hodgson adds starts slowly, but then picks up rather dramatically before tailing off towards the end of his tenure. Hodgson's "reign" so to speak also overlaps with the significant growth in the size of the Proceedings mentioned above. From the graph and the trendlines, we might make the following conclusions:
  • Hodgson was verbose and reported more than his predecessors
  • He had a larger than average vocabulary that he happily shared
  • After a few years he "used up" his vocabulary and ceased to find as many new words
Hodgson has therefore influenced both the vocabulary used in the Proceedings as well as the length of the resultant documents. We would be naive therefore to suggest that Hodgson was an impartial observer or that the Proceedings during this period are anything but the output of Hodgson himself. The records were not "created"; they were created by Hodgson. The way Hodgson created the records was distinct from how the others did so.

Moving forward slightly in time, let's consider the next three scribes who wrote between 1792 and 1801 (Figure 6).

Figure 6: The number of new dictionary words in the Old Bailey corpus from 1792 to 1801, coloured by courtroom reporter.

The effect of different writers here is perhaps more obvious. Silby doesn't seem to be one for new words, whereas Marson and Ramsay in the blue definitely are. The fact that Ramsay stays on afterwards and the growth of the vocabulary is stunted thereafter suggests that it was Marson driving this change. It's becoming clear that anyone working with these records should be wary of who was responsible for creating them in the first place.

The last section I'd like to highlight is the final one from 1816 to 1834 when a single scribe named H Buckler was on the job. However, Buckler worked under a series of editors as can be seen in Figure 7.

Figure 7: The number of new dictionary words in the Old Bailey corpus from 1816 to 1834, coloured by editor.
In this case, the scribe stays the same yet we still see patterns that seem to make more sense when we know there's a different editor publishing the Proceedings. Clearly when Stokes takes over in 1828 there's a change to the resultant lexicon that spikes up, presumably after he had become confident in his new role after a few months on the job. This set is particularly interesting because from what we can tell, the scribe H Buckler doesn't seem to be the one driving the adoption of new words. From 1816 to 1828 he's one of the less ambitious in this category, but that begins to change as new editors take over.

Conclusions

How does this all tie back to my original questions about vocabulary in the early modern era? Well, first of all, I failed to test what I set out to understand. Because I did not lemmatize my corpus I was not able to determine if we do in fact have a growing vocabulary or a shifting one. What I should have looked at was a moving window of word use, calculating how many words were used in a given ten-year period. I'll leave that for another day or another researcher to take on if they feel so inclined. My suspicion is that we have both a growing and shifting vocabulary. Words are falling into disuse or at least out of regular use. I imagine that Ben Schmidt's analysis explains most of that shift: young kids don't learn - or at least don't use - the same words as their parents or grandparents. Words die one funeral at a time.
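
For anyone inclined to take that on, the moving window could be sketched roughly as follows. It assumes the corpus has been reduced to a mapping from year to the set of words used that year, ideally after lemmatization. The function names, the years and the words in the toy data are invented for the example.

```python
def words_in_window(words_by_year, start_year, window=10):
    """Distinct words used in the years start_year .. start_year + window - 1."""
    used = set()
    for year in range(start_year, start_year + window):
        used |= words_by_year.get(year, set())
    return used

def window_sizes(words_by_year, window=10):
    """Map each possible window start year to the number of distinct words in that window."""
    first, last = min(words_by_year), max(words_by_year)
    return {start: len(words_in_window(words_by_year, start, window))
            for start in range(first, last - window + 2)}

# Invented toy data for illustration only.
words_by_year = {
    1700: {"accompt", "souldier"},
    1705: {"accompt", "watch"},
    1712: {"account", "watch"},
    1715: {"account", "soldier"},
}
sizes = window_sizes(words_by_year)
print(sizes)
```

A window count that stays flat while the cumulative lexicon keeps growing would point to a shifting vocabulary, whereas a rising window count would point to a growing one.
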
The reason I wasn't able to see a shifting vocabulary using the OBO corpus is that the OBO does not represent a single person's vocabulary. Instead, it roughly represents the combined vocabulary of people who appear one way or another in the Old Bailey courtroom to discuss matters of law and justice, filtered through a courtroom reporter and an editor. As I discovered, those last two have a much bigger impact on the corpus than we might have liked to imagine. For anyone who looks at the language of the court or indeed the format of the transcripts by using the OBO corpus, I'd urge you to keep the impact of those reporters and editors in mind. For anyone studying academic history, do note that what seems like a window into the past is in fact the product of a few individuals who made decisions, perhaps without even being aware of them, that shaped what was recorded, what was not, and what words were used to do so.