|"Underwood 11 Typewriter", by Alex Kerhead.|
Since I'm dealing with a huge amount of text (51 million words, 100,000 trials), I thought I'd turn my attention to the accuracy of the transcription. For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of each typist was then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process known as “double rekeying” ensures the accuracy of the finished text.
But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?
Well as you may have read in the previous posts, I ran each unique string of characters in the corpus through a series of four English language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word in neither of these lists has been put into a third list (which I've called the “unidentified list”). This unidentified list contains 43,000 unique “words” and I believe is the best place to look for transcription errors.
Not all of the words on the unidentified list are in fact errors. Many are archaic verb conjugations or spellings (catched – 1,657 uses or forraign – 1 use), compound words (shopman – 4,036 or watchhouse – 2,661), London place names (Houndsditch – 877), uncommon names that had not been marked up as such during the XML tagging process (Woolnock – 1), Latin words (paena – 1), or abbreviations (knt – 1,921) – short for “knight”, a title used by many gentlemen in the eighteenth century.
On the other hand, many of these words are clearly errors. We see mistyped letters as in “insluence” instead of “influence” or “doughter” instead of “daughter”. We also see transposed letters as in “sivler” instead of “silver”. And there are missing letters: “Wlliam” instead of “William”. Finding the difference between the real words such as “watchhouse” and the errors such as “Wlliam” amongst the 43,000 terms on the unidentified list is the real challenge.
Checking manually is impractical as these terms appear nearly 200,000 times in the corpus. Correcting every single error might not be worth the effort. However, to get an idea for the types of errors we see appearing and in what proportions, I checked every entry on the unidentified list against the image of the original scanned record during a single session of the court: January 1800. The unidentified words fell into the categories seen in Figure 1.
|Figure 1: January 1800 Old Bailey Online transcription errors and the type of error.|
With this in mind we can begin to look at the other categories, and by the looks of things approximately 40% of entries can in theory be corrected if we can figure out the intended word. Admittedly, I only looked at a single session of the trials, and this may not be representative - particularly if we consider Early Modern English, which might lead us to believe earlier trials are more likely to have archaic non-standardized spellings. If however the session from 1800 is roughly representative of a typical session then we should expect to find somewhere in the neighbourhood of 15,000-20,000 errors.
What can we do about it?
How can we automatically find and correct those errors? Given the fail-safes put in place by the double rekeying process, it's already incredibly unlikely that we will find typing errors by the transcribers. That means when we do encounter such errors it's likely only going to happen once or twice, meaning most errors are probably words that appear only once or twice in the corpus and that do not appear on either the dictionary list or the surname list.
That's not to say of course that just because a word appears in the dictionary that it is not transcribed incorrectly; however, at this stage it is much easier to identify those errors that are not recognized words. Unfortunately there are over 30,000 unique words on the unidentified list that appear only once, meaning this is still impractical to explore manually. Luckily the double rekeying means that any mistakes are more likely to be a matter of the transcriber interpreting the marks on the page differently than we might have liked them to than it is a case of fat fingers hitting the wrong key.
The early modern “long S” is the perfect such example. In the early modern era, up to about 1820, it was entirely common to find the letter S represented as what we might think looks like a lower-case “f”. This is the “suck” vs “fuck” problem that the Google N-Grams viewer runs into, as a slew of esses are interpreted as efs. When viewing the result one might be tempted to conclude people had quite a potty mouth on them in the early nineteenth century, as can be seen in Figure 2. Though not necessarily an incorrect assumption, it wouldn't be wise to make the assumption on this particular evidence.
|Figure 2: Google N-Gram results for "suck" and "fuck" in the early nineteenth century|
When we look through many of the words on the unidentified list it becomes clear that the Long S is a substantial problem. We find examples of the following:
Some of the most obvious switches were:
- F / S
- I / L
- U / N
- C / E
- A / O
- S / Z
- V / U
By running these seven sets of letters through the program and testing the results against the English dictionaries I was able to come up with 2,780 suggested corrections. If these are all correct, that simple switching would correct 9,503 typos in the OBO corpus. The results of these changes broken down by letter-pair can be seen in Figure 3.
|Figure 3: The number of suggested corrections in the OBO corpus by switching letter pair combinations in misspelled words.|
This switcheroo method will not solve all problems. It cannot fix transposed letters, as with sivler and silver; Levenstein distance is likely needed for that. It does nothing for missing letters as in Wlliam. But it does take us well along the path to making some rather dramatic improvements with a very reasonable amount of effort, and I would argue, could be an economical way to improve the accuracy of projects which have already been transcribed but which suffer from accuracy issues. As with all great things in life this algorithm still requires a human's careful eye, but at least it has pointed that eye in the right direction. And when you're looking at 51 million words of text, that's nine-tenths of the battle.
If you're working on a project that could use some accuracy improvements, or have explored other ways of achieving similar results, I'd be very happy to hear from you.