Showing posts with label transcription. Show all posts

Sunday, February 10, 2013

Identifying and Fixing Transcription Errors in Large Corpuses

"Underwood 11 Typewriter", by Alex Kerhead.
This is the third post in my series on the Old Bailey Online (OBO) corpus. In previous posts I looked at the impact of courtroom reporters and editors on the vocabulary used in the Old Bailey trial transcripts, and at ways of measuring the diversity of immigration in London between the 1680s and 1830s.

Since I'm dealing with a huge amount of text (51 million words, 100,000 trials), I thought I'd turn my attention to the accuracy of the transcription. For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of the two typists were then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process, known as “double rekeying”, helps ensure the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

Well, as you may have read in the previous posts, I ran each unique string of characters in the corpus through a series of four English-language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word that appears on neither the dictionary list nor the surname list has been put into a third list (which I've called the “unidentified list”). This unidentified list contains 43,000 unique “words” and I believe it is the best place to look for transcription errors.
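The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration and not the program actually used for the OBO: the function name `build_unidentified`, the tiny word lists, and the decision to lower-case everything and split on runs of letters are all assumptions made for the example.

```python
import re
from collections import Counter


def build_unidentified(text, dictionary_words, surnames):
    """Count every token and keep those found in neither reference list."""
    # Assumption: matching is case-insensitive, so both lists are lower-cased.
    known = {w.lower() for w in dictionary_words} | {s.lower() for s in surnames}
    # Assumption: a "word" is any unbroken run of the letters a-z.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if word not in known}


# Toy stand-ins for the real dictionaries and surname list.
sample_text = "Wlliam Smith catched the sivler spoon and the silver plate"
dictionary_words = {"the", "spoon", "and", "silver", "plate"}
surnames = {"Smith"}

unidentified = build_unidentified(sample_text, dictionary_words, surnames)
print(unidentified)
```

Keeping the counts alongside each unidentified word matters later, because how often a string appears is one of the few clues available for separating real words from typos.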

Not all of the words on the unidentified list are in fact errors. Many are archaic verb conjugations or spellings (catched – 1,657 uses or forraign – 1 use), compound words (shopman – 4,036 or watchhouse – 2,661), London place names (Houndsditch – 877), uncommon names that had not been marked up as such during the XML tagging process (Woolnock – 1), Latin words (paena – 1), or abbreviations (knt – 1,921) – short for “knight”, a title used by many gentlemen in the eighteenth century.

On the other hand, many of these words are clearly errors. We see mistyped letters as in “insluence” instead of “influence” or “doughter” instead of “daughter”. We also see transposed letters as in “sivler” instead of “silver”. And there are missing letters: “Wlliam” instead of “William”. Finding the difference between the real words such as “watchhouse” and the errors such as “Wlliam” amongst the 43,000 terms on the unidentified list is the real challenge.

Checking manually is impractical as these terms appear nearly 200,000 times in the corpus. Correcting every single error might not be worth the effort. However, to get an idea of the types of errors that appear and in what proportions, I checked every entry on the unidentified list against the image of the original scanned record for a single session of the court: January 1800. The unidentified words fell into the categories seen in Figure 1.

Figure 1: January 1800 Old Bailey Online transcription errors and the type of error.
The most surprising category here for me is the purple section, which showcases three instances that I would have categorized as typos by the transcribers, but which were actually typos in the original source. This compounds the problem because it means we must acknowledge that in some instances the error is not with the OBO team but is in fact reflecting the content of the contemporary document. From the perspective of a person searching for a particular keyword in the database they may be frustrated by the original error. On the other hand, from the perspective of those who want to be true to the original, that mistake should be preserved. I won't weigh in on that particular issue here, but it is something anyone working to correct transcription errors should consider.

With this in mind we can begin to look at the other categories, and by the looks of things approximately 40% of entries can in theory be corrected if we can figure out the intended word. Admittedly, I only looked at a single session of the trials, and this may not be representative - particularly if we consider Early Modern English, which might lead us to believe earlier trials are more likely to have archaic non-standardized spellings. If however the session from 1800 is roughly representative of a typical session then we should expect to find somewhere in the neighbourhood of 15,000-20,000 errors.

What can we do about it?

How can we automatically find and correct those errors? Given the fail-safes put in place by the double rekeying process, it's already incredibly unlikely that we will find typing errors by the transcribers. That means when we do encounter such errors it's likely only going to happen once or twice, meaning most errors are probably words that appear only once or twice in the corpus and that do not appear on either the dictionary list or the surname list.

That's not to say, of course, that a word has been transcribed correctly just because it appears in the dictionary; however, at this stage it is much easier to identify those errors that are not recognized words. Unfortunately there are over 30,000 unique words on the unidentified list that appear only once, meaning this is still impractical to explore manually. Luckily, the double rekeying means that any mistakes are more likely to be a matter of the transcribers interpreting the marks on the page differently than we might have liked than a case of fat fingers hitting the wrong key.
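Narrowing the unidentified list to strings that occur only once or twice is a simple filter. The sketch below assumes the list is held as a mapping from word to number of uses; the helper name `rare_words` is hypothetical, and the count given for "wlliam" is invented for illustration (the other counts are the ones quoted earlier in this post).

```python
def rare_words(unidentified_counts, max_uses=2):
    """Return the unidentified words used at most `max_uses` times."""
    return sorted(w for w, n in unidentified_counts.items() if n <= max_uses)


counts = {
    "catched": 1657,
    "shopman": 4036,
    "knt": 1921,
    "forraign": 1,
    "wlliam": 1,  # assumed count, for illustration only
}

candidates = rare_words(counts)
print(candidates)
```

Note that "forraign" survives the filter even though it is an archaic spelling and not a typo, so rarity only narrows the search and does not settle anything on its own.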

The early modern “long S” is the perfect such example. In the early modern era, up to about 1820, it was entirely common to find the letter S represented as what we might think looks like a lower-case “f”. This is the “suck” vs “fuck” problem that the Google N-Grams viewer runs into, as a slew of esses are interpreted as efs. When viewing the result one might be tempted to conclude people had quite a potty mouth on them in the early nineteenth century, as can be seen in Figure 2. Though not necessarily an incorrect assumption, it wouldn't be wise to make the assumption on this particular evidence.

Figure 2: Google N-Gram results for "suck" and "fuck" in the early nineteenth century

When we look through many of the words on the unidentified list it becomes clear that the Long S is a substantial problem. We find examples of the following:
  • abufes
  • afcertained
  • assaffin
  • affaulting
  • affize 
Or, the other way around:
  • assair
  • assixed
  • assluent
  • asorethought
  • artisice 
By writing a Python program that changed the letter F to an S and vice versa, I was able to check if making such a change created a word that was in fact an English word. When I did this I was pointed to several thousand possible typos. As I inspected the list further I noticed there were other common errors probably caused by the very high contrast scans of the original documents. These original documents often included missing parts of letters, difficult to read words, or little bits of dirt or smudges that made interpreting the marks more challenging.
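Here is one way such a letter-swapping check might look. It is a sketch under assumptions and not the original program: it tries every combination of F/S swaps within a word (so that "assaffin" can become "assassin" while its first two esses stay put), and the names `swap_variants` and `suggest` and the five-word dictionary are invented for the example.

```python
from itertools import product


def swap_variants(word, a="f", b="s"):
    """Yield every spelling made by swapping a<->b at one or more positions."""
    positions = [i for i, ch in enumerate(word) if ch in (a, b)]
    for flips in product((False, True), repeat=len(positions)):
        if not any(flips):
            continue  # skip the combination that leaves the word unchanged
        letters = list(word)
        for pos, flip in zip(positions, flips):
            if flip:
                letters[pos] = b if letters[pos] == a else a
        yield "".join(letters)


def suggest(word, dictionary, a="f", b="s"):
    """Return the swapped spellings that are known dictionary words."""
    return sorted({v for v in swap_variants(word, a, b) if v in dictionary})


# Toy dictionary standing in for the real English word lists.
dictionary = {"abuses", "assassin", "affair", "aforethought", "artifice"}

for unknown in ["abufes", "assaffin", "assair", "asorethought", "artisice"]:
    print(unknown, "->", suggest(unknown, dictionary))
```

The number of variants doubles with every F or S in the word, which is fine for ordinary English words but would need a cap for very long strings.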

Some of the most obvious switches were:
  • F / S 
  • I / L 
  • U / N
  • C / E
  • A / O
  • S / Z
  • V / U 
Why these particular switches appeared again and again I'm not entirely sure. Some of them are easy to understand: the lower-case C and lower-case E are easy to mix up, especially when a fleck of dirt shows up in just the right spot on the scan. Others are a bit more difficult to explain, as with U and N, which we wouldn't expect an automated optical character recognition program to have trouble with, but which seem to have stumped the human transcribers repeatedly.

By running these seven sets of letters through the program and testing the results against the English dictionaries I was able to come up with 2,780 suggested corrections. If these are all correct, that simple switching would correct 9,503 typos in the OBO corpus. The results of these changes broken down by letter-pair can be seen in Figure 3.

Figure 3: The number of suggested corrections in the OBO corpus by switching letter pair combinations in misspelled words.
I say suggested corrections because in some cases the switch is actually wrong, or may be wrong. The English dictionaries missed "popery", a common term used to refer to Roman Catholics in the eighteenth century, and so the program instead suggested the unlikely "papery" as an alternative. In 86 cases the switching came up with two possible suggestions, both of which are English words, at least one of which is obviously incorrect. The unidentified word "faucy" could be "saucy" or "fancy". Turns out it's saucy, referring to the behaviour of a Peter Dayley - that naughty boy.
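The ambiguity problem can be made concrete with a small sketch that tries all seven letter pairs and records which pair produced each suggestion. As before, this is an illustration under assumptions and not the original program: `PAIRS`, `variants`, and `suggestions_by_pair` are made-up names, and the dictionary holds only the handful of words needed for the demonstration.

```python
from itertools import product

# The seven letter pairs listed above, lower-cased.
PAIRS = [("f", "s"), ("i", "l"), ("u", "n"), ("c", "e"),
         ("a", "o"), ("s", "z"), ("v", "u")]


def variants(word, a, b):
    """Yield every spelling made by swapping a<->b at one or more positions."""
    spots = [i for i, ch in enumerate(word) if ch in (a, b)]
    for flips in product((False, True), repeat=len(spots)):
        if any(flips):
            letters = list(word)
            for i, flip in zip(spots, flips):
                if flip:
                    letters[i] = b if letters[i] == a else a
            yield "".join(letters)


def suggestions_by_pair(word, dictionary):
    """Map each dictionary word reachable by one pair's swaps to that pair."""
    found = {}
    for a, b in PAIRS:
        for v in variants(word, a, b):
            if v in dictionary:
                found.setdefault(v, f"{a.upper()}/{b.upper()}")
    return found


dictionary = {"saucy", "fancy", "papery"}  # note: "popery" is missing

for unknown in ["faucy", "popery"]:
    result = suggestions_by_pair(unknown, dictionary)
    flag = "needs review" if len(result) != 1 else "single suggestion"
    print(unknown, result, flag)
```

A word with two suggestions, like "faucy", is easy to flag automatically. The "popery" case is harder: the program returns a single confident-looking suggestion that is wrong, which is why a human still has to read the list.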

This switcheroo method will not solve all problems. It cannot fix transposed letters, as with sivler and silver; Levenshtein distance is likely needed for that. It does nothing for missing letters as in Wlliam. But it does take us well along the path to making some rather dramatic improvements with a very reasonable amount of effort, and I would argue, could be an economical way to improve the accuracy of projects which have already been transcribed but which suffer from accuracy issues. As with all great things in life this algorithm still requires a human's careful eye, but at least it has pointed that eye in the right direction. And when you're looking at 51 million words of text, that's nine-tenths of the battle.
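For the two cases the switching cannot reach, a cheap alternative to computing full edit distances is to generate candidates directly: every adjacent-letter transposition and every single-letter insertion, each checked against the dictionary. This is candidate generation, not a Levenshtein implementation, and I have not run it on the OBO; the function names and the two-word dictionary are assumptions for the example. (Plain Levenshtein distance counts a transposition as two edits, while the Damerau-Levenshtein variant counts it as one.)

```python
import string


def transpositions(word):
    """All spellings made by swapping one pair of adjacent letters."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)}


def insertions(word):
    """All spellings made by inserting one lower-case letter anywhere."""
    return {word[:i] + ch + word[i:]
            for i in range(len(word) + 1)
            for ch in string.ascii_lowercase}


def suggest_edits(word, dictionary):
    """Return dictionary words one transposition or one insertion away."""
    candidates = transpositions(word) | insertions(word)
    return sorted(c for c in candidates if c in dictionary and c != word)


dictionary = {"silver", "william"}  # toy dictionary

print(suggest_edits("sivler", dictionary))
print(suggest_edits("wlliam", dictionary))
```

Insertions are far more permissive than the seven letter switches, so on a real dictionary they are likely to produce many more false suggestions for short words.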

If you're working on a project that could use some accuracy improvements, or have explored other ways of achieving similar results, I'd be very happy to hear from you.

Wednesday, April 4, 2012

Tricks for Transcribing High-Contrast Historical Reproductions

If you spend enough time as a historical researcher, you're bound to come across the black blob. The blob - also referred to by its more technical name: "those letters I can't make out because of the stupid contrast levels on the reproduction" - is far more common than many of us would like, especially in online databases containing copies of original historical materials. It may not be the fault of the digitizers; the problem may have first occurred decades ago when the source was transferred to microfilm or microfiche. Whatever the cause, it forces many a historian to squint and hypothesize about what lies behind. This post will provide a possible solution to the blob, using free software and straightforward techniques. It will not work in all cases, but it may conquer some blobs.

The above image is an unadulterated screenshot of a Vagrancy Removal Record from Middlesex County in the 1780s, found on the London Lives website. The original source contains lists of names of those forcibly removed from Middlesex County. We've clearly got an Elizabeth "Eliz" and a Joseph here. But the contrast on the image is too high to make out their surnames. London Lives does offer full transcripts of everything on the website. Unfortunately, the transcribers were unable to decipher the names and left these particular entries incomplete. We too could pass them by, but if we are interested in what's underneath we can turn to a photo editing program to make an attempt.

This tutorial uses GIMP, a free open-source image processing program not unlike Photoshop. Feel free to use the program with which you are most comfortable.

Step 1: Save the Original Image

I was using a Mac, so I took advantage of the handy screen capture feature (Cmd + Shift + 4), which allowed me to snag only the part of the image I was interested in correcting. Alternatively you could save the whole image by right-clicking it and using the "Save As" feature.

Step 2: Open the Image in an Image Processing Program

As mentioned above, if you do not already have an image processing program, try out GIMP. It is free after all.

Step 3: Adjust Brightness / Contrast

Open the "Brightness/Contrast" box located under the "Color" menu. Increase the brightness and contrast. In this example I've changed brightness by 118 and contrast by 103. Play with the sliders to get a result that works best for your particular source. You may even find it works better for you if you decrease one or the other. If you max out the values and need even more brightness or contrast, click OK and re-open the same dialogue box. This will allow you to repeat the process. You should notice some of the black blob beginning to fade and reveal hints of what might be underneath. This will probably occur first closer to the edge of the blob. You may now have all the information you need to finish the transcription. If so, great. If not, keep reading.


Step 4: Colorize

This feature is also located under the "Color" menu. This will help us to make the hidden letters pop out from amidst the shades of grey and black. Feel free to play with the sliders here to see if you can brighten up the results to the point where you are comfortable reading them. Sometimes I find it helps to decrease the "lightness" value while increasing the "saturation".

If you are still having trouble reading the words you can go back and repeat the process by again adjusting the contrast and brightness, and fiddling with the colours even more. If that doesn't work, you can move on to step 5.


Step 5: Trace What you Can See

For this step I like to use a USB tablet and pen, which lets me write to the screen in a fashion that's a bit more natural feeling than trying to draw using my mouse. If you don't have one you can do it with a mouse too. Choose the pen tool from the Toolbox and reduce the "Scale" of the brush to something appropriate for the size of the handwriting. Next, choose a nice bright colour that will stand out against the background colours you have chosen. Then, take your time and trace over whatever letters or bits of letters you can see.


As you can see from this example, we have been quite successful. What was once a "man Eliz" and a "ll Joseph" is quite obviously a "Hayman Eliz" and a "Hill Joseph". We did not get every part of every letter, but we did get enough new information to piece together the missing names.

This process may take a few minutes, but it can be worthwhile if your project depends upon deciphering the letters beyond the black blob. Unfortunately, it will not work in all cases. For this technique to work you do need a black blob with some shading variation. Computers store images as a series of coloured pixels with values ranging from completely black to completely white. Many black blobs are actually very dark grey blobs that look black to our eye. If there are shades of grey in your blob, and those shades correspond with the hidden letters underneath, as is often the case, then this technique may help you peel back the black and find what you are looking for.
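To see why shading variation matters, here is a small Python sketch that stretches a narrow band of dark grey values across the full black-to-white range. It is not GIMP's Brightness/Contrast algorithm, just a plain linear "levels" stretch on a made-up grid of pixel values (0 is black, 255 is white), and the function name `stretch_levels` is my own.

```python
def stretch_levels(pixels, low, high):
    """Linearly map grey values in [low, high] onto the full 0-255 range."""
    if high <= low:
        raise ValueError("high must be greater than low")
    span = high - low
    stretched = []
    for row in pixels:
        # Values outside [low, high] are clamped to pure black or pure white.
        stretched.append(
            [max(0, min(255, round((v - low) * 255 / span))) for v in row]
        )
    return stretched


# A "blob" that looks uniformly black: every value is between 10 and 30.
blob = [
    [10, 10, 30],
    [10, 30, 10],
]

revealed = stretch_levels(blob, low=10, high=30)
print(revealed)
```

After the stretch, the slightly lighter pixels become white and the darker ones become black, so the hidden pattern is visible. If every pixel in the blob had exactly the same value, there would be nothing to stretch and no adjustment could recover the letters.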

Happy transcribing.