
Monday, April 15, 2013

Trust Me: The Old Bailey Online as a model for digitization projects

The Old Bailey Online (OBO) turned 10 years old this week, and to celebrate, Sharon Howard has been encouraging blog posts and tweets from the project's wide network of contributors. I thought I'd add just a few brief thoughts on what I like about the OBO, and why I avoid so many other competing digitization projects. Rather than explain what the OBO is, I thought I'd save time and steal the explanation from their own website:
A fully searchable edition of the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London's central criminal court.
The trials run from 1678 to 1914, making it a great resource for social historians or historians of crime. I broadly fit into both of those categories, but what really interests me is knowledge management. I want to know how we can extract useful knowledge from bodies of text far larger than we could ever read in our lifetime. I'm interested in the historical research questions I pursue, but I'm more interested in the processes of understanding and discovery that pursuing those questions lets me explore. That is to say: I'm more interested in how we can know something than in what we find out. This all means I have slightly different criteria for a good resource than does a typical historian. When I'm planning a project I'm not looking for 'gaps in the literature'. Instead, I'm really only looking for two things:
  1. A corpus of downloadable electronic text
  2. A corpus that does not assume I want to read anything
 1) A Corpus of Electronic Text

At the moment my work is almost exclusively based on textual analysis. By that I mean I work with words rather than sounds or images or smells or physical objects. I want to know what human knowledge is contained in the symbols on pages. That means the best thing you can give me is a good clean set of electronic text. The Old Bailey Online does this beautifully - better than just about anyone else, actually - by providing more than a hundred million words of transcription. Most important: the OBO is entirely downloadable. That means I can put it on my own computer and I can measure it, twist it around, write programs to analyse it, use other people's programs...anything I like. No one is going to threaten to sue me or press criminal charges for downloading the records. And best of all, once I have the records I don't have to read them. Because that's not the focus of what I do.

2) A Corpus That Does Not Assume I Want to Read Anything

I'm certainly not one to suggest reading is obsolete, or that historians should stop going to the archives. But I'm always disheartened to see new scholarly - usually commercial - databases come online that only allow reading. I'm talking about the ones that cost an arm and a leg to university libraries, let you keyword search, but then force you to read a scanned copy of the original while hiding the electronic text layer.

I find these projects infuriating, and would rather pretend they don't exist than struggle to find a research question that's appropriate for their limited interface. The thing that bothers me most about these gated resources is that the publishers who create them are implicitly saying: we don't trust you. They don't trust us because the only thing they possess that allows them to sell their product is the electronic text. That's the part of the project that cost the most and took the longest to create. They think if that starts floating around on the Internet they won't be able to make money anymore.

The OBO is different because it's non-commercial. The OBO trusts us and encourages anyone interested to use the records to explore human knowledge in any way they see fit. For some that means sitting down and reading from digital copies of the original source. For others like me, it means downloading the entire corpus and measuring the rates of transcription errors, or the impact of courtroom reporters on the vocabulary used in the records, or the pace of migration in eighteenth-century London.

The OBO and its team have trusted us, and from that trust has poured forth far more research about early modern crime in London than anyone could have imagined. Perhaps more research than we need. Meanwhile, researchers like me continue to ignore the large commercial databases that lock up access to their resources, and hope intently that their publishers will learn from what is still the best online scholarly database I've worked with. We're starting to see steps forward from some (see the Library of Wales' Newspaper Collection for a good example), but overall there's room to improve.

Until we see a shift away from mandated reading, I'll stick to resources like the OBO. So happy birthday to the OBO and cheers to the project team for trusting us. I hope it's paid off.

Sunday, February 10, 2013

Identifying and Fixing Transcription Errors in Large Corpuses

"Underwood 11 Typewriter", by Alex Kerhead.
This is the third post in my series on the Old Bailey Online (OBO) corpus. In previous posts I looked at the impact of courtroom reporters and editors on the vocabulary used in the Old Bailey trial transcripts, and at ways of measuring the diversity of immigration in London between the 1680s and 1830s.

Since I'm dealing with a huge amount of text (51 million words, 100,000 trials), I thought I'd turn my attention to the accuracy of the transcription. For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The two transcriptions were then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process, known as “double rekeying”, ensures the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 3,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

Well, as you may have read in the previous posts, I ran each unique string of characters in the corpus through a series of four English-language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word appearing in neither of these lists was put into a third list (which I've called the “unidentified list”). This unidentified list contains 43,000 unique “words” and is, I believe, the best place to look for transcription errors.
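That filtering step can be sketched in a few lines of Python. This is not the project's actual code, and the tiny word lists below are stand-ins for the 80,000-word dictionaries and the 60,000-name surname list:

```python
def build_unidentified(tokens, dictionary, surnames):
    """Return every unique token found in neither reference list."""
    known = dictionary | surnames
    return {t for t in set(tokens) if t not in known}

# Toy stand-ins for the real word lists:
dictionary = {"silver", "influence", "daughter", "house"}
surnames = {"dayley", "hayman", "hill"}
tokens = ["silver", "sivler", "dayley", "wlliam", "influence"]

print(sorted(build_unidentified(tokens, dictionary, surnames)))
# → ['sivler', 'wlliam']
```

In practice the reference lists would be read from files and lowercased, and the tokens would come from the downloaded corpus itself.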

Not all of the words on the unidentified list are in fact errors. Many are archaic verb conjugations or spellings (catched – 1,657 uses or forraign – 1 use), compound words (shopman – 4,036 or watchhouse – 2,661), London place names (Houndsditch – 877), uncommon names that had not been marked up as such during the XML tagging process (Woolnock – 1), Latin words (paena – 1), or abbreviations (knt – 1,921) – short for “knight”, a title used by many gentlemen in the eighteenth century.

On the other hand, many of these words are clearly errors. We see mistyped letters, as in “insluence” instead of “influence” or “doughter” instead of “daughter”. We also see transposed letters, as in “sivler” instead of “silver”. And there are missing letters: “Wlliam” instead of “William”. Distinguishing the real words such as “watchhouse” from the errors such as “Wlliam” amongst the 43,000 terms on the unidentified list is the real challenge.

Checking manually is impractical, as these terms appear nearly 200,000 times in the corpus, and correcting every single error might not be worth the effort. However, to get an idea of the types of errors appearing, and in what proportions, I checked every entry on the unidentified list against the image of the original scanned record for a single session of the court: January 1800. The unidentified words fell into the categories seen in Figure 1.

Figure 1: January 1800 Old Bailey Online transcription errors and the type of error.
The most surprising category here for me is the purple section, which showcases three instances that I would have categorized as typos by the transcribers, but which were actually typos in the original source. This compounds the problem, because it means that in some instances the error is not the OBO team's but in fact reflects the content of the contemporary document. Someone searching for a particular keyword in the database may be frustrated by the original error; on the other hand, those who want to be true to the original will argue the mistake should be preserved. I won't weigh in on that particular issue here, but it is something anyone working to correct transcription errors should consider.

With this in mind we can begin to look at the other categories, and by the looks of things approximately 40% of entries can in theory be corrected if we can figure out the intended word. Admittedly, I only looked at a single session of the trials, and this may not be representative - particularly if we consider Early Modern English, which might lead us to believe earlier trials are more likely to have archaic non-standardized spellings. If however the session from 1800 is roughly representative of a typical session then we should expect to find somewhere in the neighbourhood of 15,000-20,000 errors.

What can we do about it?

How can we automatically find and correct those errors? Given the fail-safes of the double rekeying process, it is incredibly unlikely that both typists made the same slip in the same place, so any given error will probably occur only once or twice. That means most errors are likely words that appear only once or twice in the corpus and that appear on neither the dictionary list nor the surname list.

That's not to say, of course, that a word appearing in the dictionary is necessarily transcribed correctly; it is simply much easier at this stage to identify those errors that are not recognized words. Unfortunately, over 30,000 unique words on the unidentified list appear only once, so exploring them manually is still impractical. Luckily, double rekeying means that any mistakes are more likely to be a matter of the transcriber interpreting the marks on the page differently than we might have liked, rather than a case of fat fingers hitting the wrong key.
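Under that assumption, a first pass is simply to rank the unidentified words by corpus frequency and keep the rare ones. A minimal sketch (the function name and toy counts are my own illustration, not the project's code):

```python
from collections import Counter

def error_candidates(tokens, unidentified, max_count=2):
    """Keep unidentified words that occur at most max_count times."""
    counts = Counter(tokens)
    return sorted(w for w in unidentified if counts[w] <= max_count)

# "catched" is a genuine archaic form appearing 1,657 times, so it survives
# the frequency cut; the rare strings are flagged as likely errors.
tokens = ["catched"] * 1657 + ["wlliam", "sivler", "sivler"]
unidentified = {"catched", "wlliam", "sivler"}

print(error_candidates(tokens, unidentified))
# → ['sivler', 'wlliam']
```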

The early modern “long S” is the perfect example. In the early modern era, up to about 1820, it was entirely common to find the letter S represented by what we might think looks like a lower-case “f”. This is the “suck” vs “fuck” problem that the Google N-Grams viewer runs into, as a slew of esses are interpreted as efs. Viewing the result, one might be tempted to conclude that people had quite a potty mouth on them in the early nineteenth century, as can be seen in Figure 2. That may not even be wrong, but it would be unwise to conclude it from this particular evidence.

Figure 2: Google N-Gram results for "suck" and "fuck" in the early nineteenth century

When we look through many of the words on the unidentified list it becomes clear that the Long S is a substantial problem. We find examples of the following:
  • abufes
  • afcertained
  • assaffin
  • affaulting
  • affize 
Or, the other way around:
  • assair
  • assixed
  • assluent
  • asorethought
  • artisice 
By writing a Python program that changed the letter F to an S and vice versa, I was able to check whether making such a change produced a genuine English word. Doing so pointed me to several thousand possible typos. As I inspected the list further, I noticed other common errors, probably caused by the very high-contrast scans of the original documents, which often included missing parts of letters, difficult-to-read words, or little bits of dirt and smudges that made interpreting the marks more challenging.
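A minimal sketch of that swap, assuming a simple replace-all approach (not the original program); note it will miss words that contain both a genuine F and a misread long S:

```python
def swap_suggestions(word, dictionary):
    """Swap every F for S (and vice versa) and keep any dictionary hits."""
    candidates = {word.replace("f", "s"), word.replace("s", "f")}
    return {c for c in candidates if c in dictionary and c != word}

dictionary = {"abuses", "ascertained", "affair", "assize"}
for w in ["abufes", "afcertained", "assair", "affize"]:
    print(w, "->", swap_suggestions(w, dictionary))
# abufes -> {'abuses'}
# afcertained -> {'ascertained'}
# assair -> {'affair'}
# affize -> {'assize'}
```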

Some of the most obvious switches were:
  • F / S 
  • I / L 
  • U / N
  • C / E
  • A / O
  • S / Z
  • V / U 
Why these particular switches appeared again and again I'm not entirely sure. Some are easy to understand: the lower-case C and lower-case E are easy to mix up, especially when a fleck of dirt shows up in just the right spot on the scan. Others are a bit more difficult to explain, as with U and N, which we wouldn't expect an automated optical character recognition program to have trouble with, but which seem to have stumped the human transcribers repeatedly.

By running these seven sets of letters through the program and testing the results against the English dictionaries I was able to come up with 2,780 suggested corrections. If these are all correct, that simple switching would correct 9,503 typos in the OBO corpus. The results of these changes broken down by letter-pair can be seen in Figure 3.

Figure 3: The number of suggested corrections in the OBO corpus by switching letter pair combinations in misspelled words.
I say suggested corrections because in some cases the switch is wrong, or may be wrong. The English dictionaries missed “popery”, a common eighteenth-century term for Roman Catholicism, so the program suggested the unlikely “papery” as an alternative. In 86 cases the switching came up with two possible suggestions, both of which are English words, at least one of which is obviously incorrect. The unidentified word “faucy” could be “saucy” or “fancy”. It turns out it's saucy, referring to the behaviour of one Peter Dayley - that naughty boy.
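Extending the swap idea to all seven letter pairs, trying both a replace-all and each single-position change, reproduces exactly this kind of ambiguity. A hypothetical sketch (the helper names are mine, not the OBO project's):

```python
PAIRS = [("f", "s"), ("i", "l"), ("u", "n"), ("c", "e"),
         ("a", "o"), ("s", "z"), ("v", "u")]

def pair_suggestions(word, dictionary):
    """Collect dictionary words reachable by one letter-pair switch."""
    suggestions = set()
    for a, b in PAIRS:
        for x, y in ((a, b), (b, a)):
            # Replace every occurrence at once...
            whole = word.replace(x, y)
            if whole != word and whole in dictionary:
                suggestions.add(whole)
            # ...and each single occurrence on its own.
            for i, ch in enumerate(word):
                if ch == x:
                    single = word[:i] + y + word[i + 1:]
                    if single in dictionary:
                        suggestions.add(single)
    return suggestions

print(pair_suggestions("doughter", {"daughter"}))     # → {'daughter'}
print(pair_suggestions("faucy", {"saucy", "fancy"}))  # two candidates: F/S gives "saucy", U/N gives "fancy"
```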

This switcheroo method will not solve all problems. It cannot fix transposed letters, as with sivler and silver; Levenshtein distance is likely needed for that. It does nothing for missing letters, as in Wlliam. But it does take us well along the path to making some rather dramatic improvements with a very reasonable amount of effort, and could, I would argue, be an economical way to improve the accuracy of projects that have already been transcribed but suffer from accuracy issues. As with all great things in life, this algorithm still requires a human's careful eye, but at least it has pointed that eye in the right direction. And when you're looking at 51 million words of text, that's nine-tenths of the battle.
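For those two remaining cases, a standard dynamic-programming Levenshtein pass over the dictionary can generate suggestions. A sketch, not a tested production tool; note that plain Levenshtein counts a transposition as two edits, so "sivler" sits at distance 2 from "silver" (the Damerau variant would count it as one):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def nearest(word, dictionary, max_dist=1):
    """Dictionary words within max_dist edits of the unidentified word."""
    return sorted(w for w in dictionary if levenshtein(word, w) <= max_dist)

dictionary = {"silver", "william", "influence"}
print(nearest("wlliam", dictionary))               # → ['william'] (missing letter)
print(nearest("sivler", dictionary, max_dist=2))   # → ['silver'] (transposition)
```

Raising `max_dist` catches more errors but also produces more false suggestions, so a human eye is still needed on the output.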

If you're working on a project that could use some accuracy improvements, or have explored other ways of achieving similar results, I'd be very happy to hear from you.

Wednesday, April 4, 2012

Tricks for Transcribing High-Contrast Historical Reproductions

If you spend enough time as a historical researcher, you're bound to come across the black blob. The blob - also referred to by its more technical name: "those letters I can't make out because of the stupid contrast levels on the reproduction" - is far more common than many of us would like, especially in online databases containing copies of original historical materials. It may not be the fault of the digitizers; the problem may have first occurred decades ago when the source was transferred to microfilm or microfiche. Whatever the cause, it forces many a historian to squint and hypothesize about what lies behind. This post will provide a possible solution to the blob, using free software and straightforward techniques. It will not work in all cases, but it may conquer some blobs.

The above image is an unadulterated screenshot of a Vagrancy Removal Record from Middlesex County in the 1780s, found on the London Lives website. The original source contains lists of names of those forcibly removed from Middlesex County. We've clearly got an Elizabeth "Eliz" and a Joseph here, but the contrast on the image is too high to make out their surnames. London Lives does offer full transcripts of everything on the website; unfortunately, the transcribers were unable to decipher the names and left these particular entries incomplete. We too could pass them by, but if we are interested in what's underneath we can turn to a photo editing program to make an attempt.

This tutorial uses GIMP, a free, open-source image processing program not unlike Photoshop. Feel free to use the program with which you are most comfortable.

Step 1: Save the Original Image

I was using a Mac, so I took advantage of the handy screen capture feature (Cmd + Shift + 4), which allowed me to snag only the part of the image I was interested in correcting. Alternatively you could save the whole image by right-clicking it and using the "Save As" feature.

Step 2: Open the Image in an Image Processing Program

As mentioned above, if you do not already have an image processing program, try out GIMP. It is free after all.

Step 3: Adjust Brightness / Contrast

Open the "Brightness/Contrast" box located under the "Colors" menu. Increase the brightness and contrast. In this example I've increased brightness by 118 and contrast by 103. Play with the sliders to get a result that works best for your particular source; you may even find it works better if you decrease one or the other. If you max out the values and need even more brightness or contrast, click OK and re-open the same dialogue box to repeat the process. You should notice some of the black blob beginning to fade and reveal hints of what might be underneath, probably first near the edge of the blob. You may now have all the information you need to finish the transcription. If so, great. If not, keep reading.
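If you find yourself making this adjustment for many images, it can also be scripted, for example with Python's Pillow library. A sketch under stated assumptions: the tiny synthetic image below stands in for your saved screenshot, and the enhancement factors are values you would tune by eye:

```python
from PIL import Image, ImageEnhance

# Stand-in for a saved screenshot: two dark pixels (values 12 and 20)
# that are nearly indistinguishable to the eye.
img = Image.new("L", (2, 1))
img.putpixel((0, 0), 12)
img.putpixel((1, 0), 20)

# Factors above 1.0 increase the effect; tune them per source.
img = ImageEnhance.Brightness(img).enhance(6.0)
img = ImageEnhance.Contrast(img).enhance(2.0)

# The original 8-point spread between the two pixels is now far wider.
print(img.getpixel((0, 0)), img.getpixel((1, 0)))
```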


Step 4: Colorize

This feature is also located under the "Colors" menu. It will help us make the hidden letters pop out from amidst the shades of grey and black. Feel free to play with the sliders here to see if you can brighten up the results to the point where you are comfortable reading them. Sometimes I find it helps to decrease the "lightness" value while increasing the "saturation".

If you are still having trouble reading the words you can go back and repeat the process by again adjusting the contrast and brightness, and fiddling with the colours even more. If that doesn't work, you can move on to step 5.


Step 5: Trace What you Can See

For this step I like to use a USB tablet and pen, which lets me write to the screen in a fashion that's a bit more natural feeling than trying to draw using my mouse. If you don't have one you can do it with a mouse too. Choose the pen tool from the Toolbox and reduce the "Scale" of the brush to something appropriate for the size of the handwriting. Next, choose a nice bright colour that will stand out against the background colours you have chosen. Then, take your time and trace over whatever letters or bits of letters you can see.


As you can see from this example, we have been quite successful. What was once a "man Eliz" and a "ll Joseph" is quite obviously a "Hayman Eliz" and a "Hill Joseph". We did not get every part of every letter, but we did get enough new information to piece together the missing names.

This process may take a few minutes, but it can be worthwhile if your project depends upon deciphering the letters beyond the black blob. Unfortunately, it will not work in all cases. For this technique to work you need a black blob with some shading variation. Computers store images as a series of coloured pixels with values ranging from completely black to completely white, and many black blobs are actually very dark grey blobs that merely look black to the eye. If there are shades of grey in your blob, and those shades correspond with the hidden letters underneath, as is often the case, then this technique may help you peel back the black and find what you are looking for.

Happy transcribing.

Wednesday, August 17, 2011

How to Record a Presentation for the Web (Well)

By Adam Crymble

Few things are as ephemeral as speech. It is spoken, and it is gone. This is fine if you have just delivered the worst presentation of your life and want nothing more than to forget it. But, there are speeches worth saving. Research is global; not everyone who is interested in the speaker will be able to attend in person. Not everyone who will be interested is interested now – for example, a first year student may want to hear the presentation four years from now when she is working on her Master’s degree.


Academia has developed a solution to the ephemeral speech and it has become increasingly popular. The recorded lecture, often mistakenly referred to as a “podcast”, is a way of archiving what transpired at an event and making it available online. Many conferences and public lecture organizers are adopting this idea to increase the reach of their event to those outside the immediate room in which the presentation occurs.


However, while the solution is in place, the skills needed to enact it well often are not. The recording process is frequently an afterthought, thrown into the hands of an inexperienced graduate student or an already taxed session chair. The recorder is left fumbling with a device they have likely never used, hoping desperately to get it right on the first try.


Predictably, the results are usually poor. Even comparatively good examples often suffer from low-quality audio. Frequently, listeners will feel the recording lacks context and they will be frustrated if the speaker refers to slides that have not been included with the recording.


All this can be avoided with a little planning and practice, ensuring your recorded presentations do justice to your speakers and your listeners.


Listen to or Watch a Good Example


Start with the best. No one has better online presentations than TED. “TED Talks” are live presentations by passionate speakers that have been recorded and posted to the TED.com website. They have become an Internet sensation, and anyone considering archiving a speech should watch at least one TED Talk. I am not suggesting you do as TED did and hire multiple professional cameramen, a director, and a sound editing team. What I am suggesting is that you follow TED's lead on the following key points.


Talk to your Speaker Beforehand


I do not mean simply get permission to record – although of course this is important and you should get permission in writing. Instead, I mean find out what type of presenter your speaker is. Do they use PowerPoint slides, and if so do they own the copyright or have permission to use all of the material? Do they wander around the room as they speak? Do they ask the audience to participate frequently?


By asking questions about the style of presentation the person intends to deliver, you can preemptively find solutions to problems before they arise. If your speaker tells you she likes to move around a lot during the presentation, use this advance knowledge to track down a wireless microphone that can clip onto her lapel. If your presenter plans to use a PowerPoint presentation with images that violate copyright, suggest he look into using Creative Commons-licensed images so that you can legally share his presentation.


Dedicate Someone


As soon as you decide to record the presentation, find someone whose sole job will be to handle the audio equipment and get him or her to practice. Days before the event, the recorder should know exactly how to use the recording equipment, what volume levels are suitable, and how close to the speaker the device will have to be. If the microphone must be clipped onto the speaker, the recorder should try the mic on a few locations on his or her own shirt to see how placement affects sound quality.


If the chair and the speaker are fairly far apart – more than a few feet – then be sure to check if the device will clearly pick up the chair’s voice. If it sounds like he or she is far away or “tinny” then consider getting a second recording device and record both people independently.


The audio testing should be done in the same room as that in which the presentation will take place, and if required, your recorder should make note of nearby power outlets to determine if an extension cord is needed.


By spending even one hour practicing and preparing, your recorder will be confident when the time comes for them to do their job.


When that time does arise, it is best to push the record button well before the presentation starts. The audio can always be edited later, but once a presentation starts – and often they start unexpectedly – what has been missed is gone. Make sure your recorder gets the speaker introduction, as well as the speech.

If the presenter is using slides, have your recorder note the time in the recording when the slides transition. This will make it much easier to combine the slides and the audio later.


The Context of the Room


A major complaint of listeners who access presentations online is the lack of context. When attending an event in person, you have the context of the physical space, the other people in the room, and even other presentations you have heard or plan to hear at the same event. When you listen online, this context disappears.


The chair of the session or the person introducing the speaker can provide this context, as long as they have been warned ahead of time. Most people in this position do their introduction the same way whether they are being recorded or not. That is, they speak only to those listeners in the room and often seem uncomfortable at the idea that people might be listening that they cannot see. Rather than address this virtual audience, they pretend it does not exist.


To get beyond this barrier, sit down briefly with your chair and give them some pointers on providing context to the online audience. One effective way of dealing with this problem is to have the chair acknowledge both audiences in the introduction. Thank everyone for coming, but also thank your online listeners. Provide a short blurb about the event and why you have gathered for it. The listeners in the room will recognize that your blurb is for the benefit of the online audience and will not be put off.


If you are recording multiple sessions with the same audience present, this can become repetitive and strange. In that case, record this context information later and it can be added to the start of your presentations in the editing stage. If you are not sure what context is missing, ask a colleague to listen to the recorded presentation; they will be able to tell you what needs to be added.


Question Period


Decide if you plan to include the question period in your recording. Often this means seeking the permission of everyone in the audience, though requirements vary by jurisdiction and university policy. The challenge with the question period is that it is often difficult to catch the questions on the recording device, particularly in a large room.


One solution is to require people who want to ask a question to go to a microphone. This can be obtrusive and adds to what is already a complex process, so you may decide to end your recording after the speaker finishes the formal presentation. By ending early, one tends to avoid the chair thanking everyone for coming and inviting them to head to the pub; the result is a more professional conclusion.


After the Fact


The work does not end when the recorder pushes the stop button. The audio will have to be edited. If your presenter used slides, ask for them and plan to create a “slidecast” that will pair the audio with the relevant slides. It is also a good idea to get a one- to two-paragraph abstract of the talk from the speaker, a one- to two-sentence bio of the presenter, and a half-dozen keywords that will allow online visitors to find the presentation. Search engines still cannot see inside an audio or video file, so you will need to provide enough information with the recording to let interested people find it.


Once you have received the slides and contextual information, you are ready to edit. This can be done by anyone and need not be the same person who made the recording. However, if you have more than one lecture it is a good idea to dedicate this job to one person. This will ensure that all of the recordings are consistent.


Editing the Audio


There are a number of good audio editing programs available. Audacity is a free, open-source program that you can use to edit the audio and adjust volume levels if needed. Mac users will find GarageBand, preinstalled on most new Macs, a useful tool for achieving the same.


If possible, try to avoid too much “dead air” at the beginning or at the end of a file. It is also a good idea to make sure you end the recording at a suitably calm point. Stopping abruptly in the middle of applause is less professional than fading down the volume or waiting for an appropriate break. MP3 is still the industry standard file format for audio, so if given the choice between formats, MP3 is a safe bet.


Adding the Slides


Often with online presentations if slides are available they will only be provided as a separate PowerPoint file available for download. This is better than nothing, but often it is not clear when the speaker transitioned slides and the listener must fumble to figure out which slide to look at. Because your recorder kept notes while listening to the speech, it should not be difficult to combine the slides and the audio into a video.


Again, Mac users should find iMovie installed on newer machines. This program makes it easy to drag slides and combine them with audio. If you do not have a Mac, SlideShare (http://www.slideshare.net), a website dedicated to sharing slides, now allows you to combine audio and slides, and to adjust timing all within your Internet browser window.


Share it


Once you have finished editing the presentation, you are ready to share it. Post it to your event website, department website, or to a video or audio sharing site. Make sure you let the presenter know it is available, and finally, promote the presentation as widely as possible. By promoting the recorded presentation, your conference or lecture can live on beyond the end of the live event and can continue to engage listeners for years to come.


Taking a few moments to plan and adding a little extra time editing will ensure the recorded presentation is almost as good as the original. Some presentations are worth saving, and those that are, are worth saving well.


Adam Crymble is the Webmaster for the Network in Canadian History & Environment, an organization that has archived over 150 academic presentations. Adam would like to thank Sean Kheraj for his comments on a draft of this article.

Monday, October 26, 2009

#apiworkshop Reflection: free the data

I recently attended Bill Turkel's Workshop on APIs for the Digital Humanities held in Toronto and had the pleasure of coming face to face with many of the people who have created the historical data repositories I have used so enthusiastically.

What I came away with was an even stronger conviction that data in humanities repositories should be completely liberated. By that, I mean given away in their entirety.

The mass digitizations that have occurred recently have provided a great first step for researchers. I no longer need to fly to London and sit in a dark room sifting through boxes to look at much of the material I seek. But, I'm still - generally - unable to do with it what I like.

Many repositories contain scanned documents which have been OCR'd so that they are full text searchable, but that OCR is not shared with the end user, rendering the entire thing useless for those wanting to do a text analysis or mash up the data with another repository's content.

Most databases require the user to trust the site's search mechanisms to query the entire database and return all relevant results. If I'm doing a study, I'd prefer to do so with all the material available. Without access to the entire database on my hard drive, I have no way of verifying that the search has returned what I sought.

Many of those at the workshop who administered repositories were willing and eager to email their data at the drop of a hat, but that is not yet the norm. Most of my past requests for data have been completely ignored. When it comes to scholarly data, possessiveness leads to obscurity.

As humanists become increasingly confident programmers, many will define research projects based on the accessibility of the sources. Those who are giving their data away will end up cited in books and journal articles. Those desperate to maintain control will get passed by. If someone asks you for your data, think of it as a compliment of your work, then say yes.