Monday, April 15, 2013

Trust Me: The Old Bailey Online as a model for digitization projects

The Old Bailey Online (OBO) turned 10 years old this week, and to celebrate, Sharon Howard has been encouraging blog posts and tweets from the project's wide network of contributors. I thought I'd add just a few brief thoughts on what I like about the OBO, and why I avoid so many other competing digitization projects. Rather than explain what the OBO is, I thought I'd save time and steal the explanation from their own website:
A fully searchable edition of the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London's central criminal court.
The trials run from 1674 to 1913, making it a great resource for social historians or historians of crime. I broadly fit into both of those categories, but what really interests me is knowledge management. I want to know how we can extract useful knowledge from bodies of text far larger than we could ever read in our lifetime. I'm interested in the historical research questions I pursue, but I'm more interested in the processes of understanding and discovery that pursuing those questions lets me explore. That is to say: I'm more interested in how we can know something than what we find out. This all means I have slightly different criteria for a good resource than does a typical historian. When I'm planning a project I'm not looking for 'gaps in the literature'. Instead, I'm really only looking for two things:
  1. A corpus of downloadable electronic text
  2. A corpus that does not assume I want to read anything
1) A Corpus of Electronic Text

At the moment my work is almost exclusively based on textual analysis. By that I mean I work with words rather than sounds or images or smells or physical objects. I want to know what human knowledge is contained in the symbols on pages. That means the best thing you can give me is a good clean set of electronic text. The Old Bailey Online does this beautifully - better than just about anyone else actually - by providing more than a hundred million words of transcription. Most important: the OBO is entirely downloadable. That means I can put it on my own computer and I can measure it, twist it around, write programs to analyse it, use other people's programs...anything I like. No one is going to threaten to sue me or press criminal charges for downloading the records. And best of all, once I have the records I don't have to read them, because that's not the focus of what I do.
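To give a flavour of what "measuring" a downloaded corpus can look like, here is a minimal Python sketch that strips the markup from an XML record and counts the words left behind. The sample record, its tag names, and the two helper functions are all made up for illustration; they are not the OBO's actual schema or anyone's official tooling.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def text_from_xml(xml_string):
    """Return the text inside an XML document with the markup removed."""
    root = ET.fromstring(xml_string)
    # itertext() walks every element; split/join collapses stray whitespace
    return " ".join(" ".join(root.itertext()).split())


def word_counts(text):
    """Count lowercase word tokens made of letters and apostrophes."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Hypothetical sample record, not a real trial from the OBO
sample = (
    "<trial><p>The prisoner was indicted for stealing a watch.</p>"
    "<p>The watch was found.</p></trial>"
)

counts = word_counts(text_from_xml(sample))
print(counts.most_common(3))
```

Run over a whole folder of files instead of one string, the same two functions would give a rough vocabulary profile of the corpus without anyone reading a single trial.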

2) A Corpus That Does Not Assume I Want to Read Anything

I'm certainly not one to suggest reading is obsolete, or that historians should stop going to the archives. But I'm always disheartened to see new scholarly - usually commercial - databases come online that only allow reading. I'm talking about the ones that cost an arm and a leg to university libraries, let you keyword search, but then force you to read a scanned copy of the original while hiding the electronic text layer.

I find these projects infuriating, and would rather pretend they don't exist than struggle to find a research question that's appropriate for their limited interface. The thing that bothers me most about these gated resources is that the publishers who create them are implicitly saying: we don't trust you. They don't trust us because the only thing they possess that allows them to sell their product is the electronic text. That's the part of the project that cost the most and took the longest to create. They think if that starts floating around on the Internet they won't be able to make money anymore.

The OBO is different because it's non-commercial. The OBO trusts us and encourages anyone interested to use the records to explore human knowledge in any way they see fit. For some that means sitting down and reading from digital copies of the original source. For others like me, it means downloading the entire corpus and measuring the rates of transcription errors, or the impact of courtroom reporters on the vocabulary used in the records, or the pace of migration in eighteenth-century London.

The OBO and its team have trusted us. And from that has poured forth far more research about early modern crime in London than anyone ever could have imagined. Perhaps more research than we need. Meanwhile, researchers like myself continue to ignore the large commercial databases that lock up access to their resources, and hope intently that their publishers will learn from what is still the best online scholarly database I've worked with. We're starting to see steps forward from some (see the Library of Wales' Newspaper Collection for a good example), but overall there's room to improve.

Until we see a shift away from mandated reading, I'll stick to resources like the OBO. So happy birthday to the OBO and cheers to the project team for trusting us. I hope it's paid off.


Tod Robbins said...

Loved the post Adam! How do you access the data dump for London Lives? I noticed as well that the API menu link is only visible when you aren't logged in.

Adam Crymble said...

Hi Tod, I don't think they like to be called 'data dumps', but I'd suggest you contact the London Lives team directly. Bob Shoemaker, Tim Hitchcock, and/or Sharon Howard should be able to let you know what you need to know. I'm not actually a member of the project team so I can't offer any advice more specific than that.

Tod Robbins said...

Oh ha! I strictly meant it in the literal sense. Maybe I should have said CSV/JSON/XML?

Anyhow, thanks for the suggestion. I think your work on Old Bailey is great!
