Thoughts on Public & Digital History by Adam Crymble

Play with the big kids. End.

2014-08-30T13:07:00.002-04:00

This blog was started as part of a digital history class I took back in 2007. In the 7 years since I set it up, I've completed a master's degree in public history at the University of Western Ontario, and have handed in my PhD in history at King's College London.

I am one of a very early group of academics whose scholarly development is plain to see for future readers. For the rest of my life, my students will be able to go back and read what I thought when I was a master's or PhD student. That's never really happened before. I have no idea what my PhD supervisor was like when he was a student, because his experience was ephemeral and distinctly offline.

On Monday I'll be starting a new chapter, as a lecturer of digital history at the University of Hertfordshire. I'm very grateful for the opportunity and looking forward to it immensely. Since my graduate student days are now at an end, I think it is fitting that I close this blog today, and leave it where it can act as a record to those future students who feel so inclined to see how much brighter they are than I.

But before I go, and because I'm one of the lucky ones who has found my way into an academic job, I thought I'd reflect on the one thing I learned about succeeding as a postgraduate student:

Don't forget to play with the big kids.

All of the people I worried about when I was applying for jobs (the ones I was pretty sure were better than me) were the ones I saw confidently drinking a pint in the midst of a group of senior academics at the pub. The ones sitting at the table of fellow students didn't concern me.

Fellow students can be a great source of friendships. Perhaps even life-long friendships. It's wonderful to make friends your age and I would encourage everyone to do it. But in the short term, there's only so much career advice they can offer.

The student laughing and telling stories in the middle of a group of senior academics is learning how the academy works from the inside. He or she is picking up tips, learning what selection panels, peer reviewers, editors, examiners, and audience members are looking for. He or she is making connections with scholars at a range of schools who may become collaborators, or who may think of them when a promising student down the line is looking for a supervisor. They'll come to mind when a chapter in an edited collection needs writing, or a third speaker is needed for a panel. They'll have many experienced people to turn to that they can ask: 'would you mind reading this over and giving me your thoughts?' Or: 'What do you know about department X's teaching needs?' And they too are building friendships. Perhaps even life-long friendships.

Those relationships are not merely parasitic. They go both ways. The senior academics too learn from the student, who brings new ideas to the field, or renewed enthusiasm for old ideas. They can challenge the senior scholar to keep up to date with new methods, or to work together on projects of mutual interest. And they too are building friendships. Perhaps even life-long friendships.

Just a few days into my MA programme back in 2007, a man I'd never met named Roy Rosenzweig passed away. He had been the director of the Center for History & New Media at George Mason University. I first heard of Roy through a very thoughtful obituary written by Dan Cohen. In it, Dan commented on Roy's ability to bring people together:

I know of no one with as large an address book and as many friends as Roy. But he didn’t just collect these acquaintances superficially, for show or for his own career ends like so many people do on Facebook or LinkedIn. As his social histories of the United States also emphasize, he viewed every human being as a special resource who brings unique talents and ideas into the world, and he liked nothing more than to connect people with each other.

You may not feel that you're the type who can connect others, but you owe it to yourself to build your address book and your circle of friends. Don't let a gap in age between you and those more experienced than you keep you apart. Meet lots of people. Whenever you can. Learn from them. Listen to them. Teach them a thing or two.

Most of them are willing to talk to you, and if they're not, they're probably just jerks.

There will always be people worth knowing whom you have not yet met. So next time you're at the pub after a seminar or conference, grab your pint, walk over to the big kids' table, and say confidently, 'So, what did you think of the speaker?'

It might just land you a job, and make you a new friend.

Good luck. And thanks to all my readers over the years. It's been fun growing with you.

Learning Python with the Programming Historian

2014-08-09T03:40:00.000-04:00

For those humanists out there looking to learn Python to aid your research processes, the Programming Historian has a great set of lessons to get you started. The lessons are designed to teach you Python by doing the types of tasks historians might want to do. So instead of learning about managing an inventory of widgets (as is common in intro-to-programming books) you learn how to manage a set of historical sources.

The Programming Historian used to make it more obvious that these lessons were originally written sequentially, so that readers could build upon their skills slowly. It's not quite so obvious anymore because of the new way we've organised our table of contents. But for those of you interested in learning Python, or using it with students, I thought it would be helpful to post their original order here so that you can easily find your way through them.

Happy learning.

Your First Lesson

What to do if you get Stuck (Troubleshooting) William J. Turkel & Adam Crymble

Introduction to Python

The Python programming language is flexible and learners can get impressive results quickly. If you'd like to get a fairly solid grasp on the language, the lessons here should provide you with that grounding.

Introduction and Installation William J. Turkel & Adam Crymble
Mac - Python Installation William J. Turkel & Adam Crymble
Windows - Python Installation William J. Turkel & Adam Crymble
Linux - Python Installation William J. Turkel & Adam Crymble
.
Working with Text Files William J. Turkel & Adam Crymble
Code Reuse and Modularity William J. Turkel & Adam Crymble
.
Viewing HTML Files William J. Turkel & Adam Crymble
Working With Web Pages William J. Turkel & Adam Crymble
.
Manipulating Strings in Python William J. Turkel & Adam Crymble
From HTML to List of Words (part 1) William J. Turkel & Adam Crymble
From HTML to List of Words (part 2) William J. Turkel & Adam Crymble
.
Normalizing Data William J. Turkel & Adam Crymble
Counting Frequencies William J. Turkel & Adam Crymble
.
Creating and Viewing HTML Files with Python William J. Turkel & Adam Crymble
Output Data as an HTML File William J. Turkel & Adam Crymble
Keywords in Context (Using n-grams) William J. Turkel & Adam Crymble
Output Keywords in Context in HTML File William J. Turkel & Adam Crymble
.
Downloading Multiple Records Using Query Strings Adam Crymble
.
Installing Python Modules using Pip Fred Gibbs
Introduction to Beautiful Soup Jeri Wieringa
.
Transliterating non-ASCII Characters with Python Seth Bernstein

---

You may also like to supplement your learning with other tutorials. I found Mark Lutz, 'Learning Python' (O'Reilly) very useful. My co-editor, Fred Gibbs, is a big fan of Codecademy. Use whatever combination works for you. Good luck.

How to Solve Programming Problems if you're learning Programming

2014-04-30T03:43:00.003-04:00

Things can and will go wrong when you first start dabbling in programming. As with all new skills, you are going to get frustrated. Here are a few tips that may help you work through the frustration and solve many of your own problems.

Language Learning

Do not forget you are effectively learning a new language, or even languages. You will not be fluent in twenty minutes. But you can expect to start building your skills to the point where you can complete simple tasks, progressing into more difficult situations. When you are presented with new code, it's important that you take the time to really understand it. If you whip through a tutorial or cut and paste code snippets you find online, you may end up with a program that runs, but you will not understand it, nor will you be able to generalize the skills and write your own programs. Once you've got something that works, try making a backup copy and then changing things in your code one at a time. If you can predict the effect that your changes will have, you have a good understanding of what is going on. When you are surprised by the effect of a change, you have an opportunity to learn something new. When you've got something that doesn't work, make a backup copy and then try eliminating things that you don't understand. What is the simplest version that you can get to work? Once you have that version, you can try adding in new code, one thing at a time.

Search Engines

If you run into difficulties when writing computer code the great news is that the answer to almost any problem can be found online. Computer programmers constantly seek and give help on forums and mailing lists, and the questions and subsequent answers are readily available. This means the Internet is usually your best resource for finding help. If you run into a problem, the first thing you should do is type your problem into a search engine. More often than not someone has already asked your exact question, and other people have provided a range of answers. You might even find entire websites dedicated to solving your particular problem. When you are just starting out, it is very unlikely that you will come across a problem that no one has encountered before. Likewise, if you encounter an error message that you don't understand, cut and paste that error message into a search engine and surround it with quotation marks. Typically you will find dozens of explanations for why this error appeared and how to fix it. The more specific that you can be about your problem, the better the results you will find. Don't be discouraged if you don’t find the answer on your first search. Rephrase the search terms and try again.
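As a rough illustration of the quotation-mark tip, here is a minimal Python sketch that wraps an error message in quotation marks and turns it into a search address. The function name `build_search_url`, the example error message, and the `www.example.com` address are all placeholders I've made up for illustration; swap in the address of whichever search engine you prefer.

```python
from urllib.parse import quote_plus


def build_search_url(error_message, base_url="https://www.example.com/search?q="):
    """Wrap an error message in quotation marks and build a search URL.

    base_url is a placeholder, not a real search engine address.
    """
    # The quotation marks ask the search engine for the exact phrase.
    quoted = '"' + error_message.strip() + '"'
    # quote_plus makes the text safe to put in a web address.
    return base_url + quote_plus(quoted)


url = build_search_url("NameError: name 'wordlist' is not defined")
print(url)
# https://www.example.com/search?q=%22NameError%3A+name+%27wordlist%27+is+not+defined%22
```

In practice you would simply paste the quoted message into the search box; the sketch only shows what "surround it with quotation marks" means once the message becomes a query.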

Forums

If you’ve Googled it, Yahoo’d it, and tried various combinations of teas, coffees and energy drinks to no avail, you're going to need to ask for help. There are many Internet forums and mailing lists to which you can turn. At the time of writing, my favourite forum for general programming questions is Stack Overflow. There is also Tutor, a mailing list where people who are learning Python can ask questions, and people who are interested in teaching Python can answer them. At any given time there are swarms of friendly, knowledgeable people just waiting to answer your question. If you post your problem in a courteous manner, with a little bit of luck you will have a solution within a couple of hours. Keep in mind that it is bad form to post a question in more than one place in quick succession. It may not be the instant gratification we’ve come to expect, but don't forget, these people are volunteering to help you, and most probably if you're desperate enough to ask for help, you could use a few hours away from the keyboard anyway. Go for a walk, take a nap, or do something else to clear your mind and some new ideas will come to you.

Asking Good Questions

Clarity and specificity are your friends when it comes to asking for help on a forum. The FAQ page on Stack Overflow's website is great reading for anyone looking to ask a question about programming online. Even if you do not use the Stack Overflow forums, the messages here are essential. Remember, the people who read forums and offer their expertise are busy; make it easy for them by carefully thinking out your problem before you ask. Likewise, make sure you are asking a specific question to a narrowly defined problem. For example, don’t post something like: "Why won't my code work?" Instead, try: "Why am I getting a syntax error when I try to Push a value into an Object?" Always post the relevant section of your code (and only the relevant section of your code) along with your question. If possible, remove any unnecessary bits that are not immediately relevant to the question to make it easier for experts to help you solve your problem. If the answer you get does not do the trick and you are still stuck, be polite and try rephrasing the question. Remember, don’t bite the hand that feeds you; these are volunteers and they’re trying to help you!
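To make "only the relevant section of your code" a little more concrete, here is a hedged sketch of what a cut-down example might look like before you post it. The function `minimal_example` and its word list are invented for illustration; the idea is that a few lines which reproduce the error, plus the exact error message, are usually all a volunteer needs.

```python
import traceback


def minimal_example():
    """The smallest piece of code that still triggers the problem."""
    words = ["london", "migration"]
    return words[2]  # deliberately asks for a third item that does not exist


try:
    minimal_example()
except IndexError:
    # Capture the full error report so it can be pasted into the question.
    error_text = traceback.format_exc()

# The last line of the report is the error message itself.
print(error_text.strip().splitlines()[-1])
# IndexError: list index out of range
```

Posting those few lines together with the error message gives readers something they can run themselves, which tends to get a faster and more accurate answer than a description alone.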

Debugging

When fixing problem code, systematically change one thing at a time and retry your program after each change. Often if you make three or four changes before retrying the program, you will solve one issue, but cause another one. This is frustrating and confusing. By changing one thing at a time and making sure it works before moving on, you will prevent a lot of confusion. It also helps to make notes of the things that you have tried, and of the solution when you find one. The more time that you spend programming and debugging, the more familiar various kinds of errors will become.
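One way to practise the "one change at a time" habit is to keep a single small check that you rerun after every edit, and to jot down what you tried. The sketch below is only an assumption about how you might organise this; `count_words`, `check` and `record` are hypothetical names, not part of any lesson.

```python
def count_words(text):
    """The function being repaired: count how often each word appears."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def check():
    """One small check to rerun after every single change."""
    expected = {"the": 2, "cat": 1, "and": 1, "hat": 1}
    return count_words("The cat and the hat") == expected


notes = []  # a running record of what was tried and what happened


def record(change, passed):
    """Note down one change and whether the check passed afterwards."""
    notes.append((change, "works" if passed else "still broken"))


# Make ONE change to count_words, rerun the check, then write it down.
record("added .lower() before splitting", check())

for change, outcome in notes:
    print(change, "->", outcome)
```

Because only one thing changes between runs of `check()`, you always know which edit fixed (or broke) the program, and the notes become a record you can return to the next time a similar error appears.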

---
This post was originally published as part of the Programming Historian, and was co-authored by William J. Turkel. It has been reposted here without permission because all work on the Programming Historian is licensed under a CC-BY license. Photo credit: Peter Alfred Hess

Does your online collection need an API?

2014-01-18T03:11:00.001-05:00

Crymble Awards, Best of 2013

2013-12-07T13:44:00.001-05:00

For this, the third year running (2011 & 2012), I've decided to acknowledge five projects that have most influenced my academic development in the past year. Winners have come up with ideas or shared their knowledge in a way that's made a real difference to the way I've approached my own work. This influence isn't always possible to measure by counting up citations in footnotes, but it's important to recognize.

Narrowing the list down to five projects each year is a challenge; there is so much great work going on that's worthy of praise. Nonetheless, I present to you my Crymble Award winners for 2013. Thank you for your inspiration.

1) Jorge Cham and Meg Rosenburg, 'Big Data + Old History' PhD Comics.

Best known for his academic comic series, PhD Comics, Jorge Cham challenged PhD students to describe their thesis in two minutes, with the promise that he'd animate the twelve best entries. I was fortunate enough to be selected as one of the winners, and I'm thrilled by how Cham and his colleague Meg Rosenburg transformed my words into an engaging two-minute cartoon. I've had great feedback from the video (I think a couple people even think I'm cool now), and it's certainly shown me that the written word is not the only way we can share stories about the past.

Thanks very much to both Jorge and Meg for including me in the project. It was a great experience and a lot of fun.

2) Adam Frost and Tobias Sturt, The Guardian Data Blog.

In March I attended a one-day masterclass hosted by The Guardian newspaper on data visualization and visual storytelling. My award goes to two of the presenters on the day: Adam Frost and Tobias Sturt, both of whom worked at the time for the Guardian's Digital Agency (for-hire Guardian visualization experts). The pair hosted a great session on how the team at the Guardian Data Blog take raw data to the finished products which capture our imagination.

I'd definitely recommend the workshop - though I note they've raised the price from £99 to £250 since I attended. Not only did I see some great data visualization examples, but it got me thinking about the importance of answering the oft-ignored: who cares? As Frost noted that morning, without clarity and persuasion, data is just a spreadsheet. Visualization is about bringing data to life for an audience, and I'm grateful to Frost and Sturt for making that so clear for me. The graphs and visualizations I've been creating for my thesis have changed markedly as a result of their work, and I like to think it has been for the better.

3) Jelle van Lottum, 'Labour Migration and Economic Performance: London and the Randstad, c. 1600-1800', The Economic History Review. Vol. 64, No. 2 (2011): 531-570.

Van Lottum is a historian at the University of Birmingham, but this paper was part of his British Academy fellowship at the University of Cambridge a few years ago. I stumbled across Van Lottum's article when researching some background material for a paper I was preparing with some colleagues on lower-class migration into eighteenth-century London. It was a bit of a eureka moment for me, as I had been fumbling around in a new field, unsure even of what I'd been looking to do, and this article provided me with exactly the type of framework I was after. Since reading this paper I've been drawn into an entirely different side of history, and read much more widely than I might ever have done, delving deeply into the work of some talented economic historians and historical demographers. I've found these new fields a wonderful complement to my interest in social history, and I owe that in part to what turned out to be a great article by van Lottum.

4) Anne Alexander, Social Media Knowledge Exchange (SMKE).

Anne wins an award for standing up for students' fiscal needs. SMKE was a year-long project that invited students to pitch an idea related to social media and academia. Winners were offered a £500 budget and £500 for themselves. As someone funding my own education, this was an incredibly important opportunity for me. There are so many organizations out there offering funding to pay for conference travel, or for student-run initiatives to pay to bring in senior speakers, or even for expenses to go visit libraries. And yet I get the sense that there's a desperate attempt to ensure students aren't trusted with any money they might use to live on. Those who consider themselves older and wiser and who control the purse-strings of granting agencies of all sizes seem convinced that any money they give directly to students will go straight to the pub. I spent mine on my tuition bill. And I thank Anne heartily for giving me that opportunity by taking a stand and putting value on student work.

I incorporated Anne's idea into a workshop I hosted last month, passing on the bulk of the funding I had to early career students who gave talks at the event. I'd challenge others to do the same. Don't reimburse travel for students; offer grants or honorariums to students for participating, and empower them to make their own decisions about how they get there or where they stay. Thanks to Anne for showing me that.

5) Shoaib Sufi, Neil Chue Hong, Aleksandra Pawlik, et al. Software Sustainability Institute (SSI).

This year's final award goes to the team at the Software Sustainability Institute, based at the University of Edinburgh. Until I saw the call put out by the SSI looking for fellows late last year, I had never even given a thought to the idea of software sustainability. Since applying (and winning!), sustainability has been a major part of my strategy for all of my work. I was introduced to a wonderful network of people in fields ranging from engineering and physics, to computer science and geography, all of whom are struggling to ensure their work is useable for the long term. As a society we put so much time and money into building tools and programmes to assist our research, but so little into planning for the future of that work.

As part of my fellowship I was given £3,000 of funding, which I used to host an event, Sustainable History: Ensuring Today's Digital History Lasts. The event was held at the Institute of Historical Research in London (cohosted by Jane Winters and Tim Hitchcock) and brought together a great group of scholars and information professionals to discuss what historians can and should be doing to ensure their projects and their data survive. It's a little question, but one I'm glad the team at the SSI challenged me to think about.

* * *

Congratulations to this year's winners, but more importantly, thank you to all of them for shaping the way I approach my research. You join a very talented group of previous winners. Keep on inspiring!

Winners for 2012 and current affiliations:

Julia Flanders (Northeastern University)
Luke Blaxill (University of Oxford)
Peter King (University of Leicester)
Andrew Marr (BBC)
Fred Gibbs (University of New Mexico) & Miriam Posner (UCLA)

Winners for 2011 and current affiliations:

Tim Hitchcock (University of Sussex) & William J. Turkel (Western University)
Tim Sherratt (National Library of Australia)
Ben Schmidt (Northeastern University)
Sean Kheraj (York University)
Jeremy Boggs (University of Virginia)

As a final aside, eight of the twelve previous winners have moved on to new institutions and more impressive positions since winning. Can we thank the Crymble Awards for tipping their applications over the threshold? I suppose we may never know...

Is Creative Commons Flexible Enough for Historians?

2013-10-29T09:04:00.000-04:00

Gumby and Monkey, by Joe (CC-BY-SA)

Creative Commons licenses are incredibly useful. They're easy to use. More and more people understand them. It's even possible to do web searches of Creative Commons content, making it easy to find content you can use with confidence. The Open Access movement, particularly in the UK, seems to be promoting Creative Commons licensing as the best way to move towards open access to research, because it means we can (largely) leave lawyers out of it all and implement a standard set of licenses that everyone understands (or should understand). I see the practical merits in that and am a big fan of keeping costs at a minimum. But I also see the counterpoint, that many historians feel Creative Commons just isn't designed for them (see my previous post on Alternative Licensing). Sometimes that feeling is based on a misunderstanding. Sometimes, I think, it's justified. In the interest of opening that discussion, I thought I'd present a couple of scenarios in which I believe Creative Commons is not flexible enough for historians looking to manage the rights associated with their research.

For all of these scenarios, let's assume the work in question is an academic monograph written solely by me.

1) Supporting certain derivations

What I want: I'd like people to be able to translate my book into a range of formats (braille, French, audio, stage performance) without having to ask me, provided that every effort is made to ensure that the translation accurately represents the arguments and positions of the original, and the translator is listed as such on the title page or where applicable. This reuse is only permitted if the entire work is included in the translation.

Why this is important to me: I'm a big supporter of accessibility; I wouldn't want anyone working to provide access to my work for the blind to feel they were prevented from doing that good work by a legal restriction.

Why CC is not sufficient: CC-BY would allow this type of reuse. But it would also allow someone to translate only the introduction, or to pick and choose parts and rearrange them in a way that changes my message. I'm worried if they do that someone might get the wrong idea about my work. You may not think that's important, but it's my book and my reputation, and I am worried. I could use a 'no derivatives' license, CC-BY-ND, but I do want to allow certain types of derivatives under certain conditions.

2) Supporting certain commercial reuses

What I want: I'd like professors creating course readers to feel empowered to use parts of my book with their students. I'd also like private individuals to be able to use individual chapters in edited collections with modest print runs (let's say fewer than 500). I don't want Evil Publishing Ltd to be able to do the same without asking.

Why this is important to me: I'm a big supporter of ensuring students and my colleagues have access to my work. I also think it's important to support small entrepreneurs. But I know that the publishing industry is big business, and if they're going to make big money from my ideas, I think it's fair to ask that I get a cut of that. Anyone who has ever licensed stock imagery to use on a website or in print knows that the price of the license changes with the number of 'impressions'. In essence, the bigger the advertising campaign, the more money they want to charge you to use the image. This merely attempts to apply those types of restrictions on my book.

Why CC is not sufficient: CC-BY wouldn't give me the power to put the restrictions on Evil Publishing Ltd that I believe are important. Forcing me to use CC in this instance forces me to give away rights I would like to hold onto.

* * *

Those are just a couple of simple examples, which I don't believe are far-fetched when considering licensing and reuse from the historian's perspective. For them to work, I think at the very least we need to adopt a CC-BUT license, in which creators are allowed to add restrictions to their license. As I said before, if the concerns of licensors aren't met, they won't get on board. I'd like them on board, but that may need to come at the expense of what seems on the surface to be a simple CC solution.

Sustainable History: Ensuring today's digital history survives

2013-10-28T05:44:00.002-04:00

28 November 2013
10am-3:30pm
Institute of Historical Research,
Malet St, London
WC1E 7HU
Register: https://www.eventbrite.co.uk/event/8989595121

How long will our digital research survive? Historical scholarship is increasingly digital; and yet we do not have an agreed set of best practices for ensuring that digital scholarship lasts. Speakers at this one-day workshop will share practical advice on a range of pressing issues for historians and cultural heritage professionals working with digital material. From ensuring research data is archived safely, to building sustainable strategies into your project workflows, and even learning from the mistakes of others, this event promises practical solutions for big challenges facing digital scholarship.

Registration is free, but spaces are limited (Register here: https://www.eventbrite.co.uk/event/8989595121).

Sponsored by the Software Sustainability Institute, the Institute of Historical Research, the Programming Historian 2, and The AHRC Theme Leader Fellowship for Digital Transformations.

Programme

Registration / Welcome 9:30-10:15am
Keynote Addresses 10:15-11am

Professor Andrew Prescott (Dept. of Digital Humanities, King’s College London)
Neil Grindley (JISC)

Tea / Coffee 11am-11:15am
Session 1: Preserving Resources 11:15am-12:15pm

Dr. James Baker (British Library) ‘Preserving Research Data for the Future’
Jennifer Doyle (King’s College London) ‘Working with Cultural Heritage to Open Research Data’

Lunch 12:15pm-1pm

Session 2: Working Together for the Long-term 1pm-2:30pm

Dr. Gethin Rees (University of Cambridge) ‘Capturing and Documenting Workflows for Historical Scholarship’
Mia Ridge (Open University) ‘Sustaining Collaboration from Afar’
Claire Donaghue (Imperial College London) ‘Strategies for Working Together on Large Projects’

Tea / Coffee 2:30-2:45

Keynote Address and Open Discussion 2:45-3:30

Dr. Peter Webster (British Library)

Academic Freedom License: An Alternative to CC-BY

2013-10-24T11:13:00.002-04:00

Professor Peter Mandler, President of the Royal Historical Society, allegedly made this comment today at an Open Access event held in London. I was not at the event, but I have heard this concern expressed before: CC-BY licenses allow someone to take an academic work, completely twist the words of the author, and republish it in a way that suggests those are the opinions of the author (either intentionally or through ignorance).

The fear is certainly valid, whether you agree with the interpretation of the license or not. No academic would be happy with the idea of someone twisting their words and republishing something that, if misconstrued, could damage their reputation as a scholar.

I'm inclined to suggest that a CC-BY license does not in fact grant these rights, as the fine print about 'moral rights' points out, noting that 'derogatory treatment' of the licensor's work is not permitted.

Nevertheless, the terms of the license do suggest it is up to the licensor to monitor and police this activity, and if necessary, turn to the courts to enforce it. That's just not practical for a busy academic.

Remixing isn't the only problem. Copyright of images or graphs can also be an issue. Anyone who gives a public lecture these days will be familiar with the release forms that you're asked to sign that require you to grant someone the right to reproduce images and graphs you don't own that happen to be on your PowerPoint slides. Academic monographs have the same problem. How can we release our content as open access if the work contains someone else's work for which we have had to ask permission?

If I'm not mistaken, these two issues are the biggest objections to CC-BY licenses for the humanities and social sciences. Thankfully, Professor Mandler has offered another solution, and I'm all for solutions:

New License needed for HSS (Humanities and Social Sciences)

What a fabulous idea. What on earth are we waiting for? I present to you all for consultation: the Academic Freedom License, designed specifically with the needs of academics in mind, that both promotes open access and reuse, and prevents the types of abuses outlined above.

Academic Freedom License

For works released under an 'Academic Freedom License', you are granted the right:

To Share - to copy, distribute and transmit the work in its entirety only.
To Analyse - to data mine and study the work and publish or create work of your own based on that analysis.
To Sell - to make commercial use of the work in its entirety only.

Under the following conditions:

Attribution - You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work)

Excluding - You are prohibited from sharing, analysing, or selling any aspects of the work specified by the author or licensor (such as images under copyright or sections not produced by the author)

With the understanding that:

Waiver - Any of the above conditions can be waived if you get permission from the copyright holder.

Public Domain - Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.

Other Rights - In no way are any of the following rights affected by the license:

Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
The author's moral rights;
Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.

Would you buy a product to support digital humanities?

2013-10-10T03:32:00.000-04:00

I'm launching an experiment, and I'd love for you to be involved. My PhD funding has just run out and I've been given a final £350 tuition bill during what's known as my 'writing up' period. In my search for solutions to cover this cost, I've found dozens of small grants that will pay for me to buy train tickets or hotel stays for research trips or conferences. I've found dozens more that will let me buy train tickets or hotel stays for others to come to a conference I'd organize. I can even get money to buy equipment for my research projects. But no one will give me money to pay my rather modest fees.

So I've decided to be creative. Crowdfunding has become rather trendy lately. Sites such as Kickstarter are even being taught as part of digital humanities courses, suggesting those of us in the field need to get out there and convince the public to part with some money in support of the research we do. Shawn Graham at Carleton University is now using this idea to raise money for an Undergraduate Scholarship in digital history, with funds to be matched by his university if he meets a certain threshold, and I wish him the best with what I consider a great initiative. But I know the marketplace can only handle so many campaigns that take the same form. So I've decided to go another route and ask: would you buy a product if you knew the proceeds went to support digital humanities? Or more specifically: helped to pay my tuition fees?

So I've teamed up with Cafepress, and designed some digital humanities schwag to tempt you into my experiment. I've focused my product line on three key areas for the digital humanities:

Bags and Electronics - Your electronics never looked so digital humanities
Baby Clothes - Your baby makes digital humanities look good
Mugs and Water Bottles - Support digital humanities while you drink

All of my profits will go directly towards my tuition fees. And in the interest of this experiment, I'll report back on progress at the end of 2013 when these limited edition products will disappear FOREVER! Are baby clothes the key to the future of digital humanities? We'll soon find out.

I thank you most humbly for your support.
Edited note: It has been wisely pointed out to me that not everyone needs baby clothes or more 'stuff'. If you'd like to contribute directly, I've set up a link through Paypal where you can do so. Thanks again.

Digital Humanities Comic 'Big Data + Old History'

2013-09-09T04:23:00.000-04:00

You used to submit an abstract to a conference to share your findings. Now you ‘Dance your Thesis’ or compete to convince a world-class cartoonist to animate your research and turn it into a video. The modes of disseminating research have broadened in the past decade, with students in particular being offered a range of new contests designed to get them thinking creatively about engaging the public with academic research.

Jorge Cham, the internationally renowned animator behind ‘PhD Comics’, asked students ‘can you describe your thesis in two minutes?’ Cham then chose the best descriptions and turned them into animated cartoons. I'm very pleased to announce my entry was one of the winners, and the animated video of my thesis has just been released:

My two-minute talk focused on how distant reading has been central to my PhD research. There's only so much detail you can fit into a two-minute talk, but I hope it has been able to introduce the idea of distant reading to a much wider audience and that some of them might take the step to learn more. It's been a great experience, and I'd like to thank Jorge and his team for creating this opportunity. And since getting selected as one of the winners was partially down to voting from the public, I'd also like to thank everyone who took a moment last year to vote for my entry. The response has been wonderful. So thanks again.

I hope you enjoy the result.

Applications open for Five Solutions: Digital Sustainability for Historians

2013-08-31T12:54:00.001-04:00

APPLICATIONS OPEN!

Five Solutions to What?

Historical scholarship is increasingly digital; and yet we do not have an agreed form of best practices for ensuring that digital scholarship lasts. Five Solutions is looking for five scholars able to outline a solution to the issues of sustainability now facing historians. This one day workshop asks participants to give a 15 minute presentation outlining practical solutions to one of five challenges, with the resources and expertise of an ordinary working historian in mind. These presentations will form the basis for a one day workshop on practical strategies for digital sustainability. The presentations can be based on your own experience and ideas, or can be taken on as a research project. We will work with all participants to ensure that the final presentations are both technically workable and illustrated with the most appropriate datasets.

Accepted participants will each receive a £350 honorarium.*

The Five Themes

The following five themes are designed to get you started, but if you have other ideas, we’d love to hear about them. Each theme should be approached with the ordinary working historian in mind.

1. Preserving research data for the future

2. Curating an enduring professional online persona

3. Paying project costs after the money runs out

4. Capturing and documenting the expertise of temporary staff

5. Strategies for working together on larger projects

Who Should Apply?

We’re looking for people with passion. Scholars old or young, university students of any level, librarians, archivists, developers, designers, system administrators, or anyone who considers themselves a historian at heart. No specific qualifications or prior experience required - just an interest in helping academia find solutions to organizational and technological challenges facing the sustainability of our digital projects.

What do I have to do?

Figure out a solution, of course! Once you’ve come up with your solution, you’ll share your work in two ways:

1. A 15-minute presentation of your solution at a one-day conference in London, UK on the 28th of November 2013 at the Institute of Historical Research.

2. A 1500-2000 word peer-reviewed tutorial outlining your solution to be published in the spring of 2014 in the Programming Historian 2 and distributed as part of ‘IHR Digital’.

All tutorials will be peer-reviewed and released under a Creative Commons CC-BY license. Participants will have the full support of an editor at the Programming Historian 2 who will provide guidance for writing an effective, practical tutorial.

Evidence of previous work with technical writing or a willingness to learn, as well as a strong command of the English language are a bonus.

How do I apply?

By 8 October 2013 send a two-page C.V. and a brief email to adam.crymble@kcl.ac.uk (subject line: Five Solutions) addressing the following questions:

1. What theme would you like to tackle? (Use one of our suggestions or come up with your own.)

2. Give us an idea of how you plan to solve this issue, or where you intend to look for a solution (max 200 words)

3. What skills or experiences make you the ideal person for the task?

We apologize in advance, but we are limited to five scholars.

* Our funding restrictions allow honorariums for UK-based participants only, though we are happy to receive applications from those abroad who have access to their own travel funding and who would like to participate.

Project Support By

And by the AHRC Theme Leader Fellowship for its Digital Transformations Theme.

Can We Reconstruct a Text from a Wordcloud?

2013-08-05T04:55:00.002-04:00

We’ve all seen Word Clouds. Many of us have even wondered if they’re of any value. I have used word clouds in the past; I find them useful in presentations when I want to highlight the relative importance of certain words over others. For example, I often use this word cloud to the left, to show the most common Irish surnames in the London area during the early 19th century. I hope my listeners will note that Murphy or Sullivan is more common than Burke or Foley, without me having to take the time to explain the connection between word-size and significance.

I’ve also used word clouds in analysis. In a previous post I discussed how I was able to use the below word cloud to show the relative frequency of topics found in the Gentleman’s Magazine between 1800 and 1820, which allowed me to get a pretty good idea of what the gentry and the middle class were interested in during that period.

I think both of those uses for word clouds have been productive. They’ve allowed me to transmit ideas, and formulate my own thoughts on a set of data in an effective manner. But I began to think about other uses, and I began to wonder about the process of getting back to the original data. Word clouds take the individual words (tokens) out of context. As I mentioned in my last post, we think in metaphors, or ideas. Not in words. That means a word cloud reduces a single idea such as “green bowl” into two tokens “green” and “bowl”. It then combines the word “green” into a single graphic based on how often it appears in the text. The program does not take into consideration the fact that “green” as it refers to a bowl is entirely different than Mr. Green or Green Park. An article about Mr. Green’s picnic in Green Park with his favourite green bowl might give you a skewed idea about the importance of the word green, here representing three completely different ideas, and in all three cases simply acting as modifiers to more important concepts (a man, a park, and a bowl).
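To make the problem concrete, here is a minimal Python sketch of the counting step that sits behind most word clouds. The sample sentence is invented for illustration from the example above, and the simple regular-expression tokenizer is an assumption; real word cloud tools may split and normalise text differently.

```python
import re
from collections import Counter

# Invented sample sentence based on the Mr. Green example above.
text = "Mr. Green had a picnic in Green Park with his favourite green bowl."

# Lower-case the text and split it into word tokens, discarding punctuation.
tokens = re.findall(r"[a-z]+", text.lower())

# Count how often each token appears; a word cloud sizes words by this count.
counts = Counter(tokens)

# The man, the park and the bowl all end up in the same tally.
print(counts["green"])
```

The count for "green" is inflated because three unrelated ideas share one token, which is exactly the distortion described above.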

Just for fun, I decided to do a test. I asked 4 colleagues, all experts on the criminal trial transcripts of the Old Bailey Online, to look at a word cloud of a trial. Each person was asked to describe what key information they could tell me about the crime. I was interested in knowing if they could tell me the who, what, when, where, why type details, and if they could reconstruct the basic building blocks from the prevalence of certain keywords. In the spirit of exploration, I played along as well and offered my own interpretation.

The word cloud was created at random by my wife without my knowledge of the trial. All I knew (and all the participants knew) was that the trial took place between 1801 and 1820, and was between 2,000 and 3,000 words long. The word cloud was limited to 75 unique words and common English words were removed. The resultant visualization looks like this:

I have colour-coded my assessment (blue) for aspects I got correct and (red) for the bits I got wrong. What struck me immediately was that this was a case involving theft. That’s a safe bet anyways, since about 50% of Old Bailey trials during this period were theft cases. It was the large number of nouns that led me to this conclusion. I know that trial transcripts always list the items that were stolen, and the testimony in the trial almost always discusses the various objects repeatedly as several witnesses are called to give their account of what transpired. In this case, I’d assume there was a large quantity of alcohol that went missing, ranging from red wine, to port, to gin – stored in bottles, measured by gallons, and in at least one case: a cask.

At the time it was stolen the booze was being stored in a cellar before it was transferred to a cart that was being driven by a horse – a mare to be specific. Why Restoration actress Nell Gwyn appears in the set of words, I have no idea since she died over a century before this era, unless the name is a coincidence or perhaps refers to the name of a pub that lost its liquor.

There were a large number of people involved with giving evidence against the defendant including Messrs Hutt, Wells, Powell, Wood and Bagnigge, as well as possibly a Mr. Limbrick, and definitely someone named Hart – though again that may be the name of the pub. One of those men is likely the watchman and another an officer. Based on what I know about Old Bailey trials, this suggests there were a lot of witnesses, meaning the prosecutor was concerned that his case may not have succeeded.

The alcohol heist took place in the morning (and was perhaps discovered the following night), and the goods were then transferred down either Maiden-lane or City-road. The volume of goods stolen, and the fact that death appears in the list, suggest our defendant was found guilty and sentenced to death.

My conclusion: a pub named either Hart or Nell Gwyn located on City-road or Maiden-lane was robbed of a large volume of alcohol by a solo male defendant, who was found guilty and sentenced to death.

Analysis

You can compare my assessment with the full trial transcript. It turns out I wasn’t that far off. There had been three defendants, but they were found guilty and sentenced to death. It had been a large alcohol theft. I wasn’t able to accurately pick out the fact that Messrs Wood, Powell, and Hart were the defendants, meaning the who aspect of the challenge had completely eluded me. I also didn’t recognize Bagnigge Wells, which was the location of the crime, not someone’s name. Nell Gwyn in this case was the pub that had its door pried open to reach the alcohol.

Participant 1

“Powell and Hutt were found guilty of breaking into the wine cellar of the Nell Gywnn inn in Bagnigge Wells (I know about this place) They were accused of stealing two gallons of wine, casks of gin and bottles of port from the cellar belonging to Mr Hart . A Watchmen Mr Wood on his round at 1 o'clock saw a broken iron (lock)on the wine cellar door and saw two men drive off in a horse and cart down Maiden lane towards City Road and immediately called for an officer. The men were stopped by an officer who examined the cart and found the cask of wine, gin and port belonging to Mr Hart and arrested them. They received the death penalty.”

Participant 2

“I would guess it involves stealing a hamper of goods including a bottle of wine from the back of a cart. My suspicion is that the defendants were two women, and that the servant of the person who owned the cart/hamper discovered the theft and was called in evidence, though the actual owner wasn't. The cart was on its way to or from northwest London to banigge wells, for a social occasion and some visiting, it happened at night (suggesting they were returning home) and there was a runner or 'officer' involved in the arrest.”

Analysis

In these examples, the participants tried to be very specific about the details of the transcript, and in doing so were incorrect more often than not. The basics of the case, including the location of the crime, were, however, correct for Participant 1. This would suggest that expert readers are able to get some of the basics – though by no means all. However, that expertise does little to bring forth the specifics of the trial.

Participant 3

“This trial seems to have a richer vocabulary than most. It looks like a theft case from a wine cellar (wine, port, cask, gin, gallons, hampers, bottles, etc. suggest as much), presumably at Bagnigge Wells; actually the prominence of the word cart, and also the word horse, suggest the material might have been in the process or coming or going there, perhaps parked on the City Road (or Maiden Lane). Went suggests action. Some force was used: broken, crow, saw, which suggests that the cart was broken into. There is a certain amount of vocabulary indicating how the culprit might have been apprehended: officer, stopped, examined, watchman, observed, charge—this suggests that officers were used to apprehend the defendant. There are some names: Powell, Gwyn, Hart, Mr, Nell, William. The most frequently mentioned, Hart, is presumably the victim. Timing is important: o'clock, morning, night—the prominence of night suggests that that is when the crime took place, with the suspect arrested in the morning? Numbers indicate either the number of items stolen or the time of day (one or two in the morning).”

Analysis

This one came out surprisingly accurate. The participant hadn’t recognized Nell Gwyn as the name of the actress. As in my own case, it proved impossible to figure out who was the defendant and who was the victim. However, the details of the crime and the process of apprehending the defendant are almost bang on. This participant didn’t try to reconstruct the narrative in the same way as Participant 1 or 2, and thus avoided many of the pitfalls. However, there has been no guess at the verdict, and while the basics of the trial are here, the richness of what actually is recorded in the transcript is nearly entirely lost.

Participant 4

“Geographical location

Wine Cellar –property crime

Bagnigge Wells – recognise this as the location of a rather seedy spa/pleasure grounds on the outskirts of Clerkenwell towards Kings Cross.

City Road – not too far away runs from the City to Islington

Maiden Lane – small street that runs parallel to Covent Garden Piazza on the south-side although there might be other Maiden Lanes

Time –mention of time o’clock and watchman (usually worked only at night) and night although mentions morning (in the or the next)

Crime probably burglary because of the mention of the relevance of time together with watchman

Stolen Goods: bottles of red wine, port, cask, gin from wine cellar

Broken into cellar with an iron bar, maybe crow bar

Probably took it away or hid it in a cart, certainly it is central to the plot and perhaps discovery of the stolen goods. There may well have been a significant amount of drink gallons, casks and bottles? Hampers – might have been used to transport/hide the goods

Arrested – yes (examined) and officer (probably from one of the Police Courts

Observed - spotted

Names Hart – think this is a personal name rather than pub sign and possibly Wood as it’s mentioned frequently but not as much as the things I think were actually stolen

Gwyn –forename Welsh? Ha – just spotted nell (same size) and I know that Nell Gywn was supposed to have lived at Banigge Wells. John, William, think Hutt is a personal name (not a shed)

Death –punishment (so guilty)”

Analysis

This one was particularly interesting, because the participant included their thought process as it related to the various words on the visualization. The fact that it was clearly a theft of goods, and that time was mentioned, led this person to conclude we were dealing with a burglary, which was correct. The conclusion that Hart was the name of a person rather than a pub was correct, but equally could have been wrong. While Nell Gwyn may have lived at Bagnigge Wells, that wasn’t really relevant to this case.

Conclusions

Can an expert on a historical source reconstruct the details of that source from a word cloud? It would seem the answer is: sort of. Of the five participants, two (#3 and #4) did an incredibly good job of getting some of the details. These people were able to reason out what the words meant by drawing upon their experience with the ways certain types of words were used in criminal trial transcripts. However, in both cases they were light on the details and, it would appear, decided not to guess on elements of the crime that they couldn’t be confident in. That is, they spoke confidently when they were confident but otherwise stayed silent.

I think my analysis fell in the middle. I got lots of bits right, but I was also wrong just about as often. I was disappointed that I couldn’t pick out the roles people played in the trial from the word cloud. There was no way to guess who was the defendant, who was the victim, and without recognizing the names, who were the officers. I wasn’t even able to guess how many defendants were on trial. On the other hand, I did guess the type of crime, the verdict, and a few details and circumstances surrounding the arrest. Having said that, I can’t say I was confident in all of my conclusions and was guessing.

And finally, two of the participants were way off (#1 and #2). These two attempted to reconstruct the narrative of what had happened, providing a level of detail that involved a good deal of guesswork.

What does this mean? I think it shows that when faced with a simplified visualization such as a word cloud, the process of getting back to the original is fraught with a level of guesswork. However, an expert in the source material can, with reasonable accuracy, reconstruct some of the more basic details of what’s going on.

How do we move forward? Well, as I pointed out earlier in this article, I think the secret is in moving away from the idea that tokens transmit ideas. Ideas and metaphors transmit ideas, and it would be far more useful to have an idea or concept cloud than one that focuses on individual tokens. But I also think it’s time that those ideas were linked back to the original data points, so that people interpreting the word clouds can test their assumptions. We are ready to see the distance between the underlying data and the visualization contracted. We’re ready to see the proof embedded in the graph. And I hope we continue to see a development in this trend.
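One possible way to keep that link is a keyword-in-context index, which stores the neighbouring words every time a token appears. The Python sketch below is only an illustration of the idea: the `keyword_in_context` function, its `window` parameter and the sample sentence are hypothetical and are not taken from any existing word cloud tool.

```python
import re
from collections import defaultdict

def keyword_in_context(text, window=2):
    """Map each token to the snippets of surrounding text in which it appears."""
    tokens = re.findall(r"[a-z]+", text.lower())
    contexts = defaultdict(list)
    for i, token in enumerate(tokens):
        # Keep up to `window` tokens on either side of the keyword.
        start = max(0, i - window)
        snippet = " ".join(tokens[start:i + window + 1])
        contexts[token].append(snippet)
    return contexts

# Invented sample sentence, reusing the Mr. Green example from earlier.
sample = "Mr. Green had a picnic in Green Park with his favourite green bowl."
index = keyword_in_context(sample)

# Show every context behind the single word "green".
for snippet in index["green"]:
    print(snippet)
```

A viewer looking at "green" in the cloud could then be shown each snippet and check whether the word refers to a man, a park, or a bowl before drawing conclusions.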

Thanks to my participants, Janice Turner, Bob Shoemaker, Louise Falcini, and Tim Hitchcock.

Can you explain this graph to me? Peer Reviewing a Visualization

2013-08-01T05:27:00.001-04:00

"For sale: Mixing bowl set designed to please a cook".

That opening sentence contains 10 words, or "tokens" as linguists often call them. Yet either in its spoken or written form, it really only transmits 4 ideas, or what I imagine Marc Alexander would call "metaphors", which are concepts that go beyond the words but that express meaning and understanding. They allow us to think in chunks.

What?: For sale
What's for sale?: a mixing bowl set
What's it like?: designed to please
Please whom?: a cook

The same sentence represents an attempt to conjure a very measured set of thoughts in another person. I can't take credit for the sentence, but when the author wrote it down, they hoped that you, dear reader, would understand those 4 ideas in the same way as all the other readers, and as they themselves understood them. It's their attempt to control your mind temporarily by drawing upon your understanding and memories associated with those 4 ideas. We may not get all the details exactly the same. Your mixing bowl set may be blue. Mine is seafoam and has a spout on each bowl to make it easier to pour your batter into the baking tin. So we likely haven't had exactly the same understanding of the sentence, but our understandings are almost certainly within the limits of what's acceptable to the author.

If we add 2 more ideas to the end of the sentence we end up with a failed conjuration:

"For sale: Mixing bowl set designed to please a cook with a round bottom for efficient beating".

Because of the misplaced modifier, there are now two ways to understand these ideas. Does the bowl have a round bottom for efficient beating, or should the cook who will enjoy the bowl be so proportioned?

Visualizations can offer the same ambiguity.

Is this an image of a rabbit, or a duck?

In this case, it's both, and it's that very ambiguity that the artist intended us to understand. Not all visualizations are intended to teach us something specific, or to so carefully conjure a series of ideas in our minds. That's wholly too modernist for some. Visualizations can be exploratory, used by researchers to come to a different understanding of their data by slicing it in lots of ways until they see something interesting. Or, as I demonstrated in an earlier post, they can be a quick way to get a distant look at a large amount of data by reducing it to something easier to digest. In that sense graphing can aid the discovery process of research even before the conclusions are ready to be shared with the world.

But when it comes to visualizations for academic publication, unintentional ambiguity is something we must strive to avoid. If done well, there should only be one proper way of interpreting the visualization. It's our job to create something that can conjure specific thoughts in the reader's head based on the graph's shape, colour, size, orientation, etc. And it should go without saying that those conjured thoughts should be grounded in rigorous research.

As academics we spend so much time and care on our prose, and even our footnotes. Usually (we hope) that prose comes out lucid and if we're lucky, is enjoyable to read. One of the ways we ensure that is through peer review. The editors help us find people who are willing to take the time to read what we've written and provide constructive feedback upon it.

Yet few of us feel we have the aptitude to offer similar feedback on visualizations. We're not visual artists and so we can be forgiven for using colour in confusing ways, or for thinking a pie chart with 100 categories is a good way to express an idea. As I mentioned previously, I'm quite confident that in the present climate, unique looking or impressive visualizations will slip through peer review unchecked, lest the reviewer's lack of expertise in visualization be exposed by making a comment to the effect of "I don't understand this graph".

Now, far be it from me to suggest we only use column graphs or line graphs, or that we do X, but not Y. I think it's fantastic that so many people out there are pushing the boundaries of what we can achieve via visualization. The folks at the Guardian Data Blog do great work on bringing data to life, and their blog is a wonderful place for anyone seeking inspiration.

Instead, what I would suggest is that as creators of academic visualizations, we make sure our graphs are reviewed, even if our reviewers cannot or will not do so in the traditional peer review process.

The way I'd propose we do that is to show our friends and colleagues what we've made as often as we can, including during the drafting process. But it's not just about showing them. We have to ask the right questions. Let's use the graph below as a (relatively poor) example of a visualization that we might like to get feedback on. Please note that this is not a graph showing real data about the cost of grain in the 19th century. It's just an example.

Most of us likely want to ask "Do you like my graph?" or "What do you think of this?"

A more productive starting point is probably: Can you explain this graph to me? You aren't going to be there when your reader or viewer is interpreting your graph. The best way to find out what set of ideas are going to form in their mind is to ask them to explain their thought process out loud.

In this case, I had intended to show the seasonal difference in the price of grain in London and Edinburgh over a 20 year period. You may not have picked up on that, which means I need to fix something.

Don't be afraid to ask explicitly: Is there any element of this graph that you do not inherently understand? Make sure they can explain the labels on both axes (if relevant). If they don't know where you're getting those values from you may need to rethink your axis labels. You'd be forgiven for asking what the numbers on the Y-axis represent in the example. I didn't label it, so how could you know?

When you start experimenting with your visualizations, you're bound to come up with ideas you think are clear, but that just don't translate into ideas that your reader can interpret. Looking at the sample graph, I wouldn't fault you for asking what the top and bottom line of the curves represent. They're supposed to be two line graphs: one representing Edinburgh prices, and one representing London. I've shaded in the space between the lines to emphasize the size of the gap. If this is in fact two lines, then which one is Edinburgh? Which one is London? And when they overlap, how do I know which bit corresponds to which line? Do they cross, or merely meet and diverge again? I haven't made the fact that this is a line graph obvious because the lines aren't distinguishable from the shape formed by the colours.

Speaking of colour, you'll want to make sure you haven't come up with a palette that is going to make interpreting your graph difficult for someone with colour blindness. There are many different forms of colour-blindness, so it pays to run a test on your graph. You can do this online by using a "Colour Blindness Simulator" on your finished image.

Sticking with the negatives, ask your tester which element of the graph they like the least. For the sample graph, they may say they don't like the colours, or the font, or the legend. Personally, I think using --------> to represent arrows looks lazy. Everyone will have their own opinions on what's worst about your work. If you know what turns people off you can make visualizations that people like. And if they like the visualization, readers are more likely to engage with its message. With this in mind, go ahead and ask if they like your graph. Or if there are any elements of the graph that they particularly fancy.

Just as with your prose, it may take a few iterations and a number of different opinions from colleagues before a graph says to others what you think it says in your own mind. Just because you submitted a graph with your article and the peer-reviewers didn't comment on it doesn't mean you've done a good job of clearly expressing your ideas visually.

And one last question to ask, just to make sure your readers get the right message and aren't distracted: does the shape of the graph make it look like anything unrelated?

Graphs and visualizations have tremendous potential for expressing ideas in academic research, but it's not a skill we're typically taught in school. Most of us learn on the job, or emulate graphs we saw elsewhere that we found effective. Taking the time to ensure the graphs you create transmit the right ideas to your reader is good scholarship. Knowing the right questions to ask makes it that much easier to reach that result.

Questions to ask about a visualization:

Can you explain this graph to me?
Are there any elements you do not inherently understand?
Can you explain what each axis shows (if applicable)?
Will people with colour blindness be able to differentiate your colour palette? (check online)
What do you like least about the graph?
Do you like the graph / a particular element of the graph?
Does the shape of the graph make it look like anything distracting?

Students should be empowered, not bullied into open access

2013-07-23T07:28:00.003-04:00

'Bully Free Zone' by Eddie-S

The American Historical Association (AHA) has just adopted a resolution in support of recent graduates, encouraging them to feel empowered to keep their dissertations offline while they seek a publisher to turn that dissertation into a scholarly monograph.

Surprise, surprise, open access advocates everywhere have started snivelling.

No! they cry. We shouldn't support a resolution passed in good faith to protect the career progression of new scholars against scholarly presses that are allegedly refusing to accept manuscripts based on openly available dissertations. We should be burning books and the organizations that publish them. Down with books, up with free information on the Internet!

Lovely, but you can't eat free information. Makes a shit shelter as well.

Now, I certainly understand, sympathize, and even agree with the complaints of the open access community. Trevor Owens posted some great suggestions last night for ways to amend the AHA statement into one that recognizes some real flaws in the publication / promotion / tenure model that is over-reliant upon books. I certainly agree with Owens that it makes no sense to leave career progression of historians in the hands of acquisition editors at famous scholarly presses.

I'd also suggest that the AHA's claim that history is a "book" discipline is a bit too narrow. From where I live in London, England, hundreds of thousands of people make their living either directly or indirectly off of history. That can be anything from freelance tour guides who offer historic walks through the City, to the cafeteria workers in the museums and historic sites, the actor who draws you into his theatre for a rendition of Richard III or the actress who portrays Elizabeth Woodville in a television series, or even Her Majesty the Queen whose very presence and connection to a historic institution draws in millions of tourists every year.

The AHA's perspective is probably flawed in terms of the negative reaction of presses towards open access of dissertations. A yet to be published (and open access) article, "Do Open Access Electronic Theses and Dissertations Diminish Publishing Opportunities in the Social Sciences and Humanities", suggests that the vast majority of publishers are willing to consider submissions based on openly available theses.

With all of this in mind, let's give the open access community what they want: You're right.

But dear God you're obnoxious.

The decision of the AHA to support this measure is nothing but a well-intentioned gesture designed to protect and empower those at the most vulnerable point in their career from a perceived threat. How could anyone criticize them for that? The AHA and scholarly societies like it are not the enemy, and they don't operate to keep scholarship in the 19th century. They exist to promote the interests of their members, and that's exactly what the AHA has done with this resolution. If you want to change their direction, join them. Run for positions of power within their ranks, and influence the opinions of their membership. The historians who belong to these organizations aren't stupid, so if your ideas are good and your models sound, there's no reason we can't expect gradual change towards open access.

Both scholarly monographs and open access have their merits. We shouldn't be pushing for either / or, just like we haven't driven actors from the stage because we have television. Scholarly monographs are an effective way of preserving historical knowledge; they're in a format that the vast majority of us understand and even appreciate. We don't need to give that up.

And while I can appreciate the advantages of open access, its advocates often ignore the problems of an open access model. We live in a society in which things that have no cost have no perceived value. You wouldn't expect your lawyer to work for free, so why your historian? The scholarly presses defend their (failing) business model because it keeps their friends and family employed, their kids fed, and their bills paid. This isn't just a matter of profits funneling into the pockets of the rich. It's the way people like you and me make modest and honest livings.

If we start giving everything away we're promoting a model in which certain professions operate without the security of a paycheque while others doing important work continue to charge for their services. It's all well and good for open access advocates to tell us the benefits of their model, but until they come up with some solutions for its failings, they won't gain any friends who are sitting on the fence. Especially not if every well-intentioned effort by a scholarly society is met with a hostile barrage on Twitter by an extremist perspective that ignores the fact that we're all on the same team: We love history and we want to spend our careers sharing it with others.

If you want to give your dissertation away online, by all means do so. But it is your dissertation. You should feel equally empowered to bury it in a hole in the back yard, or throw it off a bridge. Anyone who tells you that you're bound by some moral obligation to give it away has a job, or a trust fund, and has no business putting any demands on your labour. Even if your scholarly book never earns you a cent, it's your prerogative to try and flog it any way you like. That doesn't make you a bad person. Neither does withholding your thesis from the Internet if you think that will help your pursuit towards a career that allows you to provide for your family. I hold my right to support my family far above your right to read my ideas for free.

I wholeheartedly want to thank the AHA for standing up for and empowering new scholars. No good deed goes unpunished, but there are many of us out there who appreciate your efforts and look forward to continued progress in what we hope becomes a civil debate and progression towards increased open access.

The Role of Blogging in the Academic Feedback Cycle

2013-05-18T09:09:00.000-04:00

Feedback Diversity is Good

Last year I delivered a couple of research papers on the history of crime. The first was in October at the Institute of Historical Research, or the IHR as it’s known, here in London. The second was in January, on a beach in Belize. I thought I'd talk a little bit today about how those two experiences were different, how they were the same, and what place I think each holds in the future of scholarship.

Now before you start looking for tropical conferences on 18th century crime, I should qualify that the first paper was delivered to a room full of people. The second was posted on my blog while I was on vacation – and yes, sadly, I DID write about 18th century crime while gazing out over the Caribbean Sea. For some people speaking to a room and blogging are probably significantly different activities. But for me they aren’t all that dissimilar. Let me explain why by talking about what I got out of both experiences as well as what went into them.

At the IHR, I presented an hour-long paper based on three chapters from my PhD thesis. It was about two years’ worth of work that I had condensed down and tried to make engaging for a room full of people. For about two months before I gave the talk I didn’t do much other than scramble to get the research done, create the graphs, build the PowerPoint presentation, and craft the 8,000 words that I was to deliver. It was an incredible amount of work. I wore a jacket and tie, and I think I might have even gotten a haircut. Good thing, because some of the most eminent crime historians in the world happened to be in town and decided to come to my talk. In all, I think there were about 50 historians in the room, most of whom knew far more about crime and the eighteenth century than I do.

The talk was followed by a really engaging discussion – at least from my perspective. I had a number of people offer suggestions for improving my argument, or on sources and archives I should visit. A couple of scholars who also write on similar topics challenged my findings – though were collegial and offered their own suggestions. Afterwards we continued onto the pub and to dinner as a group and over the course of the evening I must have heard ideas, criticisms, and praise from about 25 individuals on what I was doing.

The beach was a very different experience. The paper itself was just shy of 3,000 words, so somewhere in the 20-25 minute range if I had delivered it orally. This time my paper was based on some quick research I’d done just before Christmas. In total I’d invested a little more than a week analyzing the use of language in the Old Bailey Proceedings over a two hundred year period. It was really nothing more than an idea I'd wanted to test out, based on a conversation I'd had at the pub concerning the size of the lexicon over time.
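A minimal sketch of how the size of the lexicon over time might be measured. This is an illustration only, not the code behind the blog post: the sample trials are hypothetical stand-ins for a real corpus, and "lexicon" is assumed to mean the set of distinct lowercase word forms used in each period.

```python
import re
from collections import defaultdict

# Hypothetical sample data: (year, trial text) pairs standing in for a real corpus.
SAMPLE_TRIALS = [
    (1715, "The prisoner was indicted for stealing a silver watch."),
    (1718, "The prisoner was indicted for stealing a hat."),
    (1821, "The witness deposed that the prisoner took the watch and the chain."),
]

# Assumption: a "word" is any unbroken run of the letters a-z.
WORD_PATTERN = re.compile(r"[a-z]+")


def lexicon_size_by_period(trials, period_length=100):
    """Count the distinct word forms used in each period (by default, each century)."""
    vocab = defaultdict(set)
    for year, text in trials:
        # Round the year down to the start of its period, e.g. 1715 -> 1700.
        period_start = (year // period_length) * period_length
        vocab[period_start].update(WORD_PATTERN.findall(text.lower()))
    return {period: len(words) for period, words in sorted(vocab.items())}


sizes = lexicon_size_by_period(SAMPLE_TRIALS)
for period, size in sizes.items():
    print(period, size)
```

Raw counts like these grow with the amount of text in each period, so a real comparison would need to account for how many words survive from each period before reading anything into the trend.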

The results I came up with were what you might call half-baked. Not that I’d been lazy, or that I didn’t know what I was talking about, or that the results were wrong. Just that I hadn’t spent weeks or months revising my methodology and my prose as I had at the IHR. Nor did I do an in-depth literature review. Instead it was more an activity in play. I had some sources, I had an idea, I tried it out, I wrote it up – with a reasonable amount of care – and I posted it to the world, curious to see what it thought.

The world doesn’t scare me, although many postgraduate conferences suggest it should. It’s not uncommon for these postgraduate affairs to advertise the fact that they are collegial, and a safe place to try out ideas. No senior academics are going to be on hand to put you in your place and tell you how wrong you are. I’ve never been one for intellectual safety, so I don’t see putting a half-baked idea before the world as one of risk. Rather it’s one of potential. But it’s also one of uncertainty and often loneliness.

When I posted my paper on the blog, there was no beer and pizza afterwards – though I did have a nice swim. And in the end I got one comment on the post from Ben Schmidt at Harvard who offered a suggestion for improving the methodology and the results.

On the surface it looks like the blog post was significantly less successful, since I got 25 comments at the physical presentation and only one on the blog. But I don’t think that’s quite fair, for a couple of reasons.

Firstly, the talk at the IHR was a formal affair presenting years of research, with a moderator who gazes around the room encouraging more questions from the audience. The blog post was a way to test an idea, which is shouted into the great wilderness. The level of anonymity that readers of blogs enjoy means there isn’t the same pressure to respond. But just because they don’t respond doesn’t mean they didn’t engage with the content. It’s difficult to know how many people engage with content on the Internet. I know 50 people were in the room for my seminar paper at the IHR, and I didn’t notice anyone sleeping, but even then I can’t be sure who disappeared into the recesses of their mind as I talked away.

My blog however offers statistics, and though I know not everyone who visits a blog post reads it, I do know about 600 people came to take a look. That’s about 12 times more than showed up to hear my seminar, and because a blog post is printed on the Internet rather than delivered orally, vanishing on the wind as it’s spoken, my blog readers could be anywhere in the world, and could even have been sleeping when I delivered it.

But what I think is important is not how many people read the blog post. Rather, it’s the diversity of the people who did so. The seminar at the IHR was attended almost exclusively by specialists in 18th century British history. The blog reaches a much more diverse audience who typically come through one of two channels:

• Twitter
• Digital Humanities Now.

When I publish a new blog post, I post a notice on Twitter letting my followers know. If I’m lucky a few people will notice and share it with their followers on Twitter, or will write a response on their own blogs. And if I’m really lucky a group of scholars in Virginia who run a blog called Digital Humanities Now, which posts the best blog posts of the day related to digital humanities, will tell their audience about my post, sending even more people. That’s basically what happened in the case of my Belizean blog post. I published it to the blog, told Twitter, was re-tweeted by a few people, and was showcased on the Digital Humanities Now blog.

That meant my audience included a large number of digital humanists who work in a wide range of academic disciplines including linguistics, computational analysis, literary studies, and history. I think it’s fair to say most of that audience doesn’t care about 18th century British history. However, they do share an interest in the methodology I used to work with the sources. One of those digital humanists, Ben Schmidt, posted the comment that helped me refine my methodology and come up with even stronger results.

No one in the IHR seminar was going to give me that type of feedback because that’s not the type of expertise they have. Instead they focused on the details related to the history of crime or on the records they knew of in the archives. So by seeking out a different audience through the blog, I was able to get interdisciplinary feedback on my work.

History seminars are extraordinarily valuable, particularly for early career scholars like myself. The level of intimacy you get in that type of environment is unparalleled. But they’re a bit like poorly designed focus groups. If you want to take the pulse of the nation on welfare reform or Euroskepticism, you don’t want a room full of Horse and Hound subscribers. You need the diversity of a few Daily Mail readers thrown in the mix, who see the world from a slightly different angle.

And I think the blog and Twitter provide that diversity for me. In my case, my blog attracts a lot of digital humanists, but blogs aren’t just a way to get feedback from digital humanists. I posted another blog post a few weeks later on the same research material, this time focused on using criminal records to measure immigration. I again got one comment, but this time it was from Tim Hitchcock, a historian of 18th century Britain, who offered a historical interpretation that might explain what I had found. Tim’s expertise with the provenance of the records meant he knew things about the sources I didn’t.

I posted a third blog post again on a slightly different topic, and received different types of comments again from linguists, computer scientists, and Sharon Howard, the project manager from the Old Bailey Online project. With three blog posts and roughly the same number of words as my seminar paper, I’d engaged a number of different types of people from all over the world with very different sets of expertise, and different types of feedback than I could ever expect to get from a room full of crime historians.

Which experience was more valuable? The seminar or the blog posts? For me, I don’t think there’s much that can compare with a room full of world experts devoting their combined experience to listening to and critiquing years of your hard work. I also don’t think you can beat the type of connections that can only be made in a face-to-face meeting at the pub, or over pizza with people who share your interests. But I also don’t think we should sniff at a model that allowed me to test 3 ideas in an informal setting, get a broad range of feedback from interdisciplinary experts all over the world, and all without costing anyone a penny.

I’ve taken on board all of the feedback I’ve received from these two papers. My PhD thesis is stronger for having delivered the seminar paper, and I’ve decided to pursue the ideas expressed in my blog more formally as a future research project. So these papers were both valuable in their own right, and I think I’m a better historian for having delivered them.

This is the text of my talk at 'Our Criminal Past: Digitisation, Social Media, and Crime History' held at the London Metropolitan Archives, 17 May 2013. With thanks to Heather Shore for inviting me to speak.

Is the Programming Historian 2 a MOOC?

2013-04-20T12:30:00.000-04:00

'Evil Robot' by Jennifer Morrow (cc-by)

A few months ago I was asked if the Programming Historian 2 is a MOOC. For the uninitiated, a MOOC is a Massive Open Online Course. They’ve been popping up online for the past couple of years, principally at major American universities like MIT and Stanford, claiming to be able to teach thousands or even hundreds of thousands of students at the same time – for free. They’ve so far had mixed results but it seems most people in academia have an opinion on them – either, meh it’s a fad, damn we gotta get one of those at our school, or the robots have come for our jobs! Defend! Defend!

I can’t speak for the other editors of the Programming Historian 2 (PH2). But I can say: No. I don’t think the PH2 is a MOOC. If you haven’t found us yet, the PH2 is an open access series of tutorials designed to let humanities researchers get their toes wet with computer programming. The lessons involve learning simple programming tasks that are immediately useful to ordinary working humanists. That might be automatically downloading historical records from the Internet, or analyzing a collection of sources with topic modeling. All of the lessons are online – like a MOOC – and there is no teacher in the room with you – like a MOOC.

So why no MOOC? For me, what sets a MOOC apart from a classroom-based course is a belief that the tutor-tutee relationship can be depersonalized and made redundant. MOOCs replace this relationship with a series of steps. If you learn the steps in the right order and engage actively with the material, you learn what you need to know, and who needs a teacher?

I don’t think that’s what we’re about. Instead, some of the most exciting feedback we’ve got at the PH2 has been from academics who have used the PH2 as a teaching tool in their classroom. Either they’ve assigned lessons for their students to work through, they’ve challenged students to write lessons of their own, or they’ve used the PH2 to teach themselves a skill that they can then pass along to their students.

That’s not to say you can’t use the PH2 to teach yourself some programming if you haven’t got a teacher. It’s to say the PH2 is not the evil robot looking to take your job away. It’s the friendly robot looking to give your teaching toolkit a few more options, and maybe a new skill or two with which to impress your friends and colleagues. Not unlike a book. And books haven’t put literature professors out of a job, but they have made English lit courses more interesting.

Trust Me: The Old Bailey Online as a model for digitization projects

2013-04-15T07:38:00.000-04:00

The Old Bailey Online (OBO) turned 10 years old this week, and to celebrate, Sharon Howard has been encouraging blog posts and tweets from the project's wide network of contributors. I thought I'd add just a few brief thoughts on what I like about the OBO, and why I avoid so many other competing digitization projects. Rather than explain what the OBO is, I thought I'd save time and steal the explanation from their own website:

A fully searchable edition of the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London's central criminal court.

The trials run from 1678 to 1914, making it a great resource for social historians or historians of crime. I broadly fit into both of those categories, but what really interests me is knowledge management. I want to know how we can extract useful knowledge from bodies of text far larger than we could ever read in our lifetime. I'm interested in the historical research questions I pursue, but I'm more interested in the processes of understanding and discovery that the pursuing of those questions lets me explore. That is to say: I'm more interested in how we can know something than what we find out. This all means I have slightly different criteria for a good resource than does a typical historian. When I'm planning a project I'm not looking for 'gaps in the literature'. Instead, I'm really only looking for 2 things:

1) A corpus of downloadable electronic text
2) A corpus that does not assume I want to read anything

1) A Corpus of Electronic Text

At the moment my work is almost exclusively based on textual analysis. By that I mean I work with words rather than sounds or images or smells or physical objects. I want to know what human knowledge is contained in the symbols on pages. That means for me the best thing you can give me is a good clean set of electronic text. The Old Bailey Online does this beautifully - better than just about anyone else actually - by providing more than a hundred million words of transcription. Most important: the OBO is entirely downloadable. That means I can put it on my own computer and I can measure it, twist it around, write programs to analyse it, use other people's programs...anything I like. No one is going to threaten to sue me or press criminal charges for downloading the records. And best of all, once I have the records I don't have to read them. Because that's not the focus of what I do.

2) A Corpus That Does Not Assume I want to Read Anything

I'm certainly not one to suggest reading is obsolete, or that historians should stop going to the archives. But I'm always disheartened to see new scholarly - usually commercial - databases come online that only allow reading. I'm talking about the ones that cost an arm and a leg to university libraries, let you keyword search, but then force you to read a scanned copy of the original while hiding the electronic text layer.

I find these projects infuriating, and would rather pretend they don't exist than struggle to find a research question that's appropriate for their limited interface. The thing that bothers me most about these gated resources is that the publishers who create them are implicitly saying: we don't trust you. They don't trust us because the only thing they possess that allows them to sell their product is the electronic text. That's the part of the project that cost the most and took the longest to create. They think if that starts floating around on the Internet they won't be able to make money anymore.

The OBO is different because it's non-commercial. The OBO trusts us and encourages anyone interested to use the records to explore human knowledge in any way they see fit. For some that means sitting down and reading from digital copies of the original source. For others like me, it means downloading the entire corpus and measuring the rates of transcription errors, or of the impact of courtroom reporters on the vocabulary used in the records, or on the pace of migration in eighteenth century London.

The OBO and its team have trusted us. And from that have poured forth far more research about early modern crime in London than anyone ever could have imagined. Perhaps more research than we need. Meanwhile, researchers like myself continue to ignore the large commercial databases that lock up access to their resources, and hope intently that their publishers will learn from what is still the best online scholarly database I've worked with. We're starting to see steps forward from some (see the Library of Wales' Newspaper Collection for a good example), but overall there's room to improve.

Until we see a shift away from mandated reading, I'll stick to resources like the OBO. So happy birthday to the OBO and cheers to the project team for trusting us. I hope it's paid off.

Programming Historian 2 Lessons I'd Like to See

2013-04-03T07:10:00.000-04:00

I've been actively part of the Programming Historian 2 team for the past two years and I've been really pleased to see so many people using and learning from the site, including a number of university courses. I learned to write Python code from the original Programming Historian, and I still regularly reference skills and techniques found in the lessons in my day-to-day research.

My role as an editor of the project means I help guide lessons contributed by others through peer review and editing. I'm also always looking around the blogosphere for people working on cool new techniques or writing guides of their own that I think would be useful for practicing historians. For the most part this is a passive process. I sit, I wait, and I watch. But every once in a while I come across something I'd really like to see. So rather than wait, I thought I'd post my personal wish list of Programming Historian 2 lessons I'd like you to write for all of us.

In no particular order:

• How do you turn a spreadsheet into a database and write custom queries? The jump from an Excel spreadsheet which you can see to a MySQL or sqlite3 database that you can't see is not an easy one. A lesson on making this leap would be well received and widely used I would imagine.
• What the heck do you do with topic models? The entire digital humanities world seems fixated on topic models these days. Our most popular lesson by far is a tutorial on Getting Started with Topic Modeling and MALLET. But what are the cool things we can do once we HAVE generated topic models? What can we know? How do we use it responsibly? How do I interpret all these numbers and topics?
• What can we do with our sources once they've been downloaded? I see so many people using programming to curate sources, but far fewer people asking historical questions of their sources using programming. What are some of the ways we can actually answer questions about the past with programming?
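To give a flavour of the first item on this list, here is a minimal sketch (not a full lesson) of moving a spreadsheet into a database using Python's built-in csv and sqlite3 modules. The table name, column names, and sample rows are all hypothetical; the sketch assumes the spreadsheet has already been exported as CSV.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet contents, as they might look after "Save as CSV".
CSV_EXPORT = """name,year,offence
Mary Jones,1745,theft
John Smith,1745,assault
Ann Brown,1790,theft
"""


def load_csv_into_sqlite(csv_text, connection):
    """Create a 'trials' table and fill it with the rows of the CSV export."""
    connection.execute(
        "CREATE TABLE trials (name TEXT, year INTEGER, offence TEXT)"
    )
    # DictReader uses the header row to turn each line into a dictionary.
    rows = csv.DictReader(io.StringIO(csv_text))
    connection.executemany(
        "INSERT INTO trials (name, year, offence) VALUES (:name, :year, :offence)",
        rows,
    )
    connection.commit()


# An in-memory database; pass a filename instead to keep the database on disk.
connection = sqlite3.connect(":memory:")
load_csv_into_sqlite(CSV_EXPORT, connection)

# A custom query: how many trials were recorded for each offence?
offence_counts = connection.execute(
    "SELECT offence, COUNT(*) FROM trials GROUP BY offence ORDER BY offence"
).fetchall()
print(offence_counts)
```

Once the rows are in a table, the "custom queries" part is a matter of changing the SELECT statement, for example filtering with WHERE year = 1745.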

I'd be very happy to hear from anyone who'd like to take on these challenges and create a Programming Historian 2 lesson, or from anyone with an idea of their own they think others could benefit from. Check out our submission guidelines and be in touch.

Voluntary Article Processing Charges for Scholarly Journals

2013-03-24T13:55:00.000-04:00

The Article Processing Charge (APC) has started to rear its ugly head in many academic fields and it's threatening to spread more widely, particularly in Britain as the government moves towards mandated open access publishing of research. This move means that publishers will lose out on subscription revenue, and they have turned towards APCs to compensate for that lost revenue. The idea here is that the author pays an APC (which could be anything from a few pounds to tens of thousands depending on the journal) and the publisher agrees to provide open access to the article.

The model isn't perfect, but it is realistic for many publishers, provided that no one is turned away if they cannot afford to pay. It turns out at least one not-for-profit journal has been able to adopt just such an idea that protects the vulnerable, while raising funds at the same time. The Journal of Open Research Software, run by the Software Sustainability Institute (of which I am a fellow - though I am not affiliated with the journal) offers a voluntary APC:

If your paper is accepted for publication, you will be asked to pay an Article Publication Fee of £25 to cover publications costs...You will be able to pay any amount from nothing to full charge, as we recognise that not all authors have access to funding, and we do not want fees to prevent the publication of worthy work. The editor and peer reviewers of the journal will not know what amount (if any) you have paid, and this will in no way influence whether your article is published or not.

I'm not sure how well this policy has worked for the Journal, but I have to say I'm incredibly enthusiastic about it for a few reasons. Firstly, it acknowledges openly that publishing - even open access publishing - DOES cost money. That money needs to come from somewhere, and APCs, like 'em or hate 'em, are one such solution. Secondly, it acknowledges that not everyone has a research budget - students, emeritus scholars, independent scholars - and that these people should not be squeezed out of the system of research publishing because of their career status. And thirdly, it's a creative solution that's taking on the challenge of raising money for publishing that thinks a little outside the box.

We're all going through changes in terms of publishing and academic funding. I for one am pleased to come across examples such as this that are facing those changes with optimism and ingenuity.

The Two Data Visualization Skills Historians Lack

2013-03-13T06:04:00.000-04:00

Four Stages of Data Visualization, by Tobias Sturt at the Guardian

To create a great data visualization you need four skills. You don't have all of them. That was the message of Tobias Sturt and Adam Frost of the Guardian at a recent masterclass on data-vis held in London. The pair both work for the newspaper's "Digital Agency", a for-hire data visualization consultancy run by the paper. Frost's role is to work with the data and find the story. Sturt determines the most appropriate chart style and the design that will help the reader interpret and engage with that data. That doesn't mean Frost knows nothing about the strengths and weaknesses of certain types of charts, or that Sturt runs away shrieking when he sees a spreadsheet. It does mean they each bring strengths to the table which allow them to create engaging visualizations that are true to the underlying data. That's what good collaborations achieve and anyone who's seen the outputs of the Guardian's team knows they're an incredibly talented group.

Where do historians fit in? I'd say most of us are like Frost. We can handle our data, be it numbers or words, or images, or material culture. We interpret what we see. And we find the story that adds the context to that data. According to Frost and Sturt, these two steps bring the integrity and meaning to the audience. But when it comes to data, words aren't always the best way to present them, and raw data in tabular form (as we've all seen so many times in journal articles) is what Frost refers to as "clarity without persuasion".

That means we need to find and work with the Tobias Sturts of the world. We need to collaborate with those with an eye for colour and form, who can take numbers and turn them into understanding. Without people like Sturt, the above visualization would be nothing more than its raw data:

Data
Story
Chart
Design

But we get so much more from his visual representation of those four ideas, and few of us have the skills to compete with the creative power of designers. They know things we don't. They know how colours make us feel or what they imply. They know you're more likely to believe a statement written in Baskerville than Comic Sans font. They understand how your eye scans a page, what it's looking for, and how the location of certain elements on the page or the size of those elements change the way we interpret them. They know what we don't.

The question is: where are these people and do they want to work with us?

I'm afraid I'll have to disappoint you and admit: I don't know. Sturt is likely out of the price range for most academic historians. His clients tend to be corporations looking to develop their brands, or large non-profits trying to reach huge audiences. But we all know there are artists out there looking for work. It seems to me the issue may be that we haven't yet realized we need each other, so we haven't yet had to build those relationships. We could say those artists have failed to market themselves to us, but unless we let them know we're interested, we can hardly blame them for ignoring us.

So maybe the best way is to ask. Artists: how do we find you? What should we be looking for in an artist? And what would you look for in us?

Making Open Access and the UK's Scholarly Society Work

2013-03-04T07:43:00.002-05:00

This past Friday at a one-day colloquium on Open Access I learned why academic publishing is so expensive, and I was disappointed to discover that resistance to open access from scholarly societies is not linked to the costs of publishing, but to the cost of non-publishing activities. The UK is in the midst of a heated debate about Open Access, following the Finch Report and an incoming policy that will require all research funded by the taxpayer to be published open access. For this to work, publishers are to be paid up front for lost revenue in what has been called the "Gold Model" of author pays for publication.

Nearly everyone agrees open access is a good thing, but how to pay for it is a matter of contention. The government's policy works much better in the sciences where large research budgets are common and a few thousand quid for publication costs is a drop in the bucket. The Wellcome Trust's representative Simon Chaplin argued at the colloquium that they've been funding this practice for years and thought it was a great use of money.

I don't disagree with Chaplin, but few historians will ever see a grant the size of a typical Wellcome Trust award, which can run to hundreds of thousands or millions of pounds. Many historians operate entirely without funding, but those working in academic departments will have to find the money to publish in an open access format, else their work will not "count" towards the 2020 REF (the UK's program of counting up who does good research, used to distribute future research funding). The government's proposal is also potentially disastrous for early career researchers who will find it difficult to secure funding to publish and who may have to choose between paying for food and "investing" in their career by paying for publications. Why would a department give a temporary employee (e.g., a postdoc) access to funding for publishing that could go to permanent staff, when there's a good chance that employee will be contributing to another university's research outputs by the time the tallies are next taken?

While I did sympathize with many of the positions speakers took at the colloquium, it was the position of the scholarly societies in particular that I found most frustrating.

Let me first say that I think scholarly societies are wonderful. In particular I think they have been instrumental in supporting promising early career researchers through funding, bursaries, prizes, fellowships, and opportunities to publish. I should also note that I have been employed by a scholarly society since 2008 and take pride in the work we do.

What I do not like is how many scholarly societies get their money, which became clear to me this past Friday. Jane Humphries, President of the Economic History Society, spoke on the business model of her society. According to Humphries, 1/3 of their income comes directly from the subscriptions raised by the society's journal. These subscriptions are then used to fund the activities of the society rather than to pay the costs of publication alone. Humphries argues that without these subscriptions the society could not continue to function, which is a major push behind resistance to open access because most societies and publishers assume they will be forced to take what amounts to a paycut under the proposed models.

One of the activities of the Economic History Society is to fund 5 postdoctoral fellowships at a cost of £70,000. This fellowship scheme is a wonderful one and it's something I'd be very sad to see discontinued. However, it is NOT a publishing cost. Instead, the subscriptions are increased well above the cost of publication in order to fund non-publishing activities. That means libraries are being charged a surplus. And libraries get much of their money from the pockets of students paying tuition, who are indirectly funding these postdoctoral fellowships without a say in the matter. While the scheme is entirely and undoubtedly well intentioned, the society is not working as hard as it could to reduce the costs of publishing because it has a vested interest in constantly increasing its income and expanding its activities. They are effectively robbing Peter to pay Paul. And I'm Peter.

The problem therefore is not that publishing is expensive. It's not that open access is bad. It's that publishing in its current model pays for other good things which will not be supported under the new model. But that does not mean these wonderful extra activities need to cease, or that open access will not work. It means we need to get behind scholarly societies to find a new way to fund these activities.

So what can we do about the lost income? Well we might need to get creative, but here are two ideas.

Fundraising

I've yet to see any scholarly society attempt to fund a postdoctoral fellowship through crowdfunding on Kickstarter or a similar service. No one likes to pay taxes, but many people are willing to support a specific initiative. A £50 annual membership fee to a scholarly society feels much different than does a £50 donation that I know will go directly towards a fellowship.

Many societies also have natural connections to certain types of businesses, which could surely be approached for donations. In particular I'd imagine the Economic History Society, based near London's financial core, and peopled by many a former London banker-turned-historian could make use of its personal network to solicit donations from their sector. Saying you don't like to ask people for money is not an excuse, particularly if the alternative is to continue taking it from unwilling students.

Wikipedia runs entirely on a fundraising drive and I've never thought ill of them for it. In fact, I gave them $50 last year to support their continued activities.

Advertising

Ads are entirely under-used in academia. The Old Bailey Online is one of the few academic projects I've seen that freely uses Google Ads to cover some of the project's ongoing costs. There is absolutely nothing immoral about allowing someone to underwrite a society's activities in exchange for some exposure. Even if it is only a partial solution, it's one every society owes it to their communities to pursue.

* * *

Scholarly societies need to acknowledge that open access is not the problem. They need to be honest about what the REAL costs of publishing are, and they need to be open to ideas that can reduce those costs. Open access is good for nearly everyone. So let's embrace it, and then let's work together to find ways to continue to support the great activities of the scholarly societies. The future may not work the same as yesterday, but that doesn't mean we can't make it work.

Identifying and Fixing Transcription Errors in Large Corpuses

2013-02-10T15:48:00.002-05:00

"Underwood 11 Typewriter", by Alex Kerhead.

This is the third post in my series on the Old Bailey Online (OBO) corpus. In previous posts I looked at the impact of courtroom reporters and editors on the vocabulary used in the Old Bailey trial transcripts, and at ways of measuring the diversity of immigration in London between the 1680s and 1830s.

Since I'm dealing with a huge amount of text (51 million words, 100,000 trials), I thought I'd turn my attention to the accuracy of the transcription. For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of the two typists were then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process, known as “double rekeying”, ensures the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

Well as you may have read in the previous posts, I ran each unique string of characters in the corpus through a series of four English language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word in neither of these lists has been put into a third list (which I've called the “unidentified list”). This unidentified list contains 43,000 unique “words” and I believe is the best place to look for transcription errors.
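The sorting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the program I actually ran: the function name `build_unidentified_list`, the tiny word lists, and the sample sentence are all hypothetical stand-ins for the real dictionaries, surname list, and corpus.

```python
import re
from collections import Counter


def build_unidentified_list(text, dictionary_words, surnames):
    """Count every token that is in neither the dictionaries nor the surname list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    known = {w.lower() for w in dictionary_words} | {s.lower() for s in surnames}
    return Counter(t for t in tokens if t not in known)


# Hypothetical miniature word lists and text, for illustration only.
dictionary_words = {"the", "silver", "was", "taken", "from", "by"}
surnames = {"Smith"}
sample = "The sivler was taken from the watchhouse by Wlliam Smith"

unidentified = build_unidentified_list(sample, dictionary_words, surnames)
print(sorted(unidentified.items()))
```

Keeping a count for each unidentified string, rather than just a set, makes it easy to separate one-off oddities from terms that recur thousands of times.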

Not all of the words on the unidentified list are in fact errors. Many are archaic verb conjugations or spellings (catched – 1,657 uses or forraign – 1 use), compound words (shopman – 4,036 or watchhouse – 2,661), London place names (Houndsditch – 877), uncommon names that had not been marked up as such during the XML tagging process (Woolnock – 1), Latin words (paena – 1), or abbreviations (knt – 1,921) – short for “knight”, a title used by many gentlemen in the eighteenth century.

On the other hand, many of these words are clearly errors. We see mistyped letters as in “insluence” instead of “influence” or “doughter” instead of “daughter”. We also see transposed letters as in “sivler” instead of “silver”. And there are missing letters: “Wlliam” instead of “William”. Finding the difference between the real words such as “watchhouse” and the errors such as “Wlliam” amongst the 43,000 terms on the unidentified list is the real challenge.

Checking manually is impractical as these terms appear nearly 200,000 times in the corpus. Correcting every single error might not be worth the effort. However, to get an idea for the types of errors we see appearing and in what proportions, I checked every entry on the unidentified list against the image of the original scanned record during a single session of the court: January 1800. The unidentified words fell into the categories seen in Figure 1.

Figure 1: January 1800 Old Bailey Online transcription errors and the type of error.

The most surprising category here for me is the purple section, which showcases three instances that I would have categorized as typos by the transcribers, but which were actually typos in the original source. This compounds the problem because it means we must acknowledge that in some instances the error is not with the OBO team but is in fact reflecting the content of the contemporary document. From the perspective of a person searching for a particular keyword in the database they may be frustrated by the original error. On the other hand, from the perspective of those who want to be true to the original, that mistake should be preserved. I won't weigh in on that particular issue here, but it is something anyone working to correct transcription errors should consider.

With this in mind we can begin to look at the other categories, and by the looks of things approximately 40% of entries can in theory be corrected if we can figure out the intended word. Admittedly, I only looked at a single session of the trials, and this may not be representative - particularly if we consider Early Modern English, which might lead us to believe earlier trials are more likely to have archaic non-standardized spellings. If however the session from 1800 is roughly representative of a typical session then we should expect to find somewhere in the neighbourhood of 15,000-20,000 errors.

What can we do about it?

How can we automatically find and correct those errors? Given the fail-safes put in place by the double rekeying process, it's already incredibly unlikely that we will find typing errors by the transcribers. That means when we do encounter such errors it's likely only going to happen once or twice, meaning most errors are probably words that appear only once or twice in the corpus and that do not appear on either the dictionary list or the surname list.

That's not to say, of course, that a word is transcribed correctly just because it appears in the dictionary; however, at this stage it is much easier to identify those errors that are not recognized words. Unfortunately there are over 30,000 unique words on the unidentified list that appear only once, meaning this is still impractical to explore manually. Luckily the double rekeying means that any mistakes are more likely to be a matter of the transcriber interpreting the marks on the page differently than we might have liked than a case of fat fingers hitting the wrong key.

The early modern “long S” is the perfect such example. In the early modern era, up to about 1820, it was entirely common to find the letter S represented as what we might think looks like a lower-case “f”. This is the “suck” vs “fuck” problem that the Google N-Grams viewer runs into, as a slew of esses are interpreted as efs. When viewing the result one might be tempted to conclude people had quite a potty mouth on them in the early nineteenth century, as can be seen in Figure 2. Though not necessarily an incorrect assumption, it wouldn't be wise to make the assumption on this particular evidence.

Figure 2: Google N-Gram results for "suck" and "fuck" in the early nineteenth century

When we look through many of the words on the unidentified list it becomes clear that the Long S is a substantial problem. We find examples of the following:

abufes
afcertained
assaffin
affaulting
affize

Or, the other way around:

assair
assixed
assluent
asorethought
artisice

By writing a Python program that changed the letter F to an S and vice versa, I was able to check if making such a change created a word that was in fact an English word. When I did this I was pointed to several thousand possible typos. As I inspected the list further I noticed there were other common errors probably caused by the very high contrast scans of the original documents. These original documents often included missing parts of letters, difficult to read words, or little bits of dirt or smudges that made interpreting the marks more challenging.

Some of the most obvious switches were:

F / S
I / L
U / N
C / E
A / O
S / Z
V / U

Why these particular switches appeared again and again I'm not entirely sure. Some of them are easy to understand: the lower-case C and lower-case E are easy to mix up. Especially when a fleck of dirt shows up in just the right spot on the scan. Others are a bit more difficult to explain, as with U and N, which we wouldn't expect an automated optical character recognition program to have trouble with, but which seems to have stumped the human transcribers repeatedly.

By running these seven sets of letters through the program and testing the results against the English dictionaries I was able to come up with 2,780 suggested corrections. If these are all correct, that simple switching would correct 9,503 typos in the OBO corpus. The results of these changes broken down by letter-pair can be seen in Figure 3.

Figure 3: The number of suggested corrections in the OBO corpus by switching letter pair combinations in misspelled words.

I say suggested corrections because in some cases the switch is actually wrong, or may be wrong. The English dictionaries missed "popery", a common term used to refer to Roman Catholicism in the eighteenth century, and the program has instead suggested the unlikely "papery" as an alternative. In 86 cases the switching has come up with two possible suggestions, both of which are English words, at least one of which is obviously incorrect. The unidentified word "faucy" could be "saucy" or "fancy". Turns out it's saucy, referring to the behaviour of a Peter Dayley - that naughty boy.
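For anyone who wants to try the letter switching on their own corpus, here is a rough Python sketch of the idea. It is not a copy of my original program: the function name `suggest_corrections`, the choice to try both a single-letter switch and an all-at-once switch, and the five-word stand-in dictionary are assumptions made for illustration.

```python
# The seven letter pairs listed above, in lower case.
SWITCH_PAIRS = [("f", "s"), ("i", "l"), ("u", "n"), ("c", "e"),
                ("a", "o"), ("s", "z"), ("v", "u")]


def suggest_corrections(word, dictionary_words, pairs=SWITCH_PAIRS):
    """Return dictionary words reachable by switching one letter pair in `word`."""
    word = word.lower()
    candidates = set()
    for a, b in pairs:
        for src, dst in ((a, b), (b, a)):
            # Switch every occurrence at once (e.g. "assaffin" -> "assassin").
            candidates.add(word.replace(src, dst))
            # Also switch each occurrence on its own.
            for i, ch in enumerate(word):
                if ch == src:
                    candidates.add(word[:i] + dst + word[i + 1:])
    candidates.discard(word)
    return sorted(c for c in candidates if c in dictionary_words)


# Hypothetical stand-in for the English dictionaries.
dictionary_words = {"saucy", "fancy", "assassin", "affair", "papery"}

print(suggest_corrections("faucy", dictionary_words))     # two rival suggestions
print(suggest_corrections("assaffin", dictionary_words))
print(suggest_corrections("popery", dictionary_words))    # a false suggestion
```

Returning every match rather than the first one is deliberate: cases like "faucy" need a human to choose between the alternatives, and cases like "popery" need a human to reject the suggestion altogether.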

This switcheroo method will not solve all problems. It cannot fix transposed letters, as with sivler and silver; Levenshtein distance is likely needed for that. It does nothing for missing letters as in Wlliam. But it does take us well along the path to making some rather dramatic improvements with a very reasonable amount of effort, and I would argue, could be an economical way to improve the accuracy of projects which have already been transcribed but which suffer from accuracy issues. As with all great things in life this algorithm still requires a human's careful eye, but at least it has pointed that eye in the right direction. And when you're looking at 51 million words of text, that's nine-tenths of the battle.
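The transposed and missing letter cases could be approached in a similar spirit. The sketch below is something I have not run against the OBO corpus: it simply generates every string that is one adjacent swap or one inserted letter away from the unidentified word and keeps those found in a word list. The function names and the two-word list are hypothetical.

```python
import string


def transposition_candidates(word):
    """Every string made by swapping one pair of adjacent letters."""
    return {word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)}


def insertion_candidates(word, alphabet=string.ascii_lowercase):
    """Every string made by inserting one letter anywhere in the word."""
    return {word[:i] + ch + word[i:]
            for i in range(len(word) + 1) for ch in alphabet}


def suggest_by_edit(word, known_words):
    """Known words that are one adjacent swap or one missing letter away."""
    word = word.lower()
    candidates = transposition_candidates(word) | insertion_candidates(word)
    candidates.discard(word)
    return sorted(candidates & known_words)


# Hypothetical stand-in for the dictionary and name lists.
known_words = {"silver", "william"}

print(suggest_by_edit("sivler", known_words))
print(suggest_by_edit("wlliam", known_words))
```

Because short words have many near neighbours, I would expect this approach to throw up more false suggestions than the letter-pair switching, so the human eye matters even more here.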

If you're working on a project that could use some accuracy improvements, or have explored other ways of achieving similar results, I'd be very happy to hear from you.

Measuring the Diversity of Immigration using the Old Bailey Online 1674-1834

2013-01-17T04:46:00.000-05:00

"Mother's Wartime Passport -1941" A. Davey

This is the second in my series of posts on the Old Bailey Online (OBO) corpus. I've downloaded all of the trial transcripts from 1674 to 1834 (find out how on the Programming Historian 2), which is about 100,000 trials and 51 million words of text. In the last post I looked at the impact of editors and scribes on the vocabulary in the Old Bailey Proceedings.

This time I thought I'd look at something a little closer to my area of expertise: immigration to London in the Early Modern era. I've used the OBO heavily in my doctoral work on Irish immigrants, but that's been focused exclusively on the years 1801-1820, immediately following the 1801 Union of Irish and British parliaments. I've yet to take a longer look at immigrants across the centuries using the OBO and I thought this would be the perfect opportunity to do so.

This time I'll be looking at the "people words" extracted from the OBO corpus. As I mentioned in the last post they were identified by extracting all of the words that appeared between a set of "persName" tags in the XML version of the transcripts. This gave me just shy of 62,000 unique strings (referred to hereafter as "words") used to represent people. That's nearly half of all unique words in the corpus. Of those 62,000 words, most (55,000) are not found in the four English language dictionaries I used to identify English words. The remaining 7,000 are words such as "green" or "woman" or "the", which are used to refer to people such as "the woman" or "John Green", but which can also be used in other contexts (the woman's green hat). Not all of these words are therefore proper names; instead, they are words that have been marked up by the OBO team as a reference to a person somewhere in the corpus.
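For the curious, pulling out the "person words" could look something like the Python sketch below. The XML snippet is a simplified, made-up example rather than a real trial, and the function name `person_words` is hypothetical; the real OBO files carry more attributes and nested tags than shown here.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, invented trial fragment for illustration only.
SAMPLE_XML = """<div1 type="trialAccount">
<p><persName type="defendantName">John Green</persName> was indicted for
stealing a hat from <persName type="victimName">the woman</persName>.</p>
</div1>"""


def person_words(xml_text):
    """Collect the unique lower-case words found inside persName tags."""
    root = ET.fromstring(xml_text)
    words = set()
    for element in root.iter("persName"):
        # itertext() also picks up text inside any nested child tags.
        text = "".join(element.itertext())
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words


print(sorted(person_words(SAMPLE_XML)))
```

Note that "the" and "green" end up in the result alongside "john", which is exactly why not every person word is a proper name.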

In Figure 1 you can see the rate at which these new "person words" appeared in the corpus.

Figure 1: Total number of "person words" found in the OBO corpus to date.

Nothing particularly exciting here. It seems like most of the person names that are also English words appear very early on. It also looks like the number of unique words used to describe people increases at a steady pace throughout the long eighteenth century. From a cursory look at the list of names, it seems evident that many of these words are surnames.

Given names (first names) on the other hand, are much less common. That's because most early modern Londoners had pretty common given names (William, John, Elizabeth, or some variation thereof). Silly names for babies are largely an invention of twenty-first century Hollywood actors and football players.

While London was home to hundreds of thousands of people in the eighteenth century, it's safe to say the number of surnames people had in London increased over time as migrants flooded in from across England and beyond carrying new names with them. New surnames therefore have a few ways to end up in the corpus:

1. An established London family is mentioned in the record for the first time
2. An immigrant family with a new name arrives in the area and ends up in the record
3. Someone with a funny accent tries to say their name and it gets spelled phonetically

In the case of #1, it's entirely possible an established family (or anyone with that name) just avoided the Old Bailey for decades on end. I've managed to do so and there's no reason to expect others didn't too. However, common names shared by many people and local to Londoners should eventually show up. In fact, there's a reasonable chance they'll show up very early. We see this is in fact the case, as Smith, Wilson, Brown, White, and Allen all appear for the first time before 1680. "McCaffrey" on the other hand doesn't show up until 1834 and it's safe to say represents a name brought to London by an immigrant (either scenario #2 or #3 above).

New names arriving in the area may not be indicative of the total number of new people who have migrated to London, but I do believe it reflects the growing diversity of immigrants arriving. Malcolm Smith and Donald MacRaild's article, "The Origins of the Irish in Northern England" shows that at least with Irish surnames, most names can be pinpointed to a particular region in Ireland. This regionality of names was still evident into the middle of the nineteenth century and will be no surprise to any genealogist who has sought out their family's past. We can see direct evidence of this regionality by mapping surnames. Great Britain Family Names allows you to see the distribution of any name in Britain in 1881. In Figure 2 you can see the distribution of "Howard" families, which clearly cluster around a few areas.

Figure 2: The distribution of "Howard" families in 1881

John Mannion agrees with Smith and MacRaild's conclusions about migration. In "Old World Antecedents, New World Adaptations" he argues that people tend to follow migration "channels". That is, someone from their village went before them and came home to say how great it was. Mannion was looking specifically at Irish migrants to Newfoundland and was able to show that villages that had already sent migrants to Newfoundland were vastly more likely to continue to do so than somewhere without the same history. That means the first "McCaffrey" in London was far more adventurous than the 351st. In fact, we could suggest that in many cases the first McCaffrey drew the other 350 over time by breaking the ice. I am interested in why that first McCaffrey decided to come to London, and what factors might have influenced his or her decision to do so.

The reason I think the OBO corpus is a useful set of records for monitoring this growing diversity of migrants is that I'm quite firmly convinced the Old Bailey is the place you took people you didn't know when they had wronged you. I believe strangers (including immigrants and migrants) were much more likely to be subjected to the official justice system than were people with deep roots in the community. If a stranger wronged you, they had to be caught and punished quickly, or they might disappear. If your long-time neighbour steals your linen tablecloth, you have lots of options for how best to deal with them. You could smack him, you could set the dogs on him, you could knock on the door and demand it back. You had options, and time, because you knew he would be there tomorrow. And the next day. You don’t have that same assurance with someone you've never seen before. And that meant, I believe, that people were more likely to seek a legal response than to choose a community resolution when dealing with strangers or those they did not know very well. In John Beattie's wonderful book, Crime and the Courts in England: 1660-1800, he appears to agree:

In the small-scale society of the village a prosecution may not have been the most effective way to deal with petty violence and theft. Demanding an apology and a promise not to repeat the offense, perhaps with some monetary or other satisfaction, may have been a more natural as well as a more effective response to such an offense, or perhaps simple revenge directly taken (Princeton: Princeton University Press, 1986, p. 8).

If I am correct in my assumption then immigrants are more likely to appear in the Old Bailey records than established members of the community (at least as a defendant), and are even more likely to do so shortly after they arrive in London as opposed to several generations later. That means that there is likely a reasonably strong connection between the date a name first appears in the OBO corpus and the date that name first appeared in the London area. It may not be a precise way to measure the arrival of new names, but I would hazard to say that in most cases it's probably accurate to within a few years or a decade at the most.

Therefore, one way to find new families with few if any connections to the locals arriving in the area is to look for the first time a given surname appears in the records. Considering the nature of the Proceedings, most names that appear in the record likely refer to people in London as opposed to strangers living far away. That's not always going to be true but for the most part it's a reasonable assumption. That means by measuring the rate at which new names appear in the OBO, we should get a reasonable if rough idea of the rate of immigration from distinct family groups over time.

To isolate surnames I've taken all of the 60,000 names present in the London area in the 1841 census and checked them against the "person words". This returned a list of just over 20,000 surnames. That means one third of all unique surnames in London at the end of the period have shown up in the Old Bailey corpus at some point or another. I'm sure I've missed a few, particularly those spelled phonetically, but this is probably a pretty good start.
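The matching and dating of surnames can be sketched in Python as follows. This is an illustration under assumptions rather than my working code: it assumes the person words have already been grouped by year, and the function names and miniature data are invented for the example.

```python
from collections import Counter


def first_appearances(person_words_by_year, census_surnames):
    """Map each census surname to the earliest year it occurs as a person word."""
    first_seen = {}
    for year in sorted(person_words_by_year):
        for word in person_words_by_year[year] & census_surnames:
            # setdefault keeps the earliest year and ignores later sightings.
            first_seen.setdefault(word, year)
    return first_seen


def new_surnames_per_year(first_seen):
    """Count how many surnames make their first appearance in each year."""
    return Counter(first_seen.values())


# Invented miniature data, for illustration only.
person_words_by_year = {
    1834: {"mccaffrey", "smith"},
    1676: {"smith", "brown", "the"},
    1679: {"brown", "allen"},
}
census_surnames = {"smith", "brown", "allen", "mccaffrey", "howard"}

first_seen = first_appearances(person_words_by_year, census_surnames)
per_year = new_surnames_per_year(first_seen)
print(sorted(per_year.items()))
```

Person words that are not census surnames (such as "the") drop out at the intersection step, and census surnames that never reach the courtroom (such as "howard" in this toy data) simply never receive a date.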

You can see when each of these names first appears in the corpus in Figure 3.

Figure 3: Number of new surnames in the OBO corpus per year.

What Figure 3 shows is that the rate of new surnames arriving is actually fairly stable over the course of the eighteenth century. As mentioned in the previous post, the big dip around 1700 is caused by missing data and very short trial accounts, and we might be wise to assume that the entries around 1715 should actually be lower if we had the full set of trials as more names would appear earlier, filling in the gap. The long slow decline therefore over the course of the eighteenth century might actually be better understood as a reasonably flat line hovering around 100 new surnames per year and declining slightly towards 50 or 60. But it turns out that is not the whole story, and the clue to that is the increase in new names in the years immediately following the Napoleonic Wars c. 1815.

After the fall of Napoleon at Waterloo it seems quite clear that there's an influx of new surnames into the London area. I've got a suspicion that the cause of this influx is decommissioned soldiers and sailors who were dumped in London (or found their way there) after the war and got themselves into trouble. War collects soldiers and sailors from far and wide and brings them together. When those soldiers and sailors are no longer needed they're released to go on with their lives.

It would seem that after two decades of war enough people had been uprooted from their native regions by this process for a long enough period that they felt no inclination to go back home. Instead some of them obviously resettled in London or the growing industrial cities in the north, which seemingly offered greater opportunities or were more germane to their skills than the family farm. In fact, many people may have found themselves without a farm to go back to, since the enclosure movement had been consolidating arable land into much larger units throughout the second half of the eighteenth century, leaving many people landless. That landlessness may have attracted them to the army or navy in the first place, and now with military life behind them they had to find something else to do with themselves. London, it would seem, was it. At least for some.

This trend of more new names after the Napoleonic War doesn't appear to be an isolated incident; it's merely the most obvious case. Instead we see similar patterns in other major wars and domestic conflicts involving the British, the results of which can be seen in Figure 4.

Figure 4: Number of new surnames in the OBO corpus per year, colour coded to highlight periods of war and peace

Figure 4 shows the same number of new surnames per year arriving in the OBO corpus, but this time is highlighted to show some of Britain's major wars and domestic conflicts in the latter-half of the eighteenth century. The wars depicted in red are:

The Seven Years War (1756-1763)
The American Revolutionary War (1775-1782)
The French Revolutionary Wars (1793-1802)
The Napoleonic Wars (1803-1815)

While the American Revolution wasn't officially settled until 1783, it was effectively over by the end of 1782. The grey bar between 1802 and 1803 separates visually the two wars with the French.

The black bars represent years in which significant domestic conflicts took place:

The Jacobite Rising of 1745 (1745-1746)
The Gordon Riots (1780)
The Irish Rebellion (1798)

In nearly all cases we see a decrease in the number of new names showing up shortly after a war or domestic conflict erupts. This is most evident for the Jacobite Rising of 1745. The apparent dip in migration at this point makes sense; who wants to move when there's a rebellion going on? This dip is then followed by lower than average numbers of new names until the conflict ends, at which point almost invariably the following years experience an above average result. This is evident both for domestic conflicts as well as international wars. The pattern appears again and again.

The differences between the average number of new surnames per year during war compared to the average in the five years after the end of a war are in fact statistically significant, at least for the American Revolution and the combined French Wars (paired t-test: p = 0.0418 and p = 0.0001 respectively; the significance threshold in this case was p < 0.05). The Seven Years War does not pass the t-test (p = 0.1346). However, I am confident we are seeing evidence of the same trend. While my statistical skills are rather rudimentary, I think it's worth noting that failing a t-test does not mean something is not true. Instead, it suggests the numbers alone cannot support that conclusion beyond all reasonable doubt. For me, the fact that the latter wars are so obviously following this trend strengthens my confidence in a similar trend for the Seven Years War, and we can see this in Figure 5, which shows the average number of new surnames per year during and after the three wars.

Figure 5: The average number of new surnames per year during the wars and in the five years following the wars
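If you would like to run a comparable wartime versus post-war comparison without a statistics package, one option is an exact permutation test on the difference in means. To be clear, this is a different technique from the t-test I reported above, and the yearly counts below are invented purely to show the mechanics; they are not the OBO figures. The function name `permutation_p_value` is also hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(wartime, postwar):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(wartime) + list(postwar)
    observed = abs(mean(postwar) - mean(wartime))
    n_post = len(postwar)
    n_war = len(pooled) - n_post
    total = sum(pooled)
    extreme = 0
    count = 0
    # Try every way of relabelling n_post of the pooled years as "post-war".
    for idx in combinations(range(len(pooled)), n_post):
        post_sum = sum(pooled[i] for i in idx)
        diff = abs(post_sum / n_post - (total - post_sum) / n_war)
        if diff >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


# Invented counts of new surnames per year, for illustration only.
wartime = [52, 48, 55, 50, 47, 53, 49, 51]   # eight hypothetical war years
postwar = [70, 66, 74, 68, 72]               # five hypothetical peace years

p = permutation_p_value(wartime, postwar)
print(p < 0.05)
```

The test asks how often a random relabelling of the years would produce a gap at least as large as the one observed, which avoids assumptions about how the yearly counts are distributed.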

The strength of the correlation between these conflicts and the decrease in new names, followed by an increase in peacetime suggests to me that my original assumptions about newcomers getting in trouble with the law were correct. It also suggests some interesting things about migration patterns in the eighteenth century. That is, people migrated when they felt it was safe. During times of turmoil, they stayed put and waited things out.

There are implicitly two groups of people here, so each requires its own discussion, I think. Firstly there are the sailors and soldiers. The reason we don't see these people arriving in London during wartime is perhaps obvious: they were in the employ of the state, off fighting the enemy. Gathered from across Britain and Ireland, as mentioned above, when they were decommissioned they had the opportunity to move where they liked and it would seem some chose London. This may have disproportionately included sailors who may have hoped to find work in London's booming shipping business.

The second group are families or individuals who have decided for economic reasons to move to London. Since we don't have evidence that someone with that name lived in London previously, many of them are presumably amongst the first of their stock to try their hand at London living. This in itself should not be taken lightly, as moving to early modern London without a social support network was an incredibly lonely and dangerous prospect, which is why so many migrants failed and found themselves in gaol, or starving and desperate, looking for any chance to get away. Sadly, we see many immigrants like Sarah Holmes, who claim that they "have no friend but God" as they throw themselves at the mercy of the courts.

What does all this mean? What can we learn about these arriving surnames? War and domestic conflict are not the only variables at play here, but I think it puts forth a reasonable case for the effects of war and peace on migration patterns of those moving towards London in the long eighteenth century. Returning to the idea mentioned earlier about the first McCaffrey (or the first of any family), it seems that families were only too willing to bide their time during periods of war, waiting instead for peace before making their way to a new life in London. We can see why this strategy might have been appealing. Why take a risk when the country is at war?

Unfortunately it may have been the wrong approach from an economic standpoint. According to Ball and Sunderland's "An Economic History of London, 1800-1914", the gap between real wages and cost of living peaked just after peace was called with France in 1817 (p. 95). That means people were most desperate when the government realised it actually had to start paying for the war it had waged. It seems to me likely that the two trends are actually connected. As new families arrived they may have been desperate for any work, forcing down the price of labour in London as a surplus of workers vied for jobs. It may seem counterintuitive, but these data suggest it's best to move during war rather than after.

Nevertheless, this unlikely set of criminal records has provided, I think, an interesting window into wider migration strategies across the eighteenth century. And it came about not by looking at how many people arrived, but when unique groups of people likely first emerged. London it would seem was home to an increasingly diverse population. That population continues to diversify to this day. And though I'm sure there are some Brits who might see war as a strategy for keeping the net migration in the "tens of thousands", take heed, for when peace comes, so too will the immigrants.

Whose Lexicon? The Impact of Reporters and Editors on the Old Bailey Proceedings

2013-01-09T09:04:00.003-05:00

A few weeks ago I was discussing early modern vocabulary with Tim Hitchcock (as one does on a Wednesday evening). If I recall correctly, he felt that new words likely appeared at roughly the same rate as old words disappeared from the language. In essence, we're not getting a bigger vocabulary, we're just using an ever-shifting one. Ben Schmidt's blog posts on "Age Cohort and Vocabulary Use" and "Predicting Publication Date and Generational Vocabulary Shift" would tend to support this idea. The basis for Schmidt's article was an analysis of the age of authors and how their age impacted the way they used words in 19th century literature. Schmidt found that people learn to use language in a certain way in their youth and tend not to change those patterns very much as they age. This accounts for the slightly different languages whippersnappers and their grandparents speak, even to this day.

I decided to see what I could find out about vocabulary use by digging into the Old Bailey Online (OBO). As many of you undoubtedly know, the OBO is a wonderful corpus of electronic text for anyone interested in Early Modern London. The OBO is an electronic XML version of the Proceedings of the Old Bailey, an abridged transcription of what was said in court for each case held in the Old Bailey courtroom between 1674 and 1914. What we have is not an exact facsimile of every word spoken, but what Magnus Huber believes is “guided by” what was said in court, capturing the ideas, if not always the exact words of the speaker. Though not a perfect transcription of speech, Clive Emsley believes that we can put our faith in the events described in the Proceedings because “the Old Bailey Courthouse was a public place, with numerous spectators, and the reputation of the Proceedings would have quickly suffered if the accounts had been unreliable”.

Originally intended as a profit-making venture, entertaining the masses with tales of woe in the courtroom, the Proceedings became the official record in 1778 and were required to present a “true, fair and perfect narrative”. Practical limits of course made this difficult. The Proceedings were created entirely without electronic recording devices by shorthand reporters. Many trials therefore appear in significantly condensed form, such as the six-hour trial of Charles Stokes and company from 1787 that is recorded in only 468 words in the published version. Unfortunately, I do not have Huber's annotated version of the OBO corpus. I do, however, have the full set of XML files, downloaded from the Old Bailey Online website (to learn how to do this check out the Programming Historian 2). I decided to focus on the records between the beginning in 1674 and the foundation of the Central Criminal Court in 1834. That gave me 161 years' worth of early modern transcripts for just over 100,000 trials and 51 million words.

What I found was that even with such a wonderful resource we cannot be sure what people said to one another and how closely those speech events relate to the written records we have left. None of us knows how to speak like an eighteenth century Englishman, and based on my digging I don't think that the OBO can teach us how. That's because the vocabulary used in the OBO is the vocabulary of the people who recorded and edited the trial transcript and not of a wider communal lexicon. Instead of looking into vocabulary, I quickly realized I needed to be thinking about vocabularies. When it comes to the original assumption, we're not getting a bigger vocabulary, we're just using an ever-shifting one, the real question is: who is "we"?

Before coming to this conclusion, I looked at the rate of unique words entering the corpus over time. When I calculated this there were more than 120,000 unique words in the corpus, the introduction of which can be seen in Figure 1, showing a surprisingly linear increase. I should note that I'm defining a word as a unique string of characters. "word" and "words" are two different words for my purposes - that is, the corpus was not lemmatized.

Figure 1: The size of the lexicon in the Old Bailey Online
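A minimal sketch of how a count like this could be made, not the code behind the figure. It assumes the trial text has already been pulled out of the XML into chronologically ordered (year, text) pairs; the tokenizer (lowercasing, letters only) and the sample sentences are illustrative assumptions, since the post only specifies that a word is a unique string of characters and that nothing was lemmatized.

```python
import re

def tokenize(text):
    # Assumption: lowercase and keep runs of letters. No lemmatization,
    # so "watch" and "watches" count as two different words.
    return re.findall(r"[a-z]+", text.lower())

def lexicon_growth(sessions):
    """Return {year: (new_words_that_year, lexicon_size_so_far)}.

    sessions: iterable of (year, text) pairs in chronological order.
    """
    seen = set()
    growth = {}
    for year, text in sessions:
        new_words = set(tokenize(text)) - seen
        seen |= new_words
        new_so_far_this_year = growth.get(year, (0, 0))[0]
        growth[year] = (new_so_far_this_year + len(new_words), len(seen))
    return growth

# Invented sample text, for illustration only.
sample = [
    (1674, "the prisoner stole a watch"),
    (1674, "the prisoner stole two watches"),
    (1675, "a watch was found on the prisoner"),
]
growth = lexicon_growth(sample)
for year, (new, total) in sorted(growth.items()):
    print(year, new, total)
```

Plotting the second number against the year gives a cumulative curve of the kind shown in Figure 1; the first number gives new words per year.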

Since I work on immigration to London this got me wondering if what we're actually seeing is not necessarily a growth in the vocabulary, but a diversification of names present in the metropolis. Trial transcripts typically discuss people, so we do get a lot of names popping up. New names can come from new people arriving in the city with unique names that no one else had, or when someone pronounces their name with a thick enough accent that a phonetic variation appears in the corpus as a new word (Callaghan, Calaghan, Callagan, Colligan, Calahan, Callaham, Callahan, Calleghan). Early modern parents were still largely not very adventurous with their names so we don't see a lot of children named "Apple" or "Harper Seven" and so creativity is not likely going to be a big factor driving the growth of new names. The OBO XML tags make this relatively easy to test. The "persName" tags allowed me to identify and extract all words used to describe people. As it happened, there were roughly 60,000 such names - about half of the entire lexicon.
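The extraction step might look something like the sketch below, which uses Python's standard-library XML parser. Only the "persName" tag name comes from the post; the surrounding element names and the sample trial are made up for illustration, and the sketch assumes the files use no XML namespaces.

```python
import xml.etree.ElementTree as ET

def person_name_words(xml_text):
    """Return the set of lowercase words found inside persName elements."""
    root = ET.fromstring(xml_text)
    words = set()
    for element in root.iter("persName"):
        # itertext() also collects text sitting inside any child elements.
        name = " ".join(element.itertext())
        words.update(name.lower().split())
    return words

# Hypothetical snippet, not copied from the Old Bailey Online files.
sample = """<trial>
  <p><persName>John Callaghan</persName> was indicted for stealing a watch
  from <persName>Mary Green</persName>.</p>
</trial>"""

names = person_name_words(sample)
print(sorted(names))
```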

I also decided to check if the new words represented use of English words, or if they were archaic spellings, acronyms, or otherwise misspelled words, so I ran the entire set through a series of four English-language dictionaries to see if we were dealing with recognizable words. These dictionaries included 90,000 unique "words", including lemmatized variations. Each word could therefore be a "dictionary word", a "name word", a combination of both, or neither.
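The four-way split described above amounts to two set-membership tests. A small sketch, assuming the dictionary words and the name words have each been loaded into a Python set; the category labels and the tiny word lists here are placeholders rather than the real lists.

```python
def categorize(word, dictionary_words, name_words):
    """Label a word as 'both', 'dictionary', 'name', or 'neither'."""
    in_dictionary = word in dictionary_words
    in_names = word in name_words
    if in_dictionary and in_names:
        return "both"
    if in_dictionary:
        return "dictionary"
    if in_names:
        return "name"
    return "neither"

# Placeholder lists for illustration only.
dictionary_words = {"green", "burglary", "watch"}
name_words = {"green", "callaghan"}

labels = {
    word: categorize(word, dictionary_words, name_words)
    for word in ["green", "burglary", "callaghan", "burghlary"]
}
print(labels)
```

Note that a modern dictionary puts an older spelling such as "burghlary" in the "neither" group.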

The results can be seen in Figure 2, which shows what category the new words fall into, graphed over time along the same scale on the y-axis.

Figure 2: The size of the lexicon at each trial session, broken down by the category of word

The results are, I think, interesting. I'll discuss the "Names List" and the "No List" entries in future posts. The bottom left graph is words that appear both as names and as words in the dictionary. This includes words such as "green", which can be both a person's name and a word describing the colour of something. I've decided not to disambiguate between the two on a word-by-word basis for time reasons.

For now, I'd like to focus on the dictionary terms. Dictionary words seem to be on the rise throughout the eighteenth century. The reason for this may in fact be a slight growth in the lexicon, but I think more likely is a tendency towards increased standardization in the spelling of English words. Samuel Johnson's Dictionary appears in print for the first time in 1755, so this is the age of standardization in lexicography. As anyone who has ever tried to read a seventeenth century text knows, people spelled (spelt?) words differently back then. Over time the "accompts" of criminal activity transform into "accounts". People stop committing "burghlary" and are instead charged with "burglary". This of course occurs shortly before the last "souldiers" are sent off to war.

My four "dictionaries" were built for modern use rather than seventeenth or eighteenth century vocabulary, which means the figures above are a better indicator of when standardized spellings of English words were adopted than they are measures of the lexicon. "Burghlary" should really be counted as a variation of "burglary" rather than as a unique word in the "no list" category. A linguist would tell me that I should have lemmatized my corpus.

But don't despair, not all is lost. I think we still can learn a thing or two about the lexicon as well as the OBO records themselves from this analysis. In Figure 3 you can see the number of new dictionary words per year introduced into the corpus.

Figure 3: The number of new words per year introduced into the OBO corpus.

The number of new terms is highest in the early years. This makes perfect sense as the corpus size starts at zero on the date of the first trial. Before a word appears in the corpus someone has to use it, and that takes time. The big peak in 1689 is an anomaly caused by a single account that was much longer than typically found at the time. Most big peaks can be traced to particularly long accounts, since these generally represent a reporter going into much more detail and therefore using a wider vocabulary. The dip in the early years of the 18th century represents a series of particularly short accounts, as well as some missing accounts. Where it all gets interesting for me is at the first arrow around 1715.

What we see at this first mark is a fairly high number of new dictionary words appearing each year until about the 1740s. The number of new words around 1715 may be artificially high, since we do seem to be missing entries in the previous decade and presumably some of those new words would have appeared earlier if we had the records. Nevertheless, there do appear to be more new words than normal in the following two decades. Perhaps surprisingly, the publication of Johnson's Dictionary, at the second arrow marker, is actually a low point for new word growth in the middle of the eighteenth century. This to me suggests that Johnson was in many respects responding to generally accepted norms of spelling and word use rather than driving the adoption of such uses.

The last arrow is for me the most interesting. The number of new words per year again increases significantly just after 1778, which as I mentioned above was the date that the Proceedings of the Old Bailey became a "true, fair and perfect narrative" - an official record of courtroom activity. Perhaps this shift from a popular to an official record meant a significant change in what went into a trial account. That does seem to be part of the answer. The length of the Proceedings does slowly start to increase, starting in 1783 when the graph jumps upwards.

The fact that there are spikes in new words every time a long transcript appears reinforces the fact that the Proceedings are using a specific vocabulary - one related to criminal justice - as opposed to a vocabulary that's representative of the entire active English lexicon. But the spikes in new words, as well as the number of words in a trial transcript, can tell us even more, if we look at who was writing those words down.

As mentioned previously, Magnus Huber believes the Proceedings are "guided by" what was said in court. Before a spoken word appeared in the Proceedings the original speech event was converted to shorthand by a courtroom reporter and then converted back to prose by the workers in the print shop before being committed to paper. Words represent the attempt of one person to communicate an idea to another. As above, "burghlary" and "burglary" refer to the same idea. The difference between the two spellings is merely a choice in how to record the sounds using letters.

So what effect does changing the scribe have on the rate of new dictionary words entering the corpus? Apparently, quite a lot. I've located the names of the scribes from 1749 onwards in Huber's article, "The Old Bailey Proceedings, 1674-1834: Evaluating and annotating a corpus of 18th- and 19th-century spoken English". When we look at the number of new dictionary words each scribe introduces into the corpus (Figure 4), we see it's certainly not even across the board.

Figure 4: The number of new dictionary words per session added to the corpus, coloured by courtroom reporter.

I recognize there's a lot going on there, so I'll break this down into chunks that are easier to see. What I'd like to draw your attention to is the fact that some scribes increase the size of the corpus significantly, and others do not.
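One way to put a number on the difference between scribes is to average the new dictionary words per session over each reporter's tenure. The sketch below is only an illustration of that idea: the tenure table uses placeholder names and years (the real ones would come from Huber's article), and the per-session counts are invented.

```python
from bisect import bisect_right
from collections import Counter

# Placeholder tenures as (first year on the job, reporter), sorted by year.
# These are NOT the dates given in Huber's article.
TENURES = [(1750, "Reporter A"), (1760, "Reporter B")]

def reporter_for(year, tenures=TENURES):
    """Return whoever held the post in the given year, or None if unknown."""
    starts = [start for start, _ in tenures]
    index = bisect_right(starts, year) - 1
    return tenures[index][1] if index >= 0 else None

def average_new_words(new_words_per_session):
    """Average new dictionary words per session, grouped by reporter.

    new_words_per_session: iterable of (year, new_word_count) pairs.
    """
    totals = Counter()
    sessions = Counter()
    for year, count in new_words_per_session:
        reporter = reporter_for(year)
        totals[reporter] += count
        sessions[reporter] += 1
    # Averaging per session lets long and short tenures be compared.
    return {reporter: totals[reporter] / sessions[reporter] for reporter in totals}

# Invented counts for illustration.
sample = [(1751, 10), (1755, 20), (1761, 40), (1762, 60)]
averages = average_new_words(sample)
print(averages)
```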

Firstly, let's look at Hodgson, who was holding the transcriber's pen from 1782 to 1792 (Figure 5).

Figure 5: The number of new dictionary words in the Old Bailey corpus from 1781 to 1795, coloured by courtroom reporter.

Though W Blanchard was only on the job for a few months, it's quite clear that E Hodgson was on average adding more new words to the corpus each month than had his predecessor. The number of new words Hodgson adds starts slowly, but then picks up rather dramatically before tailing off towards the end of his tenure. Hodgson's "reign" so to speak also overlaps with the significant growth in the size of the Proceedings mentioned above. From the graph and the trendlines, we might make the following conclusions:

Hodgson was verbose and reported more than his predecessors
He had a larger than average vocabulary that he happily shared
After a few years he "used up" his vocabulary and ceased to find as many new words

Hodgson has therefore influenced both the vocabulary used in the Proceedings as well as the length of the resultant documents. We would be naive therefore to suggest that Hodgson was an impartial observer or that the Proceedings during this period are anything but the output of Hodgson himself. The records were not "created"; they were created by Hodgson. The way Hodgson created the records was distinct from how the others did so.

Moving forward slightly in time, let's consider the next three scribes who wrote between 1792 and 1801 (Figure 6).

Figure 6: The number of new dictionary words in the Old Bailey corpus from 1792 to 1801, coloured by courtroom reporter.

The effect of different writers here is perhaps more obvious. Silby doesn't seem to be one for new words, whereas Marson and Ramsay in the blue definitely are. The fact that Ramsay stays on afterwards and the growth of the vocabulary is stunted thereafter suggests that it was Marson driving this change. It's becoming clear that anyone working with these records should be wary of who was responsible for creating them in the first place.

The last section I'd like to highlight is the final one from 1816 to 1834 when a single scribe named H Buckler was on the job. However, Buckler worked under a series of editors as can be seen in Figure 7.

Figure 7: The number of new dictionary words in the Old Bailey corpus from 1816 to 1834, coloured by editor.

In this case, the scribe stays the same yet we still see patterns that seem to make more sense when we know there's a different editor publishing the Proceedings. Clearly when Stokes takes over in 1828 there's a change to the resultant lexicon that spikes up, presumably after he had become confident in his new role after a few months on the job. This set is particularly interesting because from what we can tell, the scribe H Buckler doesn't seem to be the one driving the adoption of new words. From 1816 to 1828 he's one of the less ambitious in this category, but that begins to change as new editors take over.

Conclusions

How does this all tie back with my original questions about vocabulary in the early modern era? Well, first of all, I failed to test what I set out to understand. Because I did not lemmatize my corpus I was not able to determine if we do in fact have a growing vocabulary or a shifting one. What I should have looked at was a moving window of word use, calculating how many words were used in a given ten-year period. I'll leave that for another day or another researcher to take on if they feel so inclined. My suspicion is that we have both a growing and shifting vocabulary. Words are falling into disuse or at least out of regular use. I imagine that Ben Schmidt's analysis explains most of that shift: young kids don't learn - or at least don't use - the same words as their parents or grandparents. Words die one funeral at a time.
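For whoever does take it on, the moving-window measure could be sketched roughly as follows. This assumes the words used in each year have already been collected into sets (ideally after lemmatizing or normalizing spellings); the function name, the window handling, and the sample years are illustrative choices, with a two-year window standing in for the ten-year one.

```python
def active_vocabulary(words_by_year, window=10):
    """Count distinct words used in each window of consecutive years.

    words_by_year: {year: set of words used that year}.
    Returns {start_year: distinct words used in start_year .. start_year + window - 1}.
    """
    if not words_by_year:
        return {}
    first, last = min(words_by_year), max(words_by_year)
    sizes = {}
    # Only full windows are counted, so the last start is last - window + 1.
    for start in range(first, last - window + 2):
        active = set()
        for year in range(start, start + window):
            active |= words_by_year.get(year, set())
        sizes[start] = len(active)
    return sizes

# Invented years; the spellings are borrowed from the examples in the post.
sample = {
    1700: {"accompts", "souldiers"},
    1701: {"accompts", "burghlary"},
    1702: {"accounts", "burglary"},
    1703: {"accounts"},
}
sizes = active_vocabulary(sample, window=2)
print(sizes)
```

A flat line would point to a shifting vocabulary, a rising one to a growing vocabulary. Without normalizing spellings first, "accompts" and "accounts" inflate the count where the two overlap.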

The reason I wasn't able to see a shifting vocabulary using the OBO corpus is that the OBO does not represent a single person's vocabulary. Instead, it roughly represents the combined vocabulary of people who appear one way or another in the Old Bailey courtroom to discuss matters of law and justice, filtered through a courtroom reporter and an editor. As I discovered, those last two have a much bigger impact on the corpus than we might have liked to imagine. For anyone who looks at the language of the court or indeed the format of the transcripts by using the OBO corpus, I'd urge you to keep the impact of those reporters and editors in mind. For anyone studying academic history, do note that what seems like a window into the past is in fact the product of a few individuals who made decisions they may not even have been aware of that impacted what was recorded, what was not, and what words they used to do so.

Crymble Awards: Digital Humanities & History Best of 2012

2012-12-31T08:03:00.000-05:00

In keeping with the tradition I established last year, I thought it fitting to again take a moment on the last day of the year to acknowledge some of the people and projects that inspired and influenced my academic growth the most in the past year.

These days there is much talk about the impact of research - particularly in the UK with the impending "REF" that has academics across the country scrambling to demonstrate that they are "world class". We're always looking for ways to quantify who has the most influence, and unfortunately that's more often than not meant counting up citations. But the people I cite are not always the ones who have given me most cause to think, or those whose efforts I've appreciated the most. And that's where the Crymble awards are important for me. They're a chance to acknowledge that which often goes unacknowledged. And a chance to challenge the notion that a footnote is the only worthwhile measure of success.

Last year the awardees included six men who worked on five projects: Tim Hitchcock, William J. Turkel, Tim Sherratt, Ben Schmidt, Sean Kheraj, and Jeremy Boggs. Though all six continue to produce inspiring work, I've decided to exclude past winners. I will however stick with five projects as the magic number of awards. And this year I'm pleased to say the gender imbalance is improving - if only slightly.

In no particular order, I present this year's winners:

Julia Flanders, "Faircite: towards a fairer culture of citation in academia"

I emailed the journal Digital Humanities Quarterly back in January to suggest that the journal adopt fairer citation practices that would see a wider range of team members on collaborative digital humanities projects credited publicly for their work. The response I got from Julia the editor was astounding, full of energy and enthusiasm. She encouraged me to pursue the matter further, gave the idea a catchy name, and even helped put together a formal proposal for the Alliance of Digital Humanities Organizations which she supported and presented. I'm happy to report a number of projects have responded positively to Faircite and have begun offering more inclusive suggested citations that acknowledge the work of their team members - a trend I hope to see continue to grow. (See the Old Bailey Online, and Voyant Tools for examples). Thanks to Julia for her support.

Luke Blaxill, "Quantifying the Language of British Politics 1880-1914" Paper presented at King's College London, 2 November 2012.

Luke's a (recently) former colleague of mine at King's College London. His work ties together corpus linguistics with historical inquiry in a way that's so simple, yet so effective. What I love about Luke's research is that his distant reading approach to history has so effectively challenged conventional historical wisdom in a way that close reading could never highlight. I've incorporated many of the principles and tools I learned of through Luke into my own research. You can hear a version of his talk on the IHR History Spot archive. For those of you looking for a digital humanist with some great textual analysis skills, you'll be happy to hear Luke has recently received his doctorate in history and digital humanities, and I'm sure he'd love to hear from you!

Peter King, "Ethnicity, Prejudice and Justice, The Treatment of the Irish at the Old Bailey 1750-1825" (under review).

It's always a little frustrating to find someone's just published the very topic you had been working on. But in this case, the experience turned out much better than I could have hoped for. Peter King is a Professor of History at Leicester University and has very willingly shared his pre-publication work with me, answered questions, provided advice, and even made a trip down to London to watch me present a response paper to his work. I think it's fair to say he has been amicably combative about my work, and has pushed me to continue to improve.

Andrew Marr, The Open University et al. "Andrew Marr's History of the World" BBC One. Television Series: October-November 2012.

This one may perhaps look like the "one of these things just doesn't belong here" entry. Andrew Marr is a BBC journalist who presented an eight-part "world history" this autumn in the UK. I found myself unable to get enough. The team behind the series did an amazing job of finding specific examples to illustrate broader themes that captured what was unique about entire civilizations. It forced me to consider the scope of my own piddly 20 year local study and how that fits into the broader spectrum of human history. Kudos not just to Andrew Marr, but also to the Open University and everyone involved in the project. I'm sure there were many of you.

Fred Gibbs and Miriam Posner "The Programming Historian 2"

Fred and Miriam joined the Programming Historian 2 team of which I'm a part right at the beginning and have been invaluable to getting the project off the ground. Miriam has taken on the role of our outreach officer, and Fred one of our general editors. Thanks very much to both of them for all their hard work on the project. (And though I said I wouldn't re-award past winners, I feel compelled to mention that William J. Turkel and Jeremy Boggs are also instrumental team members!).

Thanks to this year's winners for all your inspiration.