Saturday, August 31, 2013

Applications open for Five Solutions: Digital Sustainability for Historians


Five Solutions to What?

Historical scholarship is increasingly digital, and yet we do not have an agreed set of best practices for ensuring that digital scholarship lasts. Five Solutions is looking for five scholars able to outline a solution to the issues of sustainability now facing historians. This one-day workshop asks participants to give a 15-minute presentation outlining practical solutions to one of five challenges, with the resources and expertise of an ordinary working historian in mind. These presentations will form the basis for a one-day workshop on practical strategies for digital sustainability. The presentations can be based on your own experience and ideas, or can be taken on as a research project. We will work with all participants to ensure that the final presentations are both technically workable and illustrated with the most appropriate datasets.

Accepted participants will each receive a £350 honorarium.*

The Five Themes

The following five themes are designed to get you started, but if you have other ideas, we’d love to hear about them. Each theme should be approached with the ordinary working historian in mind.

1.     Preserving research data for the future
2.     Curating an enduring professional online persona
3.     Paying project costs after the money runs out
4.     Capturing and documenting the expertise of temporary staff
5.     Strategies for working together on larger projects

Who Should Apply?

We’re looking for people with passion. Scholars old or young, university students of any level, librarians, archivists, developers, designers, system administrators, or anyone who considers themselves a historian at heart. No specific qualifications or prior experience required - just an interest in helping academia find solutions to organizational and technological challenges facing the sustainability of our digital projects.

What do I have to do?

Figure out a solution, of course! Once you’ve come up with your solution, you’ll share your work in two ways:

1.     A 15-minute presentation of your solution at a one-day conference in London, UK on the 28th of November 2013 at the Institute of Historical Research.

2.     A 1500-2000 word peer-reviewed tutorial outlining your solution to be published in the spring of 2014 in the Programming Historian 2 and distributed as part of ‘IHR Digital’.

All tutorials will be peer-reviewed and released under a Creative Commons CC-BY license. Participants will have the full support of an editor at the Programming Historian 2 who will provide guidance for writing an effective, practical tutorial.

Evidence of previous work with technical writing, or a willingness to learn, and a strong command of the English language are a bonus.

How do I apply?

By 8 October 2013 send a two-page C.V. and a brief email to (subject line: Five Solutions) addressing the following questions:

1.     What theme would you like to tackle? (Use one of our suggestions or come up with your own.)
2.     Give us an idea of how you plan to solve this issue, or where you intend to look for a solution (max 200 words)
3.     What skills or experiences make you the ideal person for the task?

We apologize in advance, but we are limited to five scholars.

* Our funding restrictions allow honorariums for UK-based participants only, though we are happy to receive applications from those abroad who have access to their own travel funding and who would like to participate.

Project Support By
And by the AHRC Theme Leader Fellowship for its Digital Transformations Theme.

Monday, August 5, 2013

Can We Reconstruct a Text from a Wordcloud?

We’ve all seen word clouds. Many of us have even wondered if they’re of any value. I have used word clouds in the past; I find them useful in presentations when I want to highlight the relative importance of certain words over others. For example, I often use this word cloud to the left, to show the most common Irish surnames in the London area during the early 19th century. I hope my listeners will note that Murphy or Sullivan is more common than Burke or Foley, without me having to take the time to explain the connection between word-size and significance.

I’ve also used word clouds in analysis. In a previous post I discussed how I was able to use the below word cloud to show the relative frequency of topics found in the Gentleman’s Magazine between 1800 and 1820, which allowed me to get a pretty good idea of what the gentry and the middle class were interested in during that period.

I think both of those uses for word clouds have been productive. They’ve allowed me to transmit ideas, and formulate my own thoughts on a set of data in an effective manner. But I began to think about other uses, and I began to wonder about the process of getting back to the original data. Word clouds take the individual words (tokens) out of context. As I mentioned in my last post, we think in metaphors, or ideas. Not in words. That means a word cloud reduces a single idea such as “green bowl” into two tokens “green” and “bowl”. It then combines the word “green” into a single graphic based on how often it appears in the text. The program does not take into consideration the fact that “green” as it refers to a bowl is entirely different than Mr. Green or Green Park. An article about Mr. Green’s picnic in Green Park with his favourite green bowl might give you a skewed idea about the importance of the word green, here representing three completely different ideas, and in all three cases simply acting as modifiers to more important concepts (a man, a park, and a bowl).
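The flattening described above can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any particular word cloud tool: the sample sentence and the `tokenize` helper are invented for the example, and real tools typically also drop common words before counting.

```python
import re
from collections import Counter

# Invented sample text: "green" appears as a surname, a colour, and a place name.
text = ("Mr. Green took his favourite green bowl to Green Park. "
        "The green bowl was admired by everyone in Green Park.")

def tokenize(s):
    """Lower-case the text and split it into bare word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", s.lower())

# Counting tokens merges all three meanings of "green" into one number.
counts = Counter(tokenize(text))

for word in ("green", "bowl", "park"):
    print(word, counts[word])
```

Once the text has been reduced to these counts, nothing in them records which "green" was the man, which the park, and which the bowl.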

Just for fun, I decided to do a test. I asked 4 colleagues, all experts on the criminal trial transcripts of the Old Bailey Online, to look at a word cloud of a trial. Each person was asked to describe what key information they could tell me about the crime. I was interested in knowing if they could tell me the who, what, when, where, and why details, and if they could reconstruct the basic building blocks from the prevalence of certain keywords. In the spirit of exploration, I played along as well and offered my own interpretation.

The word cloud was created at random by my wife without my knowledge of the trial. All I knew (and all the participants knew) was that the trial took place between 1801 and 1820, and was between 2,000 and 3,000 words long. The word cloud was limited to 75 unique words and common English words were removed. The resultant visualization looks like this:
I have colour-coded my assessment (blue) for aspects I got correct and (red) for the bits I got wrong. What struck me immediately was that this was a case involving theft. That’s a safe bet anyways, since about 50% of Old Bailey trials during this period were theft cases. It was the large number of nouns that led me to this conclusion. I know that trial transcripts always list the items that were stolen, and the testimony in the trial almost always discusses the various objects repeatedly as several witnesses are called to give their account of what transpired. In this case, I’d assume there was a large quantity of alcohol that went missing, ranging from red wine, to port, to gin – stored in bottles, measured by gallons, and in at least one case: a cask.

At the time it was stolen the booze was being stored in a cellar before it was transferred to a cart that was being driven by a horse (a mare, to be specific). Why Restoration actress Nell Gwyn appears in the set of words, I have no idea since she died over a century before this era, unless the name is a coincidence or perhaps refers to the name of a pub that lost its liquor.

There were a large number of people involved with giving evidence against the defendant including Messrs Hutt, Wells, Powell, Wood and Bagnigge, as well as possibly a Mr. Limbrick, and definitely someone named Hart – though again that may be the name of the pub. One of those men is likely the watchman and another an officer. Based on what I know about Old Bailey trials, this suggests there were a lot of witnesses, meaning the prosecutor was concerned that his case might not succeed.

The alcohol heist took place in the morning (and was perhaps discovered the following night), and the goods were then transferred down either Maiden-lane or City-road. The volume of goods stolen and the fact that death appears in the list suggest our defendant was found guilty and sentenced to death.

My conclusion: a pub named either Hart or Nell Gwyn located on City-road or Maiden-lane was robbed of a large volume of alcohol by a solo male defendant, who was found guilty and sentenced to death.


You can compare my assessment with the full trial transcript. It turns out I wasn’t that far off. There had been three defendants, but they were found guilty and sentenced to death. It had been a large alcohol theft. I wasn’t able to accurately pick out the fact that Messrs Wood, Powell, and Hart were the defendants, meaning the who aspect of the challenge had completely eluded me. I also didn’t recognize Bagnigge Wells, which was the location of the crime, not someone’s name. Nell Gwyn in this case was the pub that had its door pried open to reach the alcohol.

Participant 1

“Powell and Hutt were found guilty of breaking into the wine cellar of the Nell Gywnn inn in Bagnigge Wells (I know about this place) They were accused of stealing two gallons of wine, casks of gin and bottles of port from the cellar  belonging to Mr Hart . A Watchmen Mr Wood on his round at 1 o'clock saw a broken iron (lock)on the wine cellar door  and saw two men drive off in a horse and cart down Maiden lane towards City Road and immediately called for an officer. The men were stopped by an officer who examined the cart and found the cask of wine, gin and port belonging to Mr Hart and arrested them. They received the death penalty.”

Participant 2

“I would guess it involves stealing a hamper of goods including a bottle of wine from the back of a cart.  My suspicion is that the defendants were two women, and that the servant of the person who owned the cart/hamper discovered the theft and was called in evidence, though the actual owner wasn't.  The cart was on its way to or from northwest London to banigge wells, for a social occasion and some visiting, it happened at night (suggesting they were returning home) and there was a runner or 'officer' involved in the arrest.”


In these examples, the participants tried to be very specific about the details of the transcript, and in doing so were incorrect more often than not. The basics of the case, including the location of the crime, were, however, correct for Participant 1. This would suggest that expert readers are able to get some of the basics – though by no means all. However, that expertise does little to bring forth the specifics of the trial.

Participant 3

“This trial seems to have a richer vocabulary than most.  It looks like a theft case from a wine cellar (wine, port, cask, gin, gallons, hampers, bottles, etc. suggest as much), presumably at Bagnigge Wells; actually the prominence of the word cart, and also the word horse, suggest the material might have been in the process or coming or going there, perhaps parked on the City Road (or Maiden Lane). Went suggests action.  Some force was used: broken, crow, saw, which suggests that the cart was broken into.   There is a certain amount of vocabulary indicating how the culprit might have been apprehended: officer, stopped, examined, watchman, observed, charge—this suggests that officers were used to apprehend the defendant. There are some names: Powell, Gwyn, Hart, Mr, Nell, William.  The most frequently mentioned, Hart, is presumably the victim.  Timing is important: o'clock, morning, night—the prominence of night suggests that that is when the crime took place, with the suspect arrested in the morning?  Numbers indicate either the number of items stolen or the time of day (one or two in the morning).”


This one came out surprisingly accurate. The participant hadn’t recognized Nell Gwyn as the name of the actress. As in my own case, it proved impossible to figure out who was the defendant and who was the victim. However, the details of the crime and the process of apprehending the defendant are almost bang on. This participant didn’t try to reconstruct the narrative in the same way as Participant 1 or 2, and thus avoided many of the pitfalls. However, there has been no guess at the verdict, and while the basics of the trial are here, the richness of what actually is recorded in the transcript is nearly entirely lost.

Participant 4

“Geographical location
Wine Cellar –property crime
Bagnigge Wells – recognise this as the location of a rather seedy spa/pleasure
grounds on the outskirts of Clerkenwell towards Kings Cross.
City Road – not too far away runs from the City to Islington
Maiden Lane – small street that runs parallel to Covent Garden Piazza on the
south-side although there might be other Maiden Lanes
Time –mention of time o’clock and watchman (usually worked only at night)
and night although mentions morning (in the or the next)
Crime probably burglary because of the mention of the relevance of time
together with watchman
Stolen Goods: bottles of red wine, port, cask, gin from wine cellar
Broken into cellar with an iron bar, maybe crow bar
Probably took it away or hid it in a cart, certainly it is central to the plot and
perhaps discovery of the stolen goods. There may well have been a significant
amount of drink gallons, casks and bottles? Hampers – might have been used
to transport/hide the goods
Arrested – yes (examined) and officer (probably from one of the Police Courts
Observed - spotted
Names Hart – think this is a personal name rather than pub sign and possibly
Wood as it’s mentioned frequently but not as much as the things I think were
actually stolen
Gwyn –forename Welsh? Ha – just spotted nell (same size) and I know that
Nell Gywn was supposed to have lived at Banigge Wells. John, William, think
Hutt is a personal name (not a shed)
Death –punishment (so guilty)”


This one was particularly interesting, because the participant included their thought process as it related to the various words on the visualization. The fact that it was clearly a theft of goods, and that time was mentioned, led this person to conclude we were dealing with a burglary, which was correct. The conclusion that Hart was the name of a person rather than a pub was correct, but equally could have been wrong. While Nell Gwyn may have lived at Bagnigge Wells, that wasn’t really relevant to this case.


Can an expert on a historical source reconstruct the details of that source from a word cloud? It would seem the answer is: sort of. Of the five participants, two (#3 and #4) did an incredibly good job of getting some of the details. These people were able to reason out what the words meant by drawing upon their experience with the ways certain types of words were used in criminal trial transcripts. However, in both cases they were light on the details, and it would appear decided not to guess on elements of the crime that they couldn’t be confident in. That is, they spoke confidently when they were confident but otherwise stayed silent.

I think my analysis fell in the middle. I got lots of bits right, but I was also wrong just about as often. I was disappointed that I couldn’t pick out the roles people played in the trial from the word cloud. There was no way to guess who was the defendant, who was the victim, and without recognizing the names, who were the officers. I wasn’t even able to guess how many defendants were on trial. On the other hand, I did guess the type of crime, the verdict, and a few details and circumstances surrounding the arrest. Having said that, I can’t say I was confident in all of my conclusions, and I was often guessing.

And finally, two of the participants were way off (#1 and #2). These two attempted to reconstruct the narrative of what had happened, providing a level of detail that involved a good deal of guesswork.

What does this mean? I think it shows that when faced with a simplified visualization such as a word cloud, the process of getting back to the original is fraught with a level of guesswork. However, an expert in the source material can, with reasonable accuracy, reconstruct some of the more basic details of what’s going on.

How do we move forward? Well, as I pointed out earlier in this article, I think the secret is in moving away from the idea that tokens transmit ideas. Ideas and metaphors transmit ideas, and it would be far more useful to have an idea or concept cloud than one that focuses on individual tokens. But I also think it’s time that those ideas were linked back to the original data points, so that people interpreting the word clouds can test their assumptions. We are ready to see the distance between the underlying data and the visualization contracted. We’re ready to see the proof embedded in the graph. And I hope we continue to see a development in this trend.
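One simple way to link a token back to its source is a keyword-in-context listing, which shows every occurrence of a word alongside its neighbours. The sketch below is only an illustration of that idea under my own assumptions: the `keyword_in_context` function, its `width` parameter, and the sample sentence are all hypothetical, and this is not a feature of any existing word cloud tool.

```python
import re

def keyword_in_context(text, keyword, width=2):
    """Return each occurrence of `keyword` with `width` words on either side.

    Hypothetical helper: the keyword is upper-cased so it stands out.
    """
    tokens = re.findall(r"[A-Za-z'-]+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok.upper()] + right))
    return hits

# Invented sample sentence, echoing the "green" example earlier in this post.
sample = "Mr Green carried his green bowl across Green Park"
contexts = keyword_in_context(sample, "green")

for line in contexts:
    print(line)
```

A reader who clicked on "green" in a cloud and saw these three snippets could tell at once that the word stands for a man, a bowl, and a park.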

Thanks to my participants, Janice Turner, Bob Shoemaker, Louise Falcini, and Tim Hitchcock.

Thursday, August 1, 2013

Can you explain this graph to me? Peer Reviewing a Visualization

"For sale: Mixing bowl set designed to please a cook".

That opening sentence contains 10 words, or "tokens" as linguists often call them. Yet either in its spoken or written form, it really only transmits 4 ideas, or what I imagine Marc Alexander would call "metaphors", which are concepts that go beyond the words but that express meaning and understanding. They allow us to think in chunks.

What?: For sale
What's for sale?: a mixing bowl set
What's it like?: designed to please
Please whom?: a cook
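The gap between tokens and ideas can be made concrete with a short Python sketch. The token split is mechanical, but the grouping into ideas is simply the hand-made list above typed in as a dictionary; the names `tokens` and `ideas` are my own, and no program here is working out the chunks for itself.

```python
sentence = "For sale: Mixing bowl set designed to please a cook"

# Mechanical step: split on whitespace and strip trailing punctuation.
tokens = [word.strip(":.,") for word in sentence.split()]

# Human step: the four ideas, grouped by hand exactly as in the list above.
ideas = {
    "What?": "For sale",
    "What's for sale?": "Mixing bowl set",
    "What's it like?": "designed to please",
    "Please whom?": "a cook",
}

print(len(tokens), "tokens")
print(len(ideas), "ideas")
```

The ten tokens fall out of the split automatically, while the four ideas had to be supplied by a reader.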

The same sentence represents an attempt to conjure a very measured set of thoughts in another person. I can't take credit for the sentence, but when the author wrote it down, they hoped that you, dear reader, would understand those 4 ideas in the same way as all the other readers, and as they themselves understood them. It's their attempt to control your mind temporarily by drawing upon your understanding and memories associated with those 4 ideas. We may not get all the details exactly the same. Your mixing bowl set may be blue. Mine is seafoam and has a spout on each bowl to make it easier to pour your batter into the baking tin. So we likely haven't had exactly the same understanding of the sentence, but our understandings are almost certainly within the limits of what's acceptable to the author.

If we add 2 more ideas to the end of the sentence we end up with a failed conjuration:

"For sale: Mixing bowl set designed to please a cook with a round bottom for efficient beating".

Because of the misplaced modifier, there are now two ways to understand these ideas. Does the bowl have a round bottom for efficient beating, or should the cook who will enjoy the bowl be so proportioned?

Visualizations can offer the same ambiguity.

Is this an image of a rabbit, or a duck?

In this case, it's both, and it's that very ambiguity that the artist intended us to understand. Not all visualizations are intended to teach us something specific, or to so carefully conjure a series of ideas in our minds. That's wholly too modernist for some. Visualizations can be exploratory, used by researchers to come to a different understanding of their data by slicing it in lots of ways until they see something interesting. Or, as I demonstrated in an earlier post, they can be a quick way to get a distant look at a large amount of data by reducing it to something easier to digest. In that sense graphing can aid the discovery process of research even before the conclusions are ready to be shared with the world.

But when it comes to visualizations for academic publication, unintentional ambiguity is something we must strive to avoid. If done well, there should only be one proper way of interpreting the visualization. It's our job to create something that can conjure specific thoughts in the reader's head based on the graph's shape, colour, size, orientation, etc. And it should go without saying that those conjured thoughts should be grounded in rigorous research.

As academics we spend so much time and care on our prose, and even our footnotes. Usually (we hope) that prose comes out lucid and if we're lucky, is enjoyable to read. One of the ways we ensure that is through peer review. The editors help us find people who are willing to take the time to read what we've written and provide constructive feedback upon it.

Yet few of us feel we have the aptitude to offer similar feedback on visualizations. We're not visual artists and so we can be forgiven for using colour in confusing ways, or for thinking a pie chart with 100 categories is a good way to express an idea. As I mentioned previously, I'm quite confident that in the present climate, unique-looking or impressive visualizations will slip through peer review unchecked, lest the reviewer's lack of expertise in visualization be exposed by making a comment to the effect of "I don't understand this graph".

Now, far be it from me to suggest we only use column graphs or line graphs, or that we do X, but not Y. I think it's fantastic that so many people out there are pushing the boundaries of what we can achieve via visualization. The folks at the Guardian Data Blog do great work on bringing data to life, and their blog is a wonderful place for anyone seeking inspiration.

Instead, what I would suggest is that as creators of academic visualizations, we make sure our graphs are reviewed, even if our reviewers cannot or will not do so in the traditional peer review process.

The way I'd propose we do that is to show our friends and colleagues what we've made as often as we can, including during the drafting process. But it's not just about showing them. We have to ask the right questions. Let's use the graph below as a (relatively poor) example of a visualization that we might like to get feedback on. Please note that this is not a graph showing real data about the cost of grain in the 19th century. It's just an example.

Most of us likely want to ask "Do you like my graph?" or "What do you think of this?"

A more productive starting point is probably: Can you explain this graph to me? You aren't going to be there when your reader or viewer is interpreting your graph. The best way to find out what set of ideas is going to form in their mind is to ask them to explain their thought process out loud.

In this case, I had intended to show the seasonal difference in the price of grain in London and Edinburgh over a 20-year period. You may not have picked up on that, which means I need to fix something.

Don't be afraid to ask explicitly: Is there any element of this graph that you do not inherently understand? Make sure they can explain the labels on both axes (if relevant). If they don't know where you're getting those values from, you may need to rethink your axis labels. You'd be forgiven for asking what the numbers on the Y-axis represent in the example. I didn't label it, so how could you know?

When you start experimenting with your visualizations, you're bound to come up with ideas you think are clear, but that just don't translate into ideas that your reader can interpret. Looking at the sample graph, I wouldn't fault you for asking what the top and bottom line of the curves represent. They're supposed to be two line graphs: one representing Edinburgh prices, and one representing London. I've shaded in the space between the lines to emphasize the size of the gap. If this is in fact two lines, then which one is Edinburgh? Which one is London? And when they overlap, how do I know which bit corresponds to which line? Do they cross, or merely meet and diverge again? I haven't made the fact that this is a line graph obvious because the lines aren't distinguishable from the shape formed by the colours.

Speaking of colour, you'll want to make sure you haven't come up with a palette that is going to make interpreting your graph difficult for someone with colour blindness. There are many different forms of colour blindness, so it pays to run a test on your graph. You can do this online by using a "Colour Blindness Simulator" on your finished image.

Sticking with the negatives, ask your tester which element of the graph they like the least. For the sample graph, they may say they don't like the colours, or the font, or the legend. Personally, I think using --------> to represent arrows looks lazy. Everyone will have their own opinions on what's worst about your work. If you know what turns people off you can make visualizations that people like. And if they like the visualization, readers are more likely to engage with its message. With this in mind, go ahead and ask if they like your graph. Or if there are any elements of the graph that they particularly fancy.

Just as with your prose, it may take a few iterations and a number of different opinions from colleagues before a graph says to others what you think it says in your own mind. Just because you submitted a graph with your article and the peer-reviewers didn't comment on it doesn't mean you've done a good job of clearly expressing your ideas visually.

And one last question to ask, just to make sure your readers get the right message and aren't distracted: does the shape of the graph make it look like anything unrelated?

Graphs and visualizations have tremendous potential for expressing ideas in academic research, but it's not a skill we're typically taught in school. Most of us learn on the job, or emulate graphs we saw elsewhere that we found effective. Taking the time to ensure the graphs you create transmit the right ideas to your reader is good scholarship. Knowing the right questions to ask makes it that much easier to reach that result.

Questions to ask about a visualization:
  1. Can you explain this graph to me?
  2. Are there any elements you do not inherently understand?
  3. Can you explain what each axis shows (if applicable)?
  4. Will people with colour blindness be able to differentiate your colour palette? (check online)
  5. What do you like least about the graph?
  6. Do you like the graph / a particular element of the graph?
  7. Does the shape of the graph make it look like anything distracting?