Monday, August 5, 2013

Can We Reconstruct a Text from a Wordcloud?

We’ve all seen word clouds. Many of us have even wondered if they’re of any value. I have used word clouds in the past; I find them useful in presentations when I want to highlight the relative importance of certain words over others. For example, I often use the word cloud to the left to show the most common Irish surnames in the London area during the early 19th century. I hope my listeners will note that Murphy or Sullivan is more common than Burke or Foley, without my having to take the time to explain the connection between word size and significance.

I’ve also used word clouds in analysis. In a previous post I discussed how I was able to use the word cloud below to show the relative frequency of topics found in the Gentleman’s Magazine between 1800 and 1820, which gave me a pretty good idea of what the gentry and the middle class were interested in during that period.

I think both of those uses for word clouds have been productive. They’ve allowed me to transmit ideas and to formulate my own thoughts on a set of data effectively. But I began to think about other uses, and I began to wonder about the process of getting back to the original data. Word clouds take the individual words (tokens) out of context. As I mentioned in my last post, we think in metaphors, or ideas, not in words. That means a word cloud reduces a single idea such as “green bowl” into two tokens, “green” and “bowl”. It then merges every occurrence of “green” into a single graphic element sized by how often the word appears in the text. The program does not take into consideration the fact that “green” as it refers to a bowl is entirely different from Mr. Green or Green Park. An article about Mr. Green’s picnic in Green Park with his favourite green bowl might give you a skewed idea about the importance of the word green, here representing three completely different ideas, and in all three cases simply acting as a modifier to a more important concept (a man, a park, and a bowl).
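To make the point concrete, here is a minimal sketch of the kind of counting a word cloud program does. It is not the code behind any particular word cloud tool; the stop-word list, the function name and the example sentence are all my own assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real tools ship much longer ones.
STOP_WORDS = {"a", "in", "with", "his", "had", "mr"}

def token_counts(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

sentence = "Mr. Green had a picnic in Green Park with his favourite green bowl."
counts = token_counts(sentence)

# The man, the park and the colour of the bowl all collapse into one token.
print(counts["green"])  # 3
print(counts["bowl"])   # 1
```

The count of three for “green” is what would set its size in the cloud, even though no single idea in the sentence appears three times.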

Just for fun, I decided to do a test. I asked four colleagues, all experts on the criminal trial transcripts of the Old Bailey Online, to look at a word cloud of a trial. Each person was asked to describe what key information they could tell me about the crime. I was interested in knowing if they could tell me the who, what, when, where, and why details, and if they could reconstruct the basic building blocks from the prevalence of certain keywords. In the spirit of exploration, I played along as well and offered my own interpretation.

The word cloud was created at random by my wife without my knowledge of the trial. All I knew (and all the participants knew) was that the trial took place between 1801 and 1820, and was between 2,000 and 3,000 words long. The word cloud was limited to 75 unique words and common English words were removed. The resultant visualization looks like this:

I have colour-coded my assessment (blue) for aspects I got correct and (red) for the bits I got wrong. What struck me immediately was that this was a case involving theft. That’s a safe bet anyway, since about 50% of Old Bailey trials during this period were theft cases. It was the large number of nouns that led me to this conclusion. I know that trial transcripts always list the items that were stolen, and the testimony in the trial almost always discusses the various objects repeatedly as several witnesses are called to give their account of what transpired. In this case, I’d assume there was a large quantity of alcohol that went missing, ranging from red wine, to port, to gin – stored in bottles, measured by gallons, and in at least one case, a cask.

At the time it was stolen the booze was being stored in a cellar before it was transferred to a cart that was being driven by a horse, a mare to be specific. Why Restoration actress Nell Gwyn appears in the set of words, I have no idea, since she died over a century before this era, unless the name is a coincidence or perhaps refers to the name of a pub that lost its liquor.

A large number of people were involved in giving evidence against the defendant, including Messrs Hutt, Wells, Powell, Wood and Bagnigge, as well as possibly a Mr. Limbrick, and definitely someone named Hart – though again that may be the name of the pub. One of those men is likely the watchman and another an officer. Based on what I know about Old Bailey trials, this suggests there were a lot of witnesses, meaning the prosecutor was concerned that his case might not succeed.

The alcohol heist took place in the morning (and was perhaps discovered the following night), and the goods were then transferred down either Maiden-lane or City-road. The volume of goods stolen and the fact that death appears in the list suggest our defendant was found guilty and sentenced to death.

My conclusion: a pub named either Hart or Nell Gwyn located on City-road or Maiden-lane was robbed of a large volume of alcohol by a solo male defendant, who was found guilty and sentenced to death.


You can compare my assessment with the full trial transcript. It turns out I wasn’t that far off. There had been three defendants rather than one, but they were indeed found guilty and sentenced to death. It had been a large alcohol theft. I wasn’t able to pick out the fact that Messrs Wood, Powell, and Hart were the defendants, meaning the who aspect of the challenge had completely eluded me. I also didn’t recognize Bagnigge Wells, which was the location of the crime, not someone’s name. Nell Gwyn in this case was the pub that had its door pried open to reach the alcohol.

Participant 1

“Powell and Hutt were found guilty of breaking into the wine cellar of the Nell Gywnn inn in Bagnigge Wells (I know about this place). They were accused of stealing two gallons of wine, casks of gin and bottles of port from the cellar belonging to Mr Hart. A Watchmen Mr Wood on his round at 1 o'clock saw a broken iron (lock) on the wine cellar door and saw two men drive off in a horse and cart down Maiden lane towards City Road and immediately called for an officer. The men were stopped by an officer who examined the cart and found the cask of wine, gin and port belonging to Mr Hart and arrested them. They received the death penalty.”

Participant 2

“I would guess it involves stealing a hamper of goods including a bottle of wine from the back of a cart.  My suspicion is that the defendants were two women, and that the servant of the person who owned the cart/hamper discovered the theft and was called in evidence, though the actual owner wasn't.  The cart was on its way to or from northwest London to banigge wells, for a social occasion and some visiting, it happened at night (suggesting they were returning home) and there was a runner or 'officer' involved in the arrest.”


In these examples, the participants tried to be very specific about the details of the transcript, and in doing so were incorrect more often than not. The basics of the case, including the location of the crime, were, however, correct for Participant 1. This would suggest that expert readers are able to get some of the basics – though by no means all. However, that expertise does little to bring forth the specifics of the trial.

Participant 3

“This trial seems to have a richer vocabulary than most. It looks like a theft case from a wine cellar (wine, port, cask, gin, gallons, hampers, bottles, etc. suggest as much), presumably at Bagnigge Wells; actually the prominence of the word cart, and also the word horse, suggest the material might have been in the process of coming or going there, perhaps parked on the City Road (or Maiden Lane). Went suggests action. Some force was used: broken, crow, saw, which suggests that the cart was broken into. There is a certain amount of vocabulary indicating how the culprit might have been apprehended: officer, stopped, examined, watchman, observed, charge – this suggests that officers were used to apprehend the defendant. There are some names: Powell, Gwyn, Hart, Mr, Nell, William. The most frequently mentioned, Hart, is presumably the victim. Timing is important: o'clock, morning, night – the prominence of night suggests that that is when the crime took place, with the suspect arrested in the morning? Numbers indicate either the number of items stolen or the time of day (one or two in the morning).”


This one came out surprisingly accurate. The participant hadn’t recognized Nell Gwyn as the name of the actress. As in my own case, it proved impossible to figure out who was the defendant and who was the victim. However, the details of the crime and the process of apprehending the defendant are almost bang on. This participant didn’t try to reconstruct the narrative in the same way as Participant 1 or 2, and thus avoided many of the pitfalls. However, there was no guess at the verdict, and while the basics of the trial are here, the richness of what is actually recorded in the transcript is almost entirely lost.

Participant 4

“Geographical location
Wine Cellar –property crime
Bagnigge Wells – recognise this as the location of a rather seedy spa/pleasure grounds on the outskirts of Clerkenwell towards Kings Cross.
City Road – not too far away runs from the City to Islington
Maiden Lane – small street that runs parallel to Covent Garden Piazza on the south-side although there might be other Maiden Lanes
Time –mention of time o’clock and watchman (usually worked only at night) and night although mentions morning (in the or the next)
Crime probably burglary because of the mention of the relevance of time together with watchman
Stolen Goods: bottles of red wine, port, cask, gin from wine cellar
Broken into cellar with an iron bar, maybe crow bar
Probably took it away or hid it in a cart, certainly it is central to the plot and perhaps discovery of the stolen goods. There may well have been a significant amount of drink gallons, casks and bottles? Hampers – might have been used to transport/hide the goods
Arrested – yes (examined) and officer (probably from one of the Police Courts
Observed - spotted
Names Hart – think this is a personal name rather than pub sign and possibly Wood as it’s mentioned frequently but not as much as the things I think were actually stolen
Gwyn –forename Welsh? Ha – just spotted nell (same size) and I know that Nell Gywn was supposed to have lived at Banigge Wells. John, William, think Hutt is a personal name (not a shed)
Death –punishment (so guilty)”


This one was particularly interesting, because the participant included their thought process as it related to the various words on the visualization. The fact that it was clearly a theft of goods, and that time was mentioned, led this person to conclude we were dealing with a burglary, which was correct. The conclusion that Hart was the name of a person rather than a pub was correct, but equally could have been wrong. While Nell Gwyn may have lived at Bagnigge Wells, that wasn’t really relevant to this case.


Can an expert on a historical source reconstruct the details of that source from a word cloud? It would seem the answer is: sort of. Of the five participants, two (#3 and #4) did an incredibly good job of getting some of the details. These people were able to reason out what the words meant by drawing upon their experience with the ways certain types of words were used in criminal trial transcripts. However, in both cases they were light on the details and, it would appear, decided not to guess at elements of the crime that they couldn’t be confident about. That is, they spoke confidently when they were confident but otherwise stayed silent.

I think my analysis fell in the middle. I got lots of bits right, but I was also wrong just about as often. I was disappointed that I couldn’t pick out the roles people played in the trial from the word cloud. There was no way to guess who was the defendant, who was the victim, and, without recognizing the names, who were the officers. I wasn’t even able to guess how many defendants were on trial. On the other hand, I did guess the type of crime, the verdict, and a few details and circumstances surrounding the arrest. Having said that, I can’t say I was confident in all of my conclusions; much of it was guesswork.

And finally, two of the participants were way off (#1 and #2). These two attempted to reconstruct the narrative of what had happened, providing a level of detail that involved a good deal of guesswork.

What does this mean? I think it shows that when faced with a simplified visualization such as a word cloud, the process of getting back to the original is fraught with guesswork. However, an expert in the source material can, with reasonable accuracy, reconstruct some of the more basic details of what’s going on.

How do we move forward? Well, as I pointed out earlier in this article, I think the secret is in moving away from the idea that tokens transmit ideas. Ideas and metaphors transmit ideas, and it would be far more useful to have an idea or concept cloud than one that focuses on individual tokens. But I also think it’s time that those ideas were linked back to the original data points, so that people interpreting the word clouds can test their assumptions. We are ready to see the distance between the underlying data and the visualization contracted. We’re ready to see the proof embedded in the graph. And I hope we continue to see a development in this trend.
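One simple way of linking a token back to its data points is a keyword-in-context index, which keeps the surrounding words for every occurrence. The sketch below is only an illustration of that idea, not an existing tool; the function name, the two-word window and the example sentence are my own assumptions.

```python
import re
from collections import defaultdict

def keyword_in_context(text, window=2):
    """Map each lowercase token to the snippets of text in which it occurs."""
    words = re.findall(r"[A-Za-z']+", text)
    contexts = defaultdict(list)
    for i, word in enumerate(words):
        start = max(0, i - window)
        # Keep up to `window` words on either side of the token.
        snippet = " ".join(words[start:i + window + 1])
        contexts[word.lower()].append(snippet)
    return contexts

text = "Mr Green had a picnic in Green Park with his favourite green bowl"
index = keyword_in_context(text)

# Someone looking at "green" in the cloud could check each use for themselves.
for snippet in index["green"]:
    print(snippet)
# Mr Green had a
# picnic in Green Park with
# his favourite green bowl
```

With something like this sitting behind the cloud, a reader who assumed Hart was a pub could look at the snippets and test that assumption rather than guess.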

Thanks to my participants, Janice Turner, Bob Shoemaker, Louise Falcini, and Tim Hitchcock.
