We’ve all seen Word Clouds. Many of us have even wondered if they’re
of any value. I have used word clouds in the past; I find them useful in
presentations when I want to highlight the relative importance of certain words
over others. For example, I often use this word cloud to the left, to show the
most common Irish surnames in the London area during the early 19th
century. I hope my listeners will note that Murphy or Sullivan is more common
than Burke or Foley, without me having to take the time to explain the
connection between word-size and significance.
I think both of those uses for word clouds
have been productive. They’ve allowed me to transmit ideas, and formulate my own
thoughts on a set of data in an effective manner. But I began to think about
other uses, and I began to wonder about the process of getting back to the
original data. Word clouds take the individual words (tokens) out of context.
As I mentioned in my last post, we think in metaphors, or ideas. Not in words.
That means a word cloud reduces a single idea such as “green bowl” into two
tokens “green” and “bowl”. It then merges every occurrence of “green” into a single
graphic element sized by how often the word appears in the text. The program does not take
into consideration the fact that “green” as it refers to a bowl is entirely
different than Mr. Green or Green Park. An article about Mr. Green’s picnic in
Green Park with his favourite green bowl might give you a skewed idea about the
importance of the word green, here representing three completely different
ideas, and in all three cases simply acting as modifiers to more important concepts (a man, a park, and a bowl).
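To make the point concrete, here is a minimal sketch of the kind of counting a word cloud program does. The sentence, the `tokenize` helper, and the variable names are my own illustration and are not taken from any particular word cloud tool. It counts single tokens the way a word cloud would, then counts adjacent word pairs to show how much context the single-token count throws away.

```python
import re
from collections import Counter

# Hypothetical sentence built from the example above.
text = "Mr. Green took his favourite green bowl to a picnic in Green Park."

def tokenize(s):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", s.lower())

tokens = tokenize(text)

# What a word cloud sees: one count per token, with no context.
counts = Counter(tokens)

# Pairing each token with its right-hand neighbour keeps a little context.
bigrams = Counter(zip(tokens, tokens[1:]))
green_pairs = sorted(pair for pair in bigrams if "green" in pair)

print(counts["green"])
for pair in green_pairs:
    print(" ".join(pair))
```

The single count lumps the man, the park, and the bowl together, while the pairs keep them apart. Adjacent pairs are only one crude way to hold on to context, and a real concept cloud would need something smarter.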
Just for fun, I decided to do a test. I
asked four colleagues, all experts on the criminal trial transcripts of the Old
Bailey Online, to look at a word cloud of a trial. Each person was asked to
describe what key information they could tell me about the crime. I was
interested in knowing if they could tell me the who, what, when, where, why
type details, and if they could reconstruct the basic building blocks from the
prevalence of certain keywords. In the spirit of exploration, I played along as
well and offered my own interpretation.
The word cloud was created at random by my
wife without my knowledge of the trial. All I knew (and all the participants
knew) was that the trial took place between 1801 and 1820, and was between
2,000 and 3,000 words long. The word cloud was limited to 75 unique words and
common English words were removed. The resultant visualization looks like this:
I have colour-coded my assessment (blue)
for aspects I got correct and (red) for the bits I got wrong. What struck me
immediately was that this was a case involving theft.
That’s a safe bet anyway, since about 50% of Old Bailey trials during this
period were theft cases. It was the large number of nouns that led me to this
conclusion. I know that trial transcripts always list the items that were
stolen, and the testimony in the trial almost always discusses the various
objects repeatedly as several witnesses are called to give their account of
what transpired. In this case, I’d assume there was a large quantity of alcohol
that went missing, ranging from red wine, to port, to gin – stored in bottles, measured
and in at least one case: a cask.
At the time it was stolen the booze was
being stored in a cellar before it was transferred to a cart that was being driven by a horse – a
be specific. Why Restoration actress Nell Gwyn appears in the set of words, I have no
idea since she died over a century before this era, unless the name is a
coincidence or perhaps refers to the name of a pub
that lost its liquor.
There were a large number of people
involved in giving evidence against the defendant, including Messrs Hutt,
as well as possibly a Mr. Limbrick, and definitely someone named Hart –
though again that may be the name of the pub.
One of those men is likely the watchman and another an officer. Based on what I know
about Old Bailey trials, this suggests there were a lot of witnesses, meaning the
prosecutor was concerned that his case might not succeed.
The alcohol heist took place in the morning (and
was perhaps discovered the following night), and the goods were then transferred down
either Maiden-lane or City-road. The volume of goods
stolen, together with the fact that death appears
in the list, suggests our defendant was found guilty
and sentenced to death.
My conclusion: a pub named either Hart or Nell Gwyn
located on City-road or Maiden-lane was robbed of a large volume of alcohol by a solo male defendant, who was found
guilty and sentenced to death.
You can compare my assessment with the full trial transcript.
It turns out I wasn’t that far off. There had been three defendants, but they
were found guilty and sentenced to death. It had been a large alcohol theft. I
wasn’t able to accurately pick out the fact that Messrs Wood, Powell, and Hart
were the defendants, meaning the who aspect
of the challenge had completely eluded me. I also didn’t recognize Bagnigge Wells, which was the location
of the crime, not someone’s name. Nell
Gwyn in this case was the pub that had its door pried open to reach the
“Powell and Hutt were
found guilty of breaking into the wine cellar
of the Nell Gywnn inn in Bagnigge Wells (I
know about this place) They were accused of stealing two gallons of wine, casks of gin and bottles of port
from the cellar belonging to Mr Hart
. A Watchmen Mr Wood on his round at 1 o'clock saw a broken iron
(lock)on the wine cellar door and saw two men drive off in a horse and cart down Maiden lane towards City Road and
immediately called for an officer. The men were stopped by an officer who
examined the cart and found the cask of wine, gin and
port belonging to Mr Hart and arrested them. They received the death penalty.”
“I would guess it involves stealing a hamper of goods including a bottle of wine from the
back of a cart. My suspicion is that the
defendants were two women, and that the servant
of the person who owned the cart/hamper discovered the theft and was called in
evidence, though the actual owner wasn't.
The cart was on its way to or from northwest
London to banigge wells, for a social occasion and some visiting, it
happened at night (suggesting they were
returning home) and there was a runner or 'officer'
involved in the arrest.”
In these examples, the
participants tried to be very specific about the details of the transcript, and
in doing so were incorrect more often than not. The basics of the case,
including the location of the crime, were, however, correct for participant 1.
This would suggest that expert readers are able to get some of the basics –
though by no means all. However, that expertise does little to bring forth the
specifics of the trial.
“This trial seems to have a richer
vocabulary than most. It looks
like a theft case from a wine cellar (wine, port,
cask, gin, gallons, hampers, bottles, etc. suggest as much), presumably
at Bagnigge Wells; actually the prominence
of the word cart, and also the word horse, suggest
the material might have been in the process or coming or going there,
perhaps parked on the City Road (or Maiden Lane).
Went suggests action. Some force
was used: broken, crow, saw, which suggests
that the cart was broken into. There is a certain amount of
vocabulary indicating how the culprit might have been apprehended: officer, stopped, examined, watchman, observed,
charge—this suggests that officers were used to apprehend the defendant.
There are some names: Powell, Gwyn, Hart, Mr, Nell, William.
The most frequently mentioned, Hart, is
presumably the victim. Timing is
important: o'clock, morning, night—the prominence
of night suggests that that is when the crime took place, with the
suspect arrested in the morning? Numbers indicate either the number of
items stolen or the time of day (one or two in the morning).”
This one came out surprisingly accurate.
The participant hadn’t recognized Nell Gwyn as the name of the actress. As in
my own case, it proved impossible to figure out who was the defendant and who
was the victim. However, the details of the crime and the process of
apprehending the defendant are almost bang on. This participant didn’t try to
reconstruct the narrative in the same way as Participant 1 or 2, and thus
avoided many of the pitfalls. However, there was no guess at the verdict,
and while the basics of the trial are here, the richness of what actually is
recorded in the transcript is nearly entirely lost.
[Bagnigge Wells] – recognise this as the location of a rather seedy spa/pleasure
grounds on the outskirts of Clerkenwell
towards Kings Cross.
[City Road] – not too far away runs from the City to Islington
[Maiden Lane] – small street that runs parallel to Covent Garden Piazza on the
south-side although there might be
other Maiden Lanes
Time –mention of time o’clock and watchman (usually worked only at
and night although mentions morning (in
the or the next)
Crime probably burglary
because of the mention of the relevance of time
together with watchman
Stolen Goods: bottles
of red wine, port, cask, gin from wine cellar
Broken into cellar with
an iron bar, maybe crow bar
Probably took it away or
hid it in a cart, certainly it is central to the plot and
perhaps discovery of the
stolen goods. There may well have been a significant
amount of drink gallons,
casks and bottles? Hampers – might have been used
to transport/hide the
Arrested – yes (examined)
and officer (probably from one of the Police
Observed - spotted
– think this is a personal name rather than pub sign and possibly
as it’s mentioned frequently but not as much as the things I think were
Gwyn –forename Welsh? Ha – just spotted
nell (same size) and I know that
Nell Gywn was supposed to
have lived at Banigge Wells. John,
is a personal name (not a shed)
Death –punishment (so
This one was particularly interesting,
because the participant included their thought process as it related to the
various words on the visualization. The fact that it was clearly a theft of
goods, and that time was mentioned, led this person to conclude we were
dealing with a burglary, which was correct. The conclusion that Hart was the name of a person rather
than a pub was correct, but equally could have been wrong. While Nell Gwyn may
have lived at Bagnigge Wells, that wasn’t really relevant to this case.
Can an expert on a historical source
reconstruct the details of that source from a word cloud? It would seem the
answer is: sort of. Of the five participants, two (#3 and #4) did an incredibly
good job of getting some of the details. These people were able to reason out
what the words meant by drawing upon their experience with the ways certain
types of words were used in criminal trial transcripts. However, in both cases
they were light on the details and, it would appear, decided not to guess at
elements of the crime that they couldn’t be confident in. That is, they spoke
confidently when they were confident but otherwise stayed silent.
I think my analysis fell in the middle. I
got lots of bits right, but I was also wrong just about as often. I was
disappointed that I couldn’t pick out the roles people played in the trial from
the word cloud. There was no way to guess who was the defendant, who was the
victim, and without recognizing the names, who were the officers. I wasn’t even
able to guess how many defendants were on trial. On the other hand, I did guess
the type of crime, the verdict, and a few details and circumstances surrounding
the arrest. Having said that, I can’t say I was confident in all of my
conclusions; in places I was simply guessing.
And finally, two of the participants were
way off (#1 and #2). These two attempted to reconstruct the narrative of what
had happened, providing a level of detail that involved a good deal of
What does this mean? I think it shows that
when faced with a simplified visualization such as a word cloud, the process of
getting back to the original is fraught with guesswork. However, an
expert in the source material can, with reasonable
accuracy, reconstruct some of the more basic details of what’s going on.
How do we move forward? Well, as I pointed
out earlier in this article, I think the secret is in moving away from the idea
that tokens transmit ideas. Ideas and metaphors transmit ideas, and it would be
far more useful to have an idea or concept cloud than one that focuses on
individual tokens. But I also think it’s time that those ideas were linked back
to the original data points, so that people interpreting the word clouds can
test their assumptions. We are ready to see the distance between the underlying
data and the visualization contracted. We’re ready to see the proof embedded in
the graph. And I hope we continue to see a development in this trend.
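As a rough illustration of what linking a cloud back to its data could look like, here is a minimal keyword-in-context sketch. The function names `build_index` and `contexts`, the `width` parameter, and the sample sentence are all my own assumptions for the example, not part of any existing word cloud tool. The idea is simply to store where each token occurs, so that selecting a word in the cloud could bring up the passages it came from.

```python
import re
from collections import defaultdict

def build_index(text):
    """Map each lowercase token to the character offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"[A-Za-z]+", text):
        index[match.group().lower()].append(match.start())
    return index

def contexts(text, index, word, width=20):
    """Return a snippet of surrounding text for each occurrence of word."""
    snippets = []
    for start in index.get(word.lower(), []):
        lo = max(0, start - width)
        hi = min(len(text), start + len(word) + width)
        snippets.append(text[lo:hi].strip())
    return snippets

# Hypothetical sample text, not taken from the trial transcript.
sample = "Mr. Green took his favourite green bowl to a picnic in Green Park."
index = build_index(sample)
for snippet in contexts(sample, index, "green"):
    print(snippet)
```

With something like this behind the visualization, a reader who assumes “green” is a colour can check that guess against the three passages in a few seconds, rather than reasoning from word size alone.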
Thanks to my participants, Janice Turner, Bob Shoemaker, Louise Falcini, and Tim Hitchcock.