Sunday, May 11, 2008

The Importance of MetaData on websites

As Bill Turkel noted in his recent blog post, I've been doing a summer internship, part of which involves making translators for a program called Zotero.

If you're a historian or a history student and you don't know what Zotero is, you should definitely look into it. It allows you to save and collect bibliographical information for just about anything you find on the internet, often with the click of a button.

For example, if you go to a webpage about the United Irishmen, you can use Zotero to save a snapshot of the page (kind of like a bookmark). You can attach notes to it, create tags to help you remember what the page is about, and add the author's name, date, publisher...just about whatever you want. You can then export that bibliographic data in proper Chicago/MLA/APA format and save yourself writing it all out.

Some pages are even easier to use. These are pages that Zotero has translators for. On these pages (Amazon.com, for example) a little icon will appear in the address bar. If you're looking at the entry for Harry Potter and the Philosopher's Stone, when you click the little icon Zotero automatically saves all the relevant bibliographical information for you. You don't have to type in a thing!

Unfortunately, these translators have to be made one by one. Each website needs its own translator. Because of this, only the most important historical repositories are currently supported.

You'll find them for websites such as JSTOR, Amazon, even the University of Western Ontario's library page. But there are several important (often Canadian) pages that are not yet working.

A translator works like this: a JavaScript program checks whether the webpage you're currently on is one that Zotero knows how to find bibliographic data on. This often means checking the website's address. For example:

If this webpage's address starts with www.canadiana.org then,
I should load the translator for Canadiana.org.
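
For the curious, that check might look roughly like this in JavaScript. This is only a sketch: the function name detectTranslator and the value it returns are made up for illustration, and this is not Zotero's actual translator interface.

```javascript
// Hypothetical sketch: decide whether a page belongs to Canadiana.org
// by looking at its address. Not Zotero's real translator API.
function detectTranslator(url) {
  // Parse the address so we compare the host name only, not the whole string.
  const host = new URL(url).hostname;
  if (host === "www.canadiana.org") {
    return "Canadiana.org";
  }
  // No translator is known for this site.
  return null;
}
```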


The next part is quite a bit trickier! Zotero is just a program. It doesn't know anything about what it is reading. We have to teach it how to recognize which piece of information on the screen is the title, which is the author, and so on. And I've noticed two distinct kinds of sites: those that provide this information in metadata and those that do not.

Metadata, for those who don't know, is helpful, clearly formatted information about your site. Go to any webpage, click on the "View" menu, and select "Page Source." If the website in question has metadata, you'll notice quite near the top several lines of code that read something like this:

<meta name="keywords" content="Adam Crymble, history">
<meta name="author" content="Adam Crymble">
<meta name="revised" content="spring 2008">

This essentially tells us that there are some keywords that you might find helpful in remembering what this website is about. They include "Adam Crymble, history".

We can also tell that the author is "Adam Crymble" and this website was last revised in "spring 2008."

If a webpage contains this data, it makes it MUCH easier for other people to use the data on your page. Zotero can easily be taught to recognize that the words after the meta name="author" tag should go in the bibliographical field "author." It is also quite easy to tell Zotero that words after the meta name="keywords" tag should be separated and made into "tags" which you can then use to organize your work.
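
To give a sense of how little code that takes, here is a rough sketch. It assumes doc is an ordinary DOM document, and both the function name readMetadata and the shape of the object it returns are invented for illustration; Zotero's own item format is different.

```javascript
// Hypothetical sketch: read the author and keywords from a page's meta tags.
// `doc` is assumed to be a DOM document; the returned object is illustrative
// only and is not Zotero's item format.
function readMetadata(doc) {
  const item = { author: "", tags: [] };
  const metas = doc.getElementsByTagName("meta");
  for (let i = 0; i < metas.length; i++) {
    const name = (metas[i].getAttribute("name") || "").toLowerCase();
    const content = metas[i].getAttribute("content") || "";
    if (name === "author") {
      item.author = content.trim();
    } else if (name === "keywords") {
      // Keywords are comma-separated; each one becomes its own tag.
      item.tags = content
        .split(",")
        .map((keyword) => keyword.trim())
        .filter((keyword) => keyword !== "");
    }
  }
  return item;
}
```

In a browser console, calling readMetadata(document) would run this against whatever page you happen to be viewing.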

However, many...rather, most webpages do not have very good (or any) metadata. In these cases, it takes extensive work to tell Zotero what it is looking at. Rather than simply associating one meta tag with one entry in Zotero, the person writing the translator must analyze your page's HTML code, figure out how your page is structured, and write a customized line of code called an XPath that looks something like this:

//div[@id="Content"]/div[@class="NormalRecord"]/table[@class="Bibrec"]/tbody/tr/td[2];

Don't worry, it looks like gobbledygook to me too. Each and every piece of data that Zotero wants to collect needs a custom-written XPath like this. This one would find the title of a book in Canadiana.org's repository.
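
For anyone who wants to see an XPath in use, here is a minimal sketch using the browser's standard document.evaluate function. The helper name textAtXPath is hypothetical, and doc is assumed to be a DOM document; a real translator does considerably more than this.

```javascript
// Hypothetical sketch: fetch the text that an XPath points at.
// `doc` is assumed to be a DOM document (for example `document` in a browser).
function textAtXPath(doc, xpath) {
  // 2 is the value of XPathResult.STRING_TYPE: return the match as a string.
  const STRING_TYPE = 2;
  const result = doc.evaluate(xpath, doc, null, STRING_TYPE, null);
  return result.stringValue.trim();
}

// The title XPath from the example above, without the trailing semicolon.
const titleXPath =
  '//div[@id="Content"]/div[@class="NormalRecord"]/table[@class="Bibrec"]/tbody/tr/td[2]';
```

Calling textAtXPath(document, titleXPath) on a matching record page would hand back the title as plain text, and every other field needs its own path written the same way.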

What could have been three lines of code, had there been metadata on the page, now requires dozens.

None of the three websites I have been working on translators for include metadata. In two of these cases, both well-known Canadian museums, the websites are almost brand new. They're visually stimulating and engaging. Yet the information is hidden in complex code and confusing paths.

In the 21st century, websites are not merely a static representation of one person's work, especially those that hold information for others to use, such as libraries, archives and repositories. Designing your webpage to incorporate metadata makes the information you have put out there easier for others to use. It makes people more likely to use it. And it encourages people like those who use Zotero to help your site stand out, with exciting add-ons that are changing the way people do internet research.