I recently attended Bill Turkel's Workshop on APIs for the Digital Humanities held in Toronto and had the pleasure of coming face to face with many of the people who have created the historical data repositories I have used so enthusiastically.
What I came away with was an even stronger conviction that data in humanities repositories should be completely liberated. By that, I mean given away in their entirety.
The mass digitizations that have occurred recently have provided a great first step for researchers. I no longer need to fly to London and sit in a dark room sifting through boxes to look at much of the material I seek. But, I'm still - generally - unable to do with it what I like.
Many repositories contain scanned documents which have been OCR'd so that they are full text searchable, but that OCR is not shared with the end user, rendering the entire thing useless for those wanting to do a text analysis or mash up the data with another repository's content.
Most databases require the user to trust the site's search mechanisms to query the entire database an return all relevant results. If I'm doing a study, I'd prefer to do so with all the material available. Without access to the entire database on my hard drive, I have no way of verifying that the search has returned what I sought.
Many of those at the workshop who administered repositories were willing and eager to email their data at the drop of a hat, but that is not yet the norm. Most of my past requests for data have been completely ignored. When it comes to scholarly data, possessiveness leads to obscurity.
As humanists become increasingly confident programmers, many will define research projects based on the accessibility of the sources. Those who are giving their data away will end up cited in books and journal articles. Those desperate to maintain control will get passed by. If someone asks you for your data, think of it as a compliment of your work, then say yes.