Being Open: Raw Data & Mapping the Scottish Reformation

In version 1.1 of our project website, we introduced a new feature: a button on each map that allows users to download the results of their chosen query in JSON format. While this seems like a small and rather specialist gesture, this change to our website underlined a fundamental aim of Mapping the Scottish Reformation: the desire for our data to be open for future researchers to use, without our input.

The download button that launched a thousand…datapoints?

We all know that digital projects in the humanities are often labour intensive. The initial effort to gather information, in our case from manuscripts, is time consuming and requires considerable skill (reading the handwriting, knowing what you’re looking for, being able to parse lots of information, etc.). Then there is the encoding of that data into structures that a machine can read and query. Until relatively recently, a great many digital projects in the humanities developed custom methods to deal with these problems, resulting in beautifully crafted resources. However, these methods often (though not always) have one flaw: they usually don’t play nicely with other projects. The data you can extract from the website or app is often limited to the types of questions originally envisaged by the project’s designers. These projects aren’t easily interoperable.

Think about it this way: when you wanted information from your favourite or most used online humanities resource, like a historical database, it is highly likely that you wrote down the data you found by hand or in a document in Microsoft Word or an equivalent. You might have taken a screen grab. This is your record. This is fine if you need a relatively small piece of information, but what if you want to leverage a significant proportion of the dataset behind that entire website? You can’t: you either have to laboriously (and manually) transcribe the data you need, or you are forced to contact the creators of the resource and ask for permission to access the raw dataset. The creators (if they’re still active) will then need to reframe their data so it works for your purposes.

It is our contention that the creators of digital projects in the humanities should think about interoperability when considering the legacy of their projects. Legacy is much more than just the longevity of your website or resource (“we need x years of hosting costs” or “we need to ensure that this website doesn’t break whenever there’s a major iOS or Android update”). Longevity needs to be understood as the life and utility of the data at the heart of your project, long after the research questions that created your project are obsolete.

This process of siloing data in custom repositories is not helpful to developing humanities projects. As part of our talk for the Northern Early Modern Network in September 2021, we showed how considering three key features can help make a project more sustainable by encouraging interoperability.

Interoperability slide from our presentation ‘Using Digital Tools to Explore Early Modernity’

First is to be open about data structures. Scholars building digital resources have spent hours considering how to encode information from manuscripts into databases: essentially, a process of making the complexities of these records machine readable. Unfortunately, up to now, projects have rarely shared how they structure their data collection, seeing it as something largely internal to the project. We have created standardised ways to capture complex data that will work in any project, right down to the references, and we write about this process on this blog.

Second, offer your data openly online. While most data from digital projects is available for free and not behind a paywall, it often cannot easily be scraped or extracted from the resource. For us, Wikidata does the heavy lifting: there is no need to create a custom database, it allows for quick prototyping of visualisations, and it interfaces nicely with other platforms and programs. Everything is uploaded under the Creative Commons Zero (CC0) dedication, meaning it is placed in the public domain and can be reused without restriction. The raw data of our research can be used by scholars for free long into the future and without the need to enter into lengthy negotiations with us about how to use it. Projects should explore similarly open data standards wherever possible.
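This openness is practical as well as principled. Wikidata’s query service returns results in the W3C SPARQL 1.1 JSON results format, which any language can consume. As a minimal sketch (the variable names and the minister/parish values below are invented for illustration, not drawn from our dataset), a few lines of Python can flatten such a payload into plain rows:

```python
import json

# A miniature payload in the standard SPARQL 1.1 JSON results format
# returned by the Wikidata Query Service; the values are invented.
sample = json.loads("""
{
  "head": {"vars": ["minister", "parish"]},
  "results": {"bindings": [
    {"minister": {"type": "literal", "value": "John Knox"},
     "parish":   {"type": "literal", "value": "Edinburgh"}},
    {"minister": {"type": "literal", "value": "Andrew Melville"},
     "parish":   {"type": "literal", "value": "Govan"}}
  ]}
}
""")

def flatten(results: dict) -> list[dict]:
    """Reduce SPARQL JSON bindings to plain {variable: value} rows."""
    variables = results["head"]["vars"]
    return [
        {var: row[var]["value"] for var in variables if var in row}
        for row in results["results"]["bindings"]
    ]

rows = flatten(sample)
print(rows[0]["minister"])  # John Knox
```

Because the format is a published standard rather than something bespoke to our project, the same few lines work for results from any SPARQL endpoint, not just ours.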

Third, data should be exportable. People searching through your digital resource should be given the option to download their results. Why? Well, many archives have poor or non-existent wifi, meaning scholars may need quick access to information offline. More important, however, at least for the purposes of this post, is that data available for download can be moved immediately into different applications. Is there a visualisation one of our users wants to see but that we haven’t created (perhaps for a teaching resource, presentation, or PhD thesis)? Fine, they can now create it with the raw data. Want to use a new technology that we couldn’t incorporate into our website? The exportable raw data can be repurposed for that new or emergent technology. Having data that can be downloaded in a way that suits the user can empower them to undertake a broad range of new projects that we could never have envisaged.
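To make the point concrete, here is a sketch of the kind of repurposing we mean. Given a downloaded export (the field names and records below are hypothetical, not our real schema), Python’s standard library alone can produce a tally a user might want for a chart we never built:

```python
from collections import Counter

# Hypothetical downloaded records; the field names and values are
# invented for illustration and will differ in a real export.
records = [
    {"minister": "A", "presbytery": "Stirling"},
    {"minister": "B", "presbytery": "Stirling"},
    {"minister": "C", "presbytery": "Dunblane"},
]

# Ministers per presbytery: a summary ready to feed into any
# plotting library, spreadsheet, or teaching slide.
counts = Counter(rec["presbytery"] for rec in records)
print(counts.most_common())  # [('Stirling', 2), ('Dunblane', 1)]
```

The user needs nothing from us to do this: the downloaded file is enough.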

Data should be exportable. Here is a downloaded search from our website in JSON format.

We still have a way to go before we achieve all of these lofty goals as well as we would like. For example, our current website only allows results to be downloaded as JSON, which limits its use to certain applications (TSV or CSV would suit a wider range of uses, and a simple HTML list would be much easier for non-specialist users). Similarly, we could do more to publicise how we structure our data on Wikidata to promote the availability of our dataset. Nor are we the first to consider these issues: colleagues at the University of Edinburgh revived data from the Survey of Scottish Witchcraft (gathered between 2000 and 2003) and repurposed it using many of these techniques (we owe a great deal to them).
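The conversion gap is, at least, a small one. As a sketch of how light the step from JSON to a tabular format would be (the field names and values below are invented, not our real export), Python’s standard library can rewrite a JSON export as TSV in a few lines:

```python
import csv
import io
import json

# A hypothetical JSON export; real field names will differ.
exported = json.loads(
    '[{"minister": "John Row", "parish": "Carnock"},'
    ' {"minister": "David Lindsay", "parish": "Leith"}]'
)

# Write the records out as tab-separated values.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=exported[0].keys(), delimiter="\t")
writer.writeheader()
writer.writerows(exported)
tsv = buffer.getvalue()
print(tsv.splitlines()[0])  # minister	parish
```

Swapping the delimiter to a comma yields CSV, which opens directly in any spreadsheet program, exactly the kind of non-specialist audience we want the data to reach.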

Open standards come with some risk, but we would argue that the benefits outweigh it. We must maintain robust editorial standards for our work, so we need to ensure quality control of our data. Similarly, acknowledging the original labour put into gathering the data becomes more difficult the further one moves away from traditional citation techniques like footnotes or endnotes. Nevertheless, sharing knowledge about data structuring and giving access to raw data in some form can enhance the interoperability of a project.

Ultimately, considering interoperability at the start of your project may extend the longevity of all of that effort in collecting data in the first place.

Chris R. Langley