
A Blockchain For Archives: Trust Through Technology

At a time when the fragility and vulnerability of digital records are increasingly evident, maintaining the trustworthiness of public archives is more important than ever.

Video and sound recordings can be manipulated to put words into the mouths of people who never said them, photographs can be doctored, content can be added to or removed from videos, and AI technology has recently “written” news articles that mimic any writer’s style. All of these media, and many other “born-digital” formats, will come to form the public record. If archives are to remain an essential resource for democracy, able to hold governments to account, the records they hold must be considered trustworthy.

But is this really a problem for archives?

Until recently, this has not been a concern for archives. People trust archives, especially public archives. We are seen as experts, preserving and providing access to our holdings freely and over a lengthy period (since 1838 in the case of The National Archives in the UK). We could rest on our laurels. But the challenges that digital technologies bring to our practice have led us to question whether this institutional or inherited trust is enough when faced with the forces of fakery that have emerged in the 21st century.

In 2017, The National Archives of the UK partnered with the Centre for Vision, Speech and Signal Processing (CVSSP) at the University of Surrey and Tim Berners-Lee’s non-profit Open Data Institute to research how a new technology could be harnessed to serve on the side of archives. The ARCHANGEL project is investigating how blockchain can provide a genuine guarantee of the authenticity of the digital records held in archives: a way of publicly demonstrating our trustworthiness by proving that those records are authentic and unchanged.

Often considered synonymous with Bitcoin, blockchain is the technology that underpins a number of digital currencies, but it has the potential for far wider application. At root, it is the digital equivalent of a ledger: like a database, but with two features that set it apart from standard databases. First, a blockchain is append-only, meaning that data cannot be overwritten, amended or deleted; it can only be added. Second, it is distributed: no central authority or organisation has sole possession of the data. Instead, a copy of the whole database is held by each member of the blockchain, and the members collaborate to validate each new block before it is written to the ledger. As a result, there is no centralised authority in control of the data, and each participant has an equal status in the network: equal responsibility, equal rights and an equal stake.
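To make the append-only idea concrete, here is a minimal sketch in R (using the digest package) of a hash-linked ledger, in which each block carries the hash of the block before it. It illustrates only the underlying data structure; a real blockchain such as ARCHANGEL’s adds distributed consensus among the members of the network on top:

    library(digest)  # cryptographic hash functions

    # Append a block whose hash covers its payload and the previous block's hash
    new_block <- function(chain, payload) {
      prev_hash <- if (length(chain) == 0) "0" else chain[[length(chain)]]$hash
      block <- list(payload = payload, prev_hash = prev_hash)
      block$hash <- digest(paste(payload, prev_hash), algo = "sha256")
      c(chain, list(block))
    }

    # Recompute every hash; any alteration to an earlier block breaks the chain
    chain_is_valid <- function(chain) {
      for (i in seq_along(chain)) {
        b <- chain[[i]]
        prev <- if (i == 1) "0" else chain[[i - 1]]$hash
        if (!identical(b$prev_hash, prev)) return(FALSE)
        if (!identical(b$hash, digest(paste(b$payload, b$prev_hash), algo = "sha256"))) return(FALSE)
      }
      TRUE
    }

    chain <- new_block(list(), "reference and checksum for record A")
    chain <- new_block(chain, "reference and checksum for record B")
    chain_is_valid(chain)              # TRUE
    chain[[1]]$payload <- "tampered"   # try to rewrite data already on the ledger...
    chain_is_valid(chain)              # ...and the change is immediately detectable: FALSE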

As with any new technology, there are issues to be researched and resolved. The most common criticism of blockchains is that a majority of participants (the so-called 51% attack) could collude to change the data written on the blockchain. This is less likely in the case of ARCHANGEL because it is a permissioned blockchain: every member has been invited and their identity is known, unlike the Bitcoin network, where many of the participants are anonymous.

A more practical issue arose early on: what information could be shared on an immutable, publicly available database to prove that records were unchanged from the point of receipt by the archives? Every public archive holds records closed due to their sensitive content. This sensitivity sometimes extends to their filenames or descriptions, so adding those metadata fields to the blockchain would not be appropriate. We settled on a selection of fields that included an archival reference and the checksum: an alphanumeric string, generated by a mathematical algorithm, that changes completely if even one byte of the file is altered. In this way, a researcher can compare the checksum of the record they download against the checksum on the blockchain (written when the record was first received, potentially many years previously) and see for themselves that the two match. As archives sometimes convert formats in order to preserve or present records to the public, the project has also developed a way of generating a checksum based on the content of a video file rather than its bytes. This enables the user to check that the video has not been altered for unethical reasons while in the archive’s custody.
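In practice, that comparison is a short computation. A sketch in R using the digest package, with a hypothetical file name and an illustrative checksum value standing in for the one read from the blockchain:

    library(digest)

    # Illustrative value only: the checksum written to the blockchain when
    # the record was first received by the archive
    recorded_checksum <- "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

    # SHA-256 checksum of the copy the researcher has just downloaded
    downloaded_checksum <- digest("downloaded_record.pdf", algo = "sha256", file = TRUE)

    if (identical(downloaded_checksum, recorded_checksum)) {
      message("Checksums match: the record is unchanged since receipt.")
    } else {
      message("Checksum mismatch: this is not the file the archive received.")
    }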

So, the ARCHANGEL blockchain enables an archive to upload metadata that uniquely identifies specific records, have that data sealed into a “block” that cannot be altered or deleted without detection, and share a copy of the data with each of the other trusted members of the network for as long as the archives (some of the oldest organisations in the world) maintain it.

In the prototype testing, we found that the key to engaging other archives is in emphasising the shared nature of the network. Only by collaborating with partners can the benefits of an archival blockchain be realised by any of us. It is blockchain’s distributed nature that underpins the trustworthiness of the system; that enables it to be more reliable, more transparent and more secure, and therefore effective in providing a barrier against the onslaught of synthetic content.

At the same time, the effort the organisations have put into making the prototype work is itself evidence of their trustworthiness: in wanting to share the responsibility for proving the authenticity of the records they hold, they demonstrate their expertise and honesty.

This arms race with the forces of fakery is the reason why The National Archives is thinking about trust. We do not want people to trust archives only because of their longevity and expertise; we want to demonstrate their trustworthiness. We want to provide what Baroness Onora O’Neill, in the 2002 BBC Reith Lectures, said was needed:

“In judging whether to place our trust in others’ words or undertakings, or to refuse that trust, we need information and we need the means to judge the information.” O’Neill, A Question of Trust

This is what we think blockchain gives us as a profession: by being part of a network of trusted organisations which assure the authenticity of each other’s records, we demonstrate the trustworthiness of all of our records.

Acknowledgements

The ARCHANGEL Project would like to acknowledge the funding received from EPSRC Grant Ref EP/P03151X/1.

Copyright

Header image: ‘Crown copyright 2019 courtesy of The National Archives’

Further details:

The project website is here: https://www.archangel.ac.uk/

For a more detailed paper about the project see: https://arxiv.org/pdf/1804.08342.pdf

The journey from a records management system to a digital preservation system

“People have had a lot of trouble getting stuff out of RecordPoint.”

This sentence was a little worrying to hear. It was 2015, and our archive was contemplating digital preservation for the first time. We didn’t really know what it was, or how it worked. Neither did anyone else: the idea of having a “digital preservation system” received blank stares around the office. “Is it like a database? Why not use one of our CMSs instead? Why do we need this?”

And so it was that I realised I was in over my head and needed outside help. I looked up state records offices to find out what they were doing, and realised there is such a thing as the job title “Digital Preservation Officer”. I contacted one of these “Digital Preservation Officers” to get on the right path.

The Digital Preservation Officer’s knowledge in that early conversation was invaluable, and helped us get over those early hurdles. She explained the basics: why digital preservation is important for an archive; how to get started; how to break down the jargon; how to convince non-archivists that yes, it is necessary; and the importance of figuring out what you want to preserve.

“We will need to preserve digital donations,” I listed, “and digitizations of our physical inventory. Plus, I manage our digital records management system, RecordPoint – if we’re serious about our permanent records we will need to preserve those as well.” (The international digital records management system standard, ISO 16175 Part 2, says that “long-term preservation of digital records… should be addressed separately within a dedicated framework for digital preservation or ‘digital archiving’”.)

It was at this point that the Digital Preservation Officer replied with the quote that began this article.

I don’t think she was quite right – getting digital objects and metadata out of RecordPoint turned out to be easy. The real challenge would be getting the exported digital objects into our digital preservation system, Archivematica.

In the sketch below, the folders on the left represent the top level of a RecordPoint export of two digital objects. The folders on the right are what Archivematica expects in a transfer package.
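(A rough reconstruction; the exact folder names on the RecordPoint side are illustrative.)

    RecordPoint export                Archivematica transfer package
    ------------------                ------------------------------
    binaries_1/                       objects/
    binaries_2/                       metadata/
    binaries_3/                           metadata.csv
    records_1/
    records_2/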

In the example above, there are three folders for ‘binaries’ (digital objects) and two folders for ‘records’ (metadata). Immediately something doesn’t make sense – why are there three binary folders for two objects?

The reason is that the export includes not only the final version of the digital object but also all previous drafts. In my example there is only a single draft, but if a digital object had 100 drafts, they would all be included here. This is great for compliance, but not so great for digital preservation, where careful appraisal is necessary. The priority when doing an ‘extract, transform, load’ (ETL) from RecordPoint to Archivematica would be to ensure that the final version of each binary made it across to the ‘objects’ folder on the right.

An Archivematica transfer package should not consist only of the digital objects themselves, of course – you are not truly preserving digital objects unless you also preserve their descriptive metadata. This is why the ‘metadata’ folder on the right exists: you can optionally create a single CSV file, ‘metadata.csv’, which contains the metadata for every digital object in the submission as a separate line. Archivematica uses this CSV file as part of its metadata preservation process.
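For illustration, a minimal metadata.csv might look like the following (the file names and Dublin Core values are invented; each row is matched to a digital object through the ‘filename’ column):

    filename,dc.title,dc.date,dc.description
    objects/annual_report_2014.pdf,Annual Report 2014,2014,Final approved version
    objects/board_minutes_2015.docx,Board minutes,2015-03-12,Minutes of the March meeting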

In contrast, RecordPoint creates a metadata file for every one of the digital objects it exports. If you wanted to pull metadata across into the single CSV file for the Archivematica submission, you would need to go through every metadata XML in the export and copy and paste each individual metadata element. Based on a test, sorting the final record from the drafts and preparing its metadata for Archivematica might take two to four minutes per record. Assuming we have 70,000 records requiring preservation, transforming them all manually would take somewhere between 2,300 and 4,700 hours. Although technically possible, this is too much work to be achievable, and the tedious, detail-oriented nature of the task would make errors likely.

Fortunately, I knew the R programming language. R is used by statisticians to solve data transformation problems – and this was a data transformation problem! I created an application using a tool called R Shiny, providing a graphical interface that sits on the Archivematica server. I creatively called it RecordPoint Export to Archivematica Transfer (RPEAT). After running a RecordPoint export, you select the export to be transformed from a drop-down list in RPEAT and select the metadata to be included from a checklist. RPEAT then copies the final version of each digital object from the export into an ‘objects’ folder and trawls through each XML file to extract the required metadata. Finally, RPEAT creates a CSV file that contains all of the required metadata, and moves it into the ‘metadata’ folder. Everything is then ready for transfer into Archivematica.
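The core of that transformation can be sketched in a few lines of R. This is a simplified, hypothetical version of what RPEAT does – the folder layout and XML element names below are assumptions for illustration, not the real RecordPoint schema:

    library(xml2)   # XML parsing

    export_dir   <- "RecordPoint_export"        # assumed export location
    transfer_dir <- "archivematica_transfer"    # transfer package being built
    dir.create(file.path(transfer_dir, "objects"), recursive = TRUE)
    dir.create(file.path(transfer_dir, "metadata"), recursive = TRUE)

    # Find every per-record metadata XML in the export
    record_xmls <- list.files(export_dir, pattern = "\\.xml$",
                              recursive = TRUE, full.names = TRUE)

    # Keep only final versions (skipping drafts) and build one
    # metadata.csv row per digital object; element names are illustrative
    rows <- lapply(record_xmls, function(path) {
      doc <- read_xml(path)
      if (xml_text(xml_find_first(doc, "//Status")) != "Final") return(NULL)
      data.frame(
        filename = file.path("objects", xml_text(xml_find_first(doc, "//FileName"))),
        dc.title = xml_text(xml_find_first(doc, "//Title")),
        dc.date  = xml_text(xml_find_first(doc, "//DateCreated")),
        stringsAsFactors = FALSE
      )
    })

    # rbind silently drops the NULLs left by the drafts
    metadata <- do.call(rbind, rows)
    write.csv(metadata, file.path(transfer_dir, "metadata", "metadata.csv"),
              row.names = FALSE)
    # Copying the final version of each binary into objects/ works the same
    # way, keyed on the same XML metadata, using file.copy() (omitted here).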

Pushing 212 records exported from RecordPoint through RPEAT, selecting the correct metadata from the checklist, and doing some quick human quality assurance took 7 minutes. Scaled up, transforming all 70,000 records this way would take fewer than 39 hours: RPEAT reduces the time taken to prepare records for Archivematica by more than 98% compared to the manual process.

The advice the Digital Preservation Officer provided all those years ago was invaluable, and the warning about “getting stuff out of RecordPoint” was particularly pertinent – but I wish to expand on her point. The challenge is not unique to RecordPoint; the challenge is ETL in general. At a meeting of Australia and New Zealand’s digital preservation community of practice, Australasia Preserves, in early 2019, other archivists shared their struggles with ETL from records management systems into their digital archives. The ability to do ETL is an important addition to the growing suite of technical skills valuable to digital preservation practitioners.

References

International Organisation for Standardisation. (2011). Information and documentation — Principles and functional requirements for records in electronic office environments — Part 2: Guidelines and functional requirements for digital records management systems (ISO 16175-2). Retrieved from https://www.saiglobal.com/.

Header image

Artem Sapegin on Unsplash