“Data archives have been around for some time, but they are more relevant nowadays than ever”. Interview with the Social Sciences Data Archive project

The inter-institutional project SODA (Social Sciences Data Archive) aims to develop a prototype for a data archive as Belgian representative in the Consortium of European Social Science Data Archives (CESSDA) and beyond.

To learn more about SODA, Archivoz’s Llarina González spoke with researchers from the project:

Benjamin Peuch, researcher on information science, manages Dataverse and studies the needs of researchers, archivists and historians for the correct custody and description of digital objects.
Freya De Schamphelaere, legal researcher, examines the role of the data archive in the Belgian context and the impact of the Directive on copyright and open data.
Jean-Paul Sanderson and Laura Van den Borre, demographers, are the link between SODA and social scientists.

(Archivoz) How does SODA operate in Belgium’s complex institutional landscape?

Since 1970, Belgium has been going through an institutional evolution from a unitary entity to a federal State with three regions and three linguistic (non-corresponding) communities. The competences of the State were redistributed between those six entities. In this context, the State Archives remained part of the federal administration and became a scientific research institution.

SODA therefore combines two approaches: a centralised perspective, in which a data archive represents Belgium as a whole within CESSDA; and a decentralised perspective, in which the data archive of the State Archives caters to researchers and affiliated institutions at the federal level as well as to those at the level of the communities and universities, even though universities and communities are developing their own institutional repositories.

SODA would thus be only one actor, although an international one as the CESSDA representative, in a network of Belgian repositories. Cooperation between the different Belgian actors will be key to making research data findable for researchers from all backgrounds.

(Archivoz) What is the role of the State Archives of Belgium in SODA?

SODA originated in the world of social sciences at the Dutch-speaking Vrije Universiteit Brussel and the French-speaking Université Catholique de Louvain. The State Archives were brought in for their expertise in archival science and eventually became the coordinating institution, with two full-time researchers tasked with creating the deliverables of the project.

The State Archives investigate issues such as possible business models and legal entities of the future data archive, metadata and data quality requirements, transfer agreements with depositors, and so on. Universities link SODA to the research community by surveying the needs and documenting the practices of researchers.

(Archivoz) Can you clarify the concept of ‘data archive’?

Data archives have been around for some time, but they are more relevant nowadays than ever. Essentially, data archives are like traditional archive institutions but dedicated to preserving research data and making them available for reuse. Much scientific research is performed thanks to public funding; the ensuing data therefore belong to the public and must be open for reuse. The nature of research data, the archival precautions they require, and the particular needs of scientists make it necessary to build specific archive facilities: data archives.

(Archivoz) We would like to know more about SODA’s relationship with scientists.

Scientists are our key users: both our main data providers and the prime potential data reusers. Their needs in terms of research data management, both for the phase of data archiving (ingest in OAIS terms) and for data reuse (access), must be regularly surveyed and accounted for.

In the future, we will also investigate whether other types of users might be interested in accessing social science research data. Could journalists, teachers or genealogists find it interesting to integrate such datasets into their corpora? This entails a proactive policy to foster new user communities.

(Archivoz) Tell us about standards and formats in SODA.

At first sight, the number of formats and standards in such a context can be daunting! Data-wise, it’s not so bad because most files produced by researchers either already exist in open formats or can be converted using the data ingest and dissemination software Dataverse.

But in terms of metadata, things get slightly more complicated. Most CESSDA members follow the international standard for documenting datasets in social sciences, the Data Documentation Initiative (DDI). DDI encourages the recording of a wide range of information about datasets, including:

  • administrative metadata: dataset producers, principal investigators, sponsoring organization(s), repository responsible for providing access, etc.
  • technical and descriptive metadata specific to the social sciences: methods of data collection, cleaning and control operations, aggregation and analysis, kind(s) of data, the universe of the study, variables, etc.

Incorporating traditional archivists into the project highlighted the lack of historical metadata, which are meant to describe in detail the context in which datasets were produced, with information such as biographies of researchers, descriptions of research centres, contextualisation, etc. These metadata have a longer-term focus than DDI, which addresses the immediate needs of social scientists seeking reusable data. Historical metadata will help researchers 10, 20, even 50 or 100 years from now to understand how and why datasets were produced. Such metadata will likely have to be recorded by historians and social scientists with an interest in the history of science.

(Archivoz) What are your thoughts on the new movements for opening research data in all scientific areas?

As Ron Dekker noted in Lisbon in May 2017, private pharmaceutical companies share their data in a sort of “pre-competitive stage” because they know they will all greatly benefit from doing so even though they are commercial rivals. Is the same tendency spreading to all scientific fields? Hopefully, this rather denotes a new spirit of sharing.

From a legal perspective, “open data” is no longer just an invitation. Since the publication of the European directive 2019/1024 on open data and public sector information (the third of the PSI directives), it has become law: publicly-funded research data must be open for reuse by default through an institutional repository, in accordance with the “as open as possible, as closed as necessary” rule of thumb and the FAIR principles. It will be very interesting to see in the coming years how this directive can open up more publicly-funded research data, how researchers and data archives will adapt to it, and whether it can contribute to new developments toward open science and linked research data.

(Archivoz) What are the main problems you have faced?

We seek to offer tools that fit the needs of our users by customizing the Dataverse software to simplify the deposit procedure and ease the search process for reusable data. This involves reconciling the needs of several stakeholders. We are currently gathering beta-testers among Belgian social science researchers (our key users) to this end.

SODA is part of a European consortium, so we must also work for researchers abroad. For example, this entails translating the title and description of our datasets into English. We must also allow the CESSDA Data Catalogue (CDC) to harvest our metadata through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and comply with the CESSDA Core Metadata Model, a list of conceptual metadata elements to which CESSDA service providers must map their own metadata so that the metadata harvesting can take place.
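To make the harvesting step more concrete, here is a minimal sketch of how a harvester might build an OAI-PMH request for a repository's records. It is an illustration only, not SODA's or the CDC's actual setup: the host name is made up, the `/oai` path is an assumption, and the helper `list_records_url` is a hypothetical name. Only the protocol-level details (the `ListRecords` verb, the `metadataPrefix`, `set` and `resumptionToken` arguments, and the mandatory `oai_dc` format) come from the OAI-PMH specification.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, made up for illustration; the "/oai" path is an
# assumption, not a confirmed SODA address.
BASE_URL = "https://dataverse.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords"}
    if resumption_token is not None:
        # OAI-PMH treats resumptionToken as an exclusive argument: it is
        # sent alone with the verb to fetch the next page of a result list.
        params["resumptionToken"] = resumption_token
    else:
        # oai_dc (unqualified Dublin Core) is the one metadata format that
        # every OAI-PMH repository must support.
        params["metadataPrefix"] = metadata_prefix
        if set_spec is not None:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# First request, then a follow-up using a token returned by the server.
first_page = list_records_url(BASE_URL)
next_page = list_records_url(BASE_URL, resumption_token="abc123")
```

A harvester would repeat the second kind of request until the repository stops returning a resumption token, then map each harvested record to the elements of the target metadata model.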

Legally speaking, the main problem is the lack of clarity of certain obligations, which leads to uncertainty among researchers, who will prefer to “play it safe” and not open up their research data. We are wrapping up our research on legal open data obligations and writing the standard licenses and guidelines that will ensure researchers are aware of the implications, but also and especially the advantages, of making their data “as open as possible.”

(Archivoz) What importance do you consider that research data and projects such as SODA have in the lives of citizens?

It might seem that data archives are only remotely relevant to citizens’ interests, since they are such a niche infrastructure with such complex technical needs and purposes. But just like traditional archives, data archives play an essential part in a modern society; the archives of science are vital for scientific and social progress.

Reevaluation and reproducibility are fundamental to the credibility of science. However, they can occur only if documents and data from studies are preserved. For example, the Stanford Prison Experiment, long presented as incontrovertible proof of the evil and corruptibility of humans, was recently reevaluated (1) (2), and its soundness was seriously called into question when measured against modern standards for rigorous and ethical scientific experimentation.

(Archivoz) If you had to highlight an element of the project, what would it be?

Legally, there are several challenging yet interesting issues, for example how the Belgian Archival Act affects a data archive. According to the Archival Act, archived files only become accessible after 30 years. Yet this is not how a data archive tends to operate. Here the principle is that all data must be open unless there is a good reason to restrict access (copyright, personal data, other). SODA will work through private deposit agreements for archiving and opening up the data.

(Archivoz) How do you imagine the future of data archives in social sciences in a few years?

Data archives will take on increasingly larger and more complex databases. Such a challenge can be tackled by reinforcing the network of data archives — of which CESSDA is a sterling example — and by sharing experience and know-how. We also think that social science data will become a booming business (if it is not already, all things considered). An extra challenge in this respect will be to keep the focus of our efforts on science and not so much on financial gain.