As data-sharing becomes more prevalent, so do discussions of the important topics surrounding data-sharing. Data citation and linking data to papers, metadata standards, infrastructure for data-sharing, legal aspects of re-using data, and so on are all topics that I have seen discussed quite frequently at places like the annual IASSIST conference and more broadly in the data curation & data-sharing community.
However, one topic that I haven’t seen discussed is something that I wonder about a lot myself. What do we mean by “reproducible research” and “replication,” and how does this interface with sharing data?
What do we mean by “replication”?
One of the main rationales for requiring and/or encouraging researchers to share data is that doing so will make it possible to replicate their research.
Let’s pause, since this can get confusing. The words “replication” and “reproducible” tend to be used in different ways, varying by field or sometimes even by researcher. I see two main categories of activities, both of which are sometimes called “replication”:
- Re-analysis/robustness checks of the original study, using original data/code.
- Conducting a new study with new data collection, similar in some or (almost) all ways to the original.
Because of the proliferation of naming schemes (I’ve seen about a dozen papers or blog posts with suggestions), I’m a little wary of adding my own here. But because it’s much easier to use a single word, I’m going to refer to these two basic kinds of replication with the labels “re-analyzing” and “reproducing” a study, respectively.
Data-sharing and re-analysis:
Data-sharing is often connected explicitly to reproducibility of some kind. For example, many journals justify requiring researchers to share the data/code underlying published results on the grounds that the work can then be replicated. (Note: the rationale is not that peer reviewers can inspect the data/code as they judge the paper’s reliability, at least not in the social sciences, which I’m more familiar with. Even when these materials are required at publication, it seems uncommon for them to be used at all beforehand.)
The basic idea, of course, is that anyone who is interested can go ahead and check your published results using the raw materials used to create them. As an added benefit, you’re more likely to be careful in checking your work if you have to share data/code, in addition to just your summary of end results.
Data-sharing and checking the reliability of the analysis:
Here’s the question that I wonder about: to what extent does what is (normally) shared allow someone to actually check the reliability of the analysis?
When researchers share data and code for journal requirements, often this will be:
- A subset of the data that was collected, e.g., only what was used in the published results.
- A subset of the code used to produce the final results, e.g., the analysis code used to produce the tables. There is plenty of code that precedes this final-stage code: all the code used to carry out operations like cleaning the data, merging datasets, transforming the collected data into new variables used in the analysis, and so on.
So, what does it mean to check the analysis using these materials? It means running the end-stage materials, essentially to see that the code runs without producing errors and that the numbers in the tables match what’s reported in the paper. It might also mean going through the code to see that the end-stage analysis does what is described (e.g., a regression controlling for XYZ).
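To make the first kind of check concrete, here is a minimal sketch, assuming the re-run code yields a set of named numbers and the paper reports the same quantities rounded to a fixed number of decimals. The function names and the dictionary layout are hypothetical, chosen only for illustration.

```python
def matches_reported(recomputed, reported, decimals=2):
    """True if the recomputed value equals the published one after rounding."""
    tolerance = 0.5 * 10 ** -decimals
    return abs(round(recomputed, decimals) - reported) < tolerance


def compare_table(recomputed, reported, decimals=2):
    """Return names of published entries that are missing or do not match."""
    mismatches = []
    for name, published in reported.items():
        value = recomputed.get(name)
        if value is None or not matches_reported(value, published, decimals):
            mismatches.append(name)
    return mismatches
```

An empty list from `compare_table` only says that the shared end-stage code reproduces the published numbers; it says nothing about how the shared dataset itself was built.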
But what you can’t check with these materials are, in a way, “deeper,” potentially more problematic things such as:
- What decisions were made in the process of cleaning and transforming the data? For example: were there outliers excluded from the originally collected data (and if so, why)? When the variables were transformed, was this done as described (if it was described)? Were datasets merged properly?
- Were there choices about what to report, and how to report it? If only a subset of the data is shared, checking whether there was selective reporting of outcomes isn’t possible. It’s not even possible to know what was collected, unless the researcher mentions it in the paper. Were only certain age groups or other subgroups reported? Only a few outcomes of many surveyed? If you controlled for other variables, would that change the reported results? It’s hard to tell if the full range of variables that you could control for isn’t included in the data that is shared.
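One way to make cleaning and merging decisions checkable is to have the code record them as it runs. The following is a rough sketch under that assumption, using plain lists of dictionaries as stand-ins for datasets; the helper names, the log format, and the field names are all hypothetical.

```python
def drop_outliers(rows, field, lower, upper, reason, log):
    """Drop rows whose `field` lies outside [lower, upper]; record the decision."""
    kept = [row for row in rows if lower <= row[field] <= upper]
    log.append({
        "step": "drop_outliers",
        "field": field,
        "bounds": (lower, upper),
        "reason": reason,  # the "why" that a reader cannot otherwise recover
        "rows_before": len(rows),
        "rows_after": len(kept),
    })
    return kept


def merge_on_key(left, right, key, log):
    """Inner-join two lists of dicts on `key`; record keys that found no match."""
    right_by_key = {row[key]: row for row in right}
    left_keys = {row[key] for row in left}
    right_keys = set(right_by_key)
    merged = [{**row, **right_by_key[row[key]]}
              for row in left if row[key] in right_by_key]
    log.append({
        "step": "merge_on_key",
        "key": key,
        "unmatched_left": sorted(left_keys - right_keys),
        "unmatched_right": sorted(right_keys - left_keys),
    })
    return merged
```

Sharing a log like this alongside the data would let a reader see which observations were excluded and why, and whether a merge silently dropped records.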
By “start-to-finish reproducibility,” I mean: sharing data and code such that someone could track what you did from data collection to the point that you published results. Currently, when data and code are shared, what this sharing allows is much more often “partial reproducibility.” That is, what is shared is a subset of the materials used to produce the end results, and so start-to-finish reproducibility isn’t possible.
So the question is: should researchers aim to share materials that would allow for start-to-finish reproducibility? Is that the ideal?
I don’t know the answer to this question. But here are some further reflections:
- Start-to-finish reproducibility is very difficult, especially if you don’t set out to do it from the beginning. There are plenty of hard problems to tackle, if one wants the ideal. For one thing, keeping track of all the analysis files and what they do from start to finish is a difficult thing, particularly when one has multiple research assistants helping out throughout the life of the data (collection, cleaning, analysis). It seems far from the current norm to have a master .do file that can pull in all the disparate code that transforms the raw data all the way through to the published tables.
- Aiming to get data and code into comprehensible and well-organized shape early on makes start-to-finish reproducibility much easier. My sense is that what we need are good guidelines (and implementation of those guidelines) for structuring files, writing code, and managing data (e.g., labeling variables) throughout the study. With some effort from the outset, start-to-finish reproducibility is likely to be much more feasible.
- It would be great to see more public discussion of how valuable it is to aim for and achieve start-to-finish reproducibility. My impression is that reproducible research is often just referred to as a goal without much discussion of what this means (e.g. partial vs start-to-finish reproducibility).
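For readers who haven’t seen one, the “master file” idea mentioned in the first reflection can be sketched as a single script that runs every stage in a fixed order. This is a Python analogue of a master .do file, not a description of any particular project’s setup; the stage file names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Hypothetical stage scripts, listed in the order they must run.
STAGES = [
    "01_clean.py",
    "02_merge.py",
    "03_construct_variables.py",
    "04_tables.py",
]


def run_all(code_dir, stages=STAGES):
    """Run each stage script in order, from raw data through to the tables."""
    code_dir = Path(code_dir).resolve()
    # Check that every stage exists before running anything.
    missing = [name for name in stages if not (code_dir / name).exists()]
    if missing:
        raise FileNotFoundError(f"missing stage scripts: {missing}")
    completed = []
    for name in stages:
        # check=True stops the pipeline at the first stage that fails.
        subprocess.run([sys.executable, str(code_dir / name)],
                       check=True, cwd=code_dir)
        completed.append(name)
    return completed
```

The point of the design is that the order of operations lives in one place, so someone new to the project can see the whole path from raw data to published tables without reconstructing it from file timestamps or memory.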