As data-sharing becomes more prevalent, so do discussions of the important topics surrounding data-sharing. Data citation and linking data to papers, metadata standards, infrastructure for data-sharing, legal aspects of re-using data, and so on are all topics that I have seen discussed quite frequently at places like the annual IASSIST conference and more broadly in the data curation & data-sharing community.
However, one topic that I haven’t seen discussed is something that I wonder about a lot myself. What do we mean by “reproducible research” and “replication,” and how does this interface with sharing data?
What do we mean by “replication”?
One of the main rationales for requiring and/or encouraging researchers to share data is that doing so will make it possible to replicate their research.
Let’s pause, since this can get confusing. The words “replication” and “reproducible” tend to be used in different ways, varying by field or sometimes even by researcher. I see two main categories of activities, both of which are sometimes called “replication”:
- Re-analysis/robustness checks of the original study, using original data/code.
- Conducting a new study with new data collection, similar in some or (almost) all ways to the original.
Because of the proliferation of naming schemes (I’ve seen about a dozen papers or blog posts with suggestions), I’m a little wary of adding my own here. But because it’s much easier to use a single word, I’m going to refer to these two basic kinds of replication with the labels “re-analyzing” and “reproducing” a study, respectively.
Data-sharing and re-analysis:
Data-sharing is often connected explicitly to reproducibility of some kind. For example, it’s one of the main justifications many journals give for requiring researchers to share the data/code underlying published results: so that their work can be replicated. (Note: it’s not so that peer reviewers can inspect the data/code as they judge the reliability of the work, at least not in the social sciences, which I’m more familiar with. Even when these materials are required at publication, it seems uncommon for them to be used at all beforehand.)
The basic idea, of course, is that anyone who is interested can go ahead and check your published results using the raw materials used to create them. As an added benefit, you’re more likely to be careful in checking your work if you have to share data/code, in addition to just your summary of end results.
Data-sharing and checking the reliability of the analysis:
Here’s the question that I wonder about: to what extent does what is (normally) shared allow someone to actually check the reliability of the analysis?
When researchers share data and code for journal requirements, often this will be:
- A subset of the data that was collected, e.g., what was used in the published results.
- A subset of the code used to produce the final results, e.g., the analysis code used to produce the tables. There is plenty of code that precedes this final-stage code: all the code used to carry out operations like cleaning the data, merging datasets, transforming the collected data into new variables used in the analysis, and so on.
So, what does it mean to check the analysis using these materials? It means running the very end-stage materials, essentially to see that the code runs without producing errors and that the numbers in the tables match what’s reported in the paper. It might also mean going through the code to see that the end-stage analysis does what is described (e.g., a regression controlling for XYZ).
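To make that end-stage check concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the variable names, the "reported" numbers, and the `compare_to_paper` helper are made up for illustration, and the rounding tolerance assumes the paper prints two decimals.

```python
# Hypothetical coefficients as printed in a paper's table (two decimals).
reported = {"treatment": 0.42, "age": -0.03, "income": 1.17}

# Hypothetical coefficients obtained by re-running the shared analysis code.
reproduced = {"treatment": 0.4183, "age": -0.0312, "income": 1.2049}


def compare_to_paper(reported, reproduced, decimals=2):
    """Return names of estimates that differ by more than rounding error."""
    tolerance = 0.5 * 10 ** (-decimals)
    mismatches = []
    for name, published in reported.items():
        rerun = reproduced.get(name)
        # A missing estimate counts as a mismatch too.
        if rerun is None or abs(rerun - published) > tolerance:
            mismatches.append(name)
    return mismatches


mismatches = compare_to_paper(reported, reproduced)
print(mismatches)  # ['income']
```

A check like this only confirms that the final step reproduces the printed table. It says nothing about how the input to that step was built.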
But what you can’t check with these materials are, in a way, “deeper,” potentially more problematic things such as:
- What decisions were made in the process of data-cleaning and variable construction? For example: were there outliers excluded from the originally collected data (and if so, why)? When the variables were constructed from raw data, which choices were made and why? Were datasets merged properly?
- Were there choices about what to report, and how to report it? If only a subset of the data is shared, checking whether outcomes were selectively reported isn’t possible. It’s not even possible to know what was collected, unless the researcher mentions it in the paper. Were only certain age groups or other subgroups reported? Only a few outcomes of many surveyed? If you controlled for other variables, would that change the reported results? That is hard to tell if the full range of variables that you could control for isn’t included in the data that is shared.
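One way the cleaning decisions listed above could be made checkable is to have each cleaning step write down what it did. The sketch below is only an illustration under made-up assumptions: the survey rows, the income cutoff, the `region_lookup` table, and the helper names are all hypothetical, not taken from any real study.

```python
# Hypothetical raw survey rows and a lookup table to be merged in.
raw_survey = [
    {"id": 1, "age": 34, "income": 41000},
    {"id": 2, "age": 29, "income": 38500},
    {"id": 3, "age": 51, "income": 9900000},  # implausibly high value
    {"id": 4, "age": 45, "income": 52000},
]
region_lookup = {1: "north", 2: "south", 3: "north"}  # id 4 has no match

decision_log = []  # one entry per cleaning decision


def exclude_outliers(rows, field, upper, reason):
    """Drop rows above a cutoff and record which ids were dropped and why."""
    kept = [r for r in rows if r[field] <= upper]
    dropped = [r["id"] for r in rows if r[field] > upper]
    decision_log.append({"step": "exclude_outliers", "field": field,
                         "upper": upper, "dropped_ids": dropped,
                         "reason": reason})
    return kept


def merge_region(rows, lookup):
    """Attach a region to each row and record ids that failed to match."""
    merged = [dict(r, region=lookup.get(r["id"])) for r in rows]
    unmatched = [r["id"] for r in merged if r["region"] is None]
    decision_log.append({"step": "merge_region", "unmatched_ids": unmatched})
    return merged


cleaned = merge_region(
    exclude_outliers(raw_survey, "income", 1_000_000,
                     "income above plausible range"),
    region_lookup,
)
```

If a log like `decision_log` were shared alongside the final data, a reader could at least see that one respondent was excluded, on what grounds, and that one row did not merge cleanly.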
By “start-to-finish reproducibility,” I mean: sharing data and code such that someone could track what you did from data collection to the point that you published results. Currently, it seems to me that shared materials much more often allow only for “partial reproducibility.”
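As a rough picture of what a start-to-finish trail might look like in code, the sketch below runs every stage from raw data to results through one entry point and fingerprints the data after each stage. The stage functions, the toy data, and the pass threshold of 50 are all hypothetical. A real project would have many more steps, and this is just one possible design.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def clean(rows):
    # Drop rows with a missing score.
    return [r for r in rows if r["score"] is not None]


def construct(rows):
    # Build the analysis variable from the raw score (threshold is made up).
    return [dict(r, passed=r["score"] >= 50) for r in rows]


def analyze(rows):
    # Produce the numbers that would end up in the paper's table.
    return {"n": len(rows),
            "pass_rate": sum(r["passed"] for r in rows) / len(rows)}


STAGES = [clean, construct, analyze]


def run_pipeline(raw):
    """Run all stages in order, recording a fingerprint after each one."""
    trail = [("raw", fingerprint(raw))]
    data = raw
    for stage in STAGES:
        data = stage(data)
        trail.append((stage.__name__, fingerprint(data)))
    return data, trail


raw = [
    {"id": 1, "score": 72},
    {"id": 2, "score": None},
    {"id": 3, "score": 41},
    {"id": 4, "score": 65},
]
results, trail = run_pipeline(raw)
```

Partial reproducibility, in these terms, is sharing only the input and code for `analyze`. Start-to-finish reproducibility means someone else can run the whole chain and compare each fingerprint in `trail` with yours.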
So the question is: should researchers aim to share materials that would allow for start-to-finish reproducibility? Is that the ideal and what we should be aiming for?
Some further reflections:
- Start-to-finish reproducibility can be very difficult if you don’t set out to do it from the beginning. For one thing, keeping track of the code (cleaning, transforming variables, and so on) from start to finish is a difficult thing, particularly when one has multiple research assistants helping out throughout the life of the data (collection, cleaning, analysis).
- There is also a question of how to achieve this full reproducibility when running the full process start to finish would involve PII (personally identifiable information). Obviously, PII would never be shared publicly.
- Aiming to get data and code into comprehensible and well-organized shape early on makes this much easier. My sense is that what we need are good guidelines (and implementation of those guidelines) for structuring files, writing code, and managing data (e.g., labeling variables) throughout the study. With some effort from the outset, “start-to-finish reproducibility” is likely to be much more feasible. Some groups in social science, such as the Berkeley Initiative for Transparency in the Social Sciences, are making progress on creating such a guide.
- It would be great to see more public discussion of how valuable it is to aim for and achieve start-to-finish reproducibility. My impression is that in some discussions, reproducible research is referred to as a goal without much talk of what this means (e.g., partial vs. start-to-finish reproducibility). Connecting conversations about reproducible research to what this means, and importantly to best practices for doing it, seems pretty essential for moving forward.
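On the PII point in the list above, one possible arrangement is to keep the identifying step private and have it emit a shareable file with direct identifiers removed and a keyed pseudonymous ID in their place. The sketch below is an assumption-laden illustration: the field names, the `SECRET_KEY` placeholder, and the helper functions are hypothetical. Removing direct identifiers does not by itself guarantee that a dataset is de-identified, since combinations of the remaining variables can still identify people.

```python
import hashlib
import hmac

# Hypothetical placeholder; a real key would be kept out of anything shared.
SECRET_KEY = b"replace-with-a-private-key"

# Hypothetical list of columns treated as direct identifiers.
PII_FIELDS = ("name", "email")


def pseudonym(value, key=SECRET_KEY):
    """Keyed hash of an identifier, shortened for readability."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def make_shareable(rows):
    """Drop direct identifiers and add a stable pseudonymous id."""
    shareable = []
    for row in rows:
        public = {k: v for k, v in row.items() if k not in PII_FIELDS}
        public["pid"] = pseudonym(row["email"])
        shareable.append(public)
    return shareable


restricted = [
    {"name": "A. Example", "email": "a@example.org", "age": 34, "outcome": 1},
    {"name": "B. Example", "email": "b@example.org", "age": 51, "outcome": 0},
]
public_rows = make_shareable(restricted)
```

Under this arrangement, the code for every step can still be shared, and everything downstream of `make_shareable` can be rerun by anyone, even though the restricted input cannot be.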