Monday, August 11, 2014

Data-sharing and reproducibility

As data-sharing becomes more prevalent, so do discussions of the issues surrounding it. Data citation and linking data to papers, metadata standards, infrastructure for data-sharing, legal aspects of re-using data, and so on are all discussed quite frequently at places like the annual IASSIST conference and more broadly in the data curation and data-sharing community.

However, one topic that I haven’t seen discussed is something that I wonder about a lot myself. What do we mean by “reproducible research” and “replication,” and how does this interface with sharing data?

What do we mean by “replication”?

One of the main rationales for requiring and/or encouraging researchers to share data is that doing so will make it possible to replicate their research. 

Let’s pause, since this can get confusing. The words “replication” and “reproducibility” tend to be used in different ways, varying by field or sometimes even by researcher. I see two main categories of activities, both of which are sometimes called “replication”:
  • Re-analysis/robustness checks of the original study, using original data/code.
  • Conducting a new study with new data collection, similar in some or (almost) all ways to the original.
We could get more fine-grained, but these seem to be the basic categories. 

Because of the proliferation of naming schemes (I’ve seen about a dozen papers or blog posts with suggestions), I’m a little wary of adding my own here. But because it’s much easier to use a single word, I’m going to refer to these two basic kinds of replication with the labels “re-analyzing” and “reproducing” a study, respectively.


Data-sharing and re-analysis: 


Data-sharing is often connected explicitly to reproducibility of some kind. It’s one of the main justifications many journals give for requiring that researchers share the data/code underlying published results: so that the work can be replicated. (Note: the point is not that peer reviewers can see the data/code as they decide on a paper’s reliability, at least not in the social sciences, which I’m more familiar with. Even when these materials are required at publication, it seems uncommon for them to be used at all beforehand.)


The basic idea, of course, is that anyone who is interested can go ahead and check your published results using the raw materials used to create them. As an added benefit, you’re more likely to be careful in checking your work if you have to share data/code, in addition to just your summary of end results.


Data-sharing and checking the reliability of the analysis:


Here’s the question that I wonder about: to what extent does what is (normally) shared allow someone to actually check the reliability of the analysis? 


When researchers share data and code to meet journal requirements, what is shared is often:


  1. A subset of the data that was collected, e.g., the data used in the published results.
  2. A subset of the code used to produce the final results, e.g., the analysis code used to produce the tables. There is plenty of code that precedes this final-stage code: all the code used to clean the data, merge datasets, transform the collected data into the new variables used in the analysis, and so on.

So, what does it mean to check the analysis using these materials? It means running the very end-stage materials: seeing that the code runs without producing errors, and that the numbers in the tables match up with what’s reported in the paper. It might also mean going through the code to see that the end-stage analysis does what is described (e.g., a regression controlling for XYZ).
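
To make that concrete, here is a minimal sketch of what such a check might look like. Everything in it is hypothetical - the file name analysis_data.csv, the regression formula, and the reported coefficient all stand in for whatever actually appears in a given replication package and paper:

# A minimal, hypothetical sketch of a "partial reproducibility" check: re-run
# the shared end-stage analysis and compare the result to the published table.
import pandas as pd
import statsmodels.formula.api as smf

REPORTED_COEF = 0.42  # hypothetical value, copied by hand from the paper's tables

df = pd.read_csv("analysis_data.csv")  # the shared (end-stage) analysis subset

# The end-stage analysis described in the paper (e.g., a regression controlling for age and income).
model = smf.ols("outcome ~ treatment + age + income", data=df).fit()
estimate = model.params["treatment"]

# The check amounts to: the code runs without errors, and the numbers match.
print(f"reported: {REPORTED_COEF}, re-computed: {estimate:.2f}")
assert abs(estimate - REPORTED_COEF) < 0.005, "estimate does not match the published table"

Useful, but notice that nothing in this check touches the steps that produced analysis_data.csv in the first place.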

But what you can’t check with these materials are “deeper,” potentially more problematic things, such as:

  • What decisions were made in the process of cleaning and transforming the data? For example: were outliers excluded from the originally collected data (and if so, why)? When variables were transformed, was this done as described (if it was described at all)? Were datasets merged properly? (A sketch of what this kind of upstream code might look like follows this list.)
  • Were there choices about what to report, and how to report it? If only a subset of the data is shared, checking whether outcomes were selectively reported isn’t possible. It’s not even possible to know what was collected, unless the researcher mentions it in the paper. Were only certain age groups or other subgroups reported? Only a few of the many outcomes surveyed? If you controlled for other variables, would that change the reported results? It’s hard to tell if the full range of variables you could control for isn’t included in the data that is shared.
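
To illustrate, here is a hypothetical sketch of the kind of upstream code that usually isn’t shared. The file names, cutoff, and variables are all invented; the point is that each step embodies a decision that leaves no trace in the end-stage materials:

# Hypothetical upstream cleaning/transformation code - the part that is rarely shared.
import numpy as np
import pandas as pd

raw = pd.read_csv("survey_raw.csv")  # the full collected data (typically not shared)

# Decision 1: exclude outliers. Why this cutoff, and is it reported anywhere?
cleaned = raw[raw["income"] < raw["income"].quantile(0.99)]

# Decision 2: construct the analysis variable. Is this the transformation described in the paper?
cleaned = cleaned.assign(log_income=np.log1p(cleaned["income"]))

# Decision 3: keep only certain subgroups and variables for the shared subset.
analysis = cleaned.loc[cleaned["age"].between(18, 65),
                       ["outcome", "treatment", "age", "income", "log_income"]]
analysis.to_csv("analysis_data.csv", index=False)  # only this file gets shared

None of these choices can be checked - or even seen - if only analysis_data.csv and the end-stage analysis script are shared.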

The question: partial vs. start-to-finish reproducibility?

By “start-to-finish reproducibility,” I mean sharing data and code such that someone could track what you did from data collection to the point that you published results. Currently, when data and code are shared, what this sharing much more often allows is “partial reproducibility.” That is, what is shared is a subset of the materials used to produce the end results, and so start-to-finish reproducibility isn’t possible.


So the question is: should researchers aim to share materials that would allow for start-to-finish reproducibility? Is that the ideal? 


Some further reflections: 

  • Start-to-finish reproducibility can be very difficult if you don’t set out to do it from the beginning. For one thing, keeping track of the code (cleaning, transforming variables, and so on) from start to finish is a difficult thing, particularly when one has multiple research assistants helping out throughout the life of the data (collection, cleaning, analysis).
  • Aiming to get data and code into comprehensible and well-organized shape early on makes it much easier. My sense is that what we need are good guidelines (and implementation of those guidelines) for structuring files, writing code, and managing data (e.g., labeling variables) throughout the study. With some effort from the outset, start-to-finish reproducibility is likely to be much more feasible (a rough sketch of what this might look like follows this list).
  • It would be great to see more public discussion of how valuable it is to aim for and achieve start-to-finish reproducibility. My impression is that reproducible research is often referred to as a goal without much talk of what it means (e.g., partial vs. start-to-finish reproducibility). Connecting conversations about reproducible research to what it means in practice -- and, importantly, to best practices for achieving it -- seems essential for moving forward.
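
For what it’s worth, here is a rough sketch of what this could look like in practice: a single driver script that runs every stage of a study’s code, raw data to published tables, in order. The stage names and file names are invented, and the details will differ by project and by language; the point is simply that the whole chain is scripted and therefore shareable:

# A hypothetical "run everything" driver script for start-to-finish reproducibility.
# Each stage is its own script; running this file reproduces the whole chain,
# from the raw collected data to the tables reported in the paper.
import subprocess
import sys

PIPELINE = [
    "01_clean_raw_data.py",       # outlier and exclusion decisions live (and are documented) here
    "02_merge_datasets.py",       # merge keys and any dropped records documented here
    "03_construct_variables.py",  # every transformation of the collected variables
    "04_run_analysis.py",         # the end-stage analysis journals usually require
    "05_make_tables.py",          # writes out the numbers that appear in the paper
]

for script in PIPELINE:
    print(f"running {script} ...")
    result = subprocess.run([sys.executable, script])
    if result.returncode != 0:
        sys.exit(f"{script} failed - the chain is not reproducible past this point")

Sharing a package organized along these lines - rather than just the last script and the last dataset - is what would make start-to-finish reproducibility possible for someone else.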

Thursday, December 26, 2013

Great discussion of framing effects replication

Joseph Simmons and Leif Nelson recently wrote up the results of their attempt to replicate a framing-effects experiment by David Mandel. Mandel's experiment included an attempted replication of Tversky and Kahneman's "Asian Disease Problem" framing-effect experiment, with the original wording changed slightly to test its robustness: Mandel added the word "exactly" to rule out a misinterpretation of the original wording, which could potentially be read as meaning "at least." Simmons and Nelson attempted to replicate Mandel's results, and found a notably different outcome from Mandel's.

What I want to focus on is not the details of the discussion, as interesting as they are. For the details, I'd recommend reading Mandel's original paper, the Simmons/Nelson replication and Mandel's reply. Instead, I want to comment on some great features of the exchange.

First, the discussion was remarkably rapid. Mandel's replication study was published in August 2013. Simmons and Nelson responded with their own replication, posted on their blog, in December. Mandel responded in detail to their replication on his own blog within a few days of receiving the drafted post from Simmons/Nelson. It's great to be able to read not just what Simmons/Nelson found, but also Mandel's take on it, without long delays.

Second, not only was the discussion rapid, it was also high quality. Both Simmons/Nelson and Mandel seemed to engage really closely with what their experimental results showed, and both made the discussion accessible to readers.

Third, the tone is great. Simmons/Nelson point out why Mandel's replication is important and worth replicating:
The original finding is foundational and the criticism is both novel and fundamental. We read Mandel’s paper because we care about the topic, and we replicated it because we cared about the outcome.
Then, Mandel makes it clear right away that he is taking their replication in a collegial way. He titles it "AT LEAST I REPLIED: Reply to Simmons & Nelson's Data Colada blog post." Making a joke in his title is a nice signal of the spirit in which he takes their replication; he then adds:
First, let me say that when Joe contacted me, he noted that his investigation with Leif was conducted entirely in the scientific spirit and not meant to prove me wrong. He said they planned to publish their results regardless of the outcome. I accept that that’s so, and my own comments are likewise intended in the spirit of scientific debate.
A barrier to doing and discussing replications is that current academic incentives can make the practice awkward and professionally unrewarding. Researchers who replicate may not be able to publish their work, since it's often not seen as original enough, and the authors whose work is replicated may not welcome the effort. The former issue isn't something this exchange bears on, since the posts appeared on blogs, but it is clearly relevant to the latter.

Having examples of thoughtful exchanges like this one is a nice demonstration of what replication can be. At its best, it is detailed and thoughtful work that helps us sort through which effects we can more confidently rely on.

Thursday, December 5, 2013

Altmetrics and Cochrane reviews

"Altmetrics" is a term coined by Jason Priem, referring to a new, more comprehensive way of measuring the impact of scholarship. Whereas the usual ways of assessing an article's importance are where it is published (i.e., the journal, as ranked by impact factor) and its citation count, altmetrics aim to include measures how often it is discussed and mentioned in social media. This allows for a broader take on impact, as well as allowing impact measurement of a wider range of research outputs, such as datasets.

Altmetrics - in addition to being a new term - is also an organization. Its product is an embeddable icon and link to a scoring system. The system crawls social media sites (Facebook, Twitter, blogs, Reddit, etc.) for mentions of a particular paper, and then displays the number of mentions as well as links to them.
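
As a toy illustration (and emphatically not the organization's actual implementation or API), the core idea is straightforward: gather mentions of a paper from various sources and tally them per source while keeping the links themselves. All of the data below is made up:

# A toy, purely illustrative sketch of mention aggregation.
from collections import defaultdict

# Hypothetical mentions a crawler might have collected for one paper (identified by DOI).
mentions = [
    {"doi": "10.xxxx/example", "source": "twitter", "url": "https://twitter.com/a/status/1"},
    {"doi": "10.xxxx/example", "source": "blogs", "url": "https://someblog.example/post"},
    {"doi": "10.xxxx/example", "source": "twitter", "url": "https://twitter.com/b/status/2"},
]

counts = defaultdict(int)   # number of mentions per source
links = defaultdict(list)   # the links themselves, so readers can follow them

for m in mentions:
    counts[m["source"]] += 1
    links[m["source"]].append(m["url"])

for source in counts:
    print(source, counts[source], links[source])

The interesting part, from a reader's perspective, is that last list of links.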

While I'd heard of altmetrics (both the term and the organization) some time ago and in general appreciated this development, I hadn't seen the product in action until recently. The Cochrane Collaboration, which I've written about before, now embeds altmetrics in its abstract pages.

What this means is that when you search Cochrane summaries, and then look through a particular review's abstract, take for example "Antioxidant supplements for prevention of mortality in healthy participants and patients with various diseases," you can click on a link to an altmetrics page:

What I really liked while clicking through a few reviews and their associated altmetrics pages is that I could easily see (1) the extent to which a review has been discussed and, (2) to some extent, who had discussed it. What strikes me as great here is that it makes the small corner of the internet you might be obsessing over inter-connected in a way that it wasn't before. Of course we can always google to try to sleuth out who is saying what about a particular paper or subject, but this can be very time-consuming. When the sources are linked together through the altmetrics page, it's much quicker to find others thinking and writing about the same paper.

For instance, I'm curious to discover others who are writing about Cochrane reviews, especially ones that I've taken an interest in (such as the one above). Altmetrics gives me an easier way to find them.

My main suggestion for change, though, is that the altmetrics page displays only a "subset" of the relevant mentions, while I'd like to be able to scroll through all of them (or at least a larger subset!). The goal right now seems much more focused on quantifying the discussion than on what I find most valuable: easily finding others who are interested in the same topics.

Sunday, October 20, 2013

Research transparency landscape on Figshare

As part of research consulting work this past summer, I wrote a landscape review of funder data access policies and other related resources.

The write-up was originally shared informally with an email list of funders and interested researchers. But then a researcher requested that I put it on Figshare so that it could be cited in a paper she's writing. It occurred to me that this landscape might be useful to a wider audience interested in research transparency/data-sharing. So here it is:

http://figshare.com/articles/Data_Access_Policies_Landscape/827268.

This was my first upload to Figshare, and it made me even more aware than before that Figshare is a great site! Easy and pleasant interface to use, with Creative Commons licenses (CC-BY and CC0 for data) that accompany all uploads to the site. I recommend using it for sharing papers and data.

Careers, caring and the unexpected (Part III)

Deciding between careers is a tough process. There are a lot of factors. Location, job security, salary, enjoyment, value to others/the world, replaceability (how easily your position could be filled by someone who would do it as well as you) - just to name a few. I spent quite a bit of time mulling over the decision of whether to continue as a professor or not.

Leaving academia was a particularly difficult decision. For one thing, the academic job market is so competitive - hundreds of eager applicants for each tenure-track job - that it's hard to go back once you leave.

I also had to get beyond the feeling that being an academic was the only way to really be an intellectual. Looking back now that I no longer have this feeling, it seems ridiculous (of course you can be an intellectual without being a professor! why not?). But there's a sense of this within academia. It's not made explicit exactly, but I believe it's quite pervasive, at least in the humanities. I'll always remember a fellow grad student who, having decided not to go on the academic job market, printed and posted this article on the office door.

GiveWell offered me a full-time position after the summer trial period. In the end, after weighing everything up, the factor that really clinched my decision was that I wanted to be excited about my job. I knew what it was to have a really nice job, one that I was lucky to have. But I wanted to give and get more from my work. I gave notice at the college and, in January 2012, moved to NYC from Boston to work for GiveWell.

Right away, I loved being in New York. I had a great community of friends from college and elsewhere, and living in Brooklyn fit exactly what I was looking for. The job with GiveWell gave me the chance to work on a lot of interesting topics. I also really enjoyed always having people to talk to who had similar interests.

One of my favorite research topics quickly became "meta-research," GiveWell's term for initiatives aimed at improving research. This can involve a lot of things, but early on, the focus of GiveWell's work in this area was looking into the Cochrane Collaboration. As I've posted about, Cochrane does great systematic reviews of health interventions. I had a really interesting experience talking to a large number of people who work with Cochrane. On the basis of this research, GiveWell directed a grant to the US Cochrane Center (via Good Ventures).

The work on Cochrane became a gateway for me to other areas of meta-research. This work really fit into a main theme that had originally drawn me to GiveWell. I wanted better evidence for guiding decisions. As I began to learn more, it started to sink in that there are issues which affect not just philanthropic research but all research. Lack of transparency makes reported results less reliable, because we can't check them. Publication models which encourage and reward "interesting" results lead to a system where we can't trust that positive findings reflect how things really are (rather than what's likely to get published).

OK, so what's happened in the past year? (I'm going to speed through a bit, since it's harder to take a bird's-eye view of things that have happened in the past year as opposed to say, 5 years ago.)

First, GiveWell moved to San Francisco and I stayed in NYC. There were many reasons for this, some of them personal, but a big one for me was that I love New York and feel at home here and a part of a community. I've remained a big fan of GiveWell after moving on from being a researcher with the group. After some time considering my next step, I became a research consultant with another philanthropic advisor in NYC, which allowed me to follow up further on my interest in improving research.

Through that work, I thought of an idea for an initiative to increase the number of replications (i.e., re-analyses) of studies being done and shared. I've recently received a planning grant from a foundation to develop this project, and I'll be writing further posts on it as the work gets underway.


Careers, caring and the unexpected (Part II)

I emailed GiveWell immediately after finding the site, saying something effusive like "What you're working on is really great. Can I help in some way?" Elie (a co-founder of GiveWell) soon wrote back, and I began to do things like check footnotes and sources on GiveWell pages as a volunteer. It was summertime between semesters teaching, and I spent quite a lot of time on this. It might not sound super-exciting to check footnotes for errors, but my excitement about GiveWell carried over to the task.

In the fall, GiveWell asked if I'd be interested in part-time research consulting work. The research that I did initially was focused on the "Malthus" question of whether aid in developing countries might, through saving lives, lead to overpopulation and increased scarcity of resources. This is a big question, and I focused on the sub-question of the relation between child mortality and reductions in fertility (i.e., average children per woman). Some researchers - examples are Jeffrey Sachs and Hans Rosling - argue that there's a causal relation between the two. That is: if you save children's lives, this will lead to lower birthrates, as parents decide to have fewer children. Of course, as with many questions in development that involve lots of correlated variables, it's notoriously difficult to make well-supported causal inferences. In my research, I came across 50+ papers offering conflicting views on this question.

Through GiveWell work, I learned that I really enjoyed research on empirical questions. Prior to this, I'd always thought of myself as a "humanities person." In college, I took a lot of classes in literature, history and philosophy. In grad school, I focused on theoretical problems in epistemology and metaphysics. I'd been missing out on a side of myself - a side that really enjoys puzzling over applied questions, e.g., comparing the effectiveness of programs aimed at helping people, learning about statistics and empirical methods, and so on.

I kept doing GiveWell work during my second year as an assistant professor. The more of this kind of work I did, the more excited about it I felt. I started thinking about whether I wanted to stay in philosophy.

On the one hand, teaching philosophy was a really nice job. It involved having interesting conversations on a beautiful campus about topics that I found generally enjoyable. I didn't anticipate a torturous route to tenure because publishing requirements weren't sky-high (I'd have to work for it, but thought it would be manageable). I liked my colleagues and students. On the other hand, I didn't feel my heart beat faster at the thought of teaching philosophy for the next 30 years. I wasn't getting up in the morning eager to expand my understanding of topics I thought were important. But I did feel that way about the work I did for GiveWell.

It was a hard decision to make. I asked if I could work with GiveWell full-time over the following summer as a trial period. I spent much of the time researching the evidence on the effectiveness of cash transfers, and the two months went by very quickly.

(To be continued in the next post...)

Tuesday, October 15, 2013

Careers, caring and the unexpected (Part I)

I'm often interested in other people's stories of how they got to where they are now. I like Gig, in which people talk about their jobs. I like What Should I Do With My Life? (though I folded its bright red cover page over when I read it on the subway). So I too want to write about how I arrived where I am now.

I'm now interested in evidence, often in methodological questions. I work on and often think about what research we can trust and why, and how to improve research. These topics probably seem a bit abstract or removed. How did I come to care about these things?

It started in a soup kitchen. I was a 3rd year graduate student in philosophy, and I had enough time during a fellowship year to volunteer. I wanted to do something besides anxiously wading through one potential dissertation topic after another. So I went down to the soup kitchen one night, prepared to help. The people there for food all sat down, and waited as they were each brought a dinner tray. I stood in line with about 10 other volunteers, waiting to take my turn to bring someone a tray of food. I decided that there had to be some other organization that needed my help more.

I learned that mentoring organizations often needed volunteers and that the waitlist for mentors was hundreds of kids long. So I interviewed, and after passing the requisite checks, I was assigned to be a mentor to a 13-year-old in foster care. We met the following week, and she shyly answered the door. We chatted for a while, and then the next week I took her horseback riding. Every week she seemed to come a bit more out of her shell, as we went out to dinner, cooked at my house and played board games.

About a year after meeting my mentee, I started looking into foster care. I was pretty shocked to read the statistics about what happens to kids who "age out" of foster care at 18: high rates of homelessness, dropping out of high school, and so on. I cared about my mentee and, thinking about her future, I started to feel worried.

That led me to care about others in a similar situation to her, and to want to do more to help organizations that work with foster kids. I decided to take a summer while still in grad school to do a couple of internships with non-profits: first Covenant House in NYC, and then an organization called Youth Advocacy Center. I enjoyed these internships but also wondered about the impact they were having. I thought they were probably helping kids they worked with, but also just didn't see any evaluation of their programs. I started to feel a bit skeptical. I decided I'd continue in academic philosophy but spend time doing more non-profit work in my free time.

I got my PhD after my 5th year, and was lucky to land a tenure-track job at Stonehill College. I taught philosophy courses to undergrads. I continued to mentor, driving down to NYC every month to see my mentee and relatives there, and also started to volunteer for other organizations near Boston (e.g., Posse, which gives college scholarships, and Prison Books Program).

The summer after my first year as an assistant professor, I was thinking about why there wasn't a group doing research on the impact non-profits were having. I didn't think Charity Navigator was pursuing that very well - overhead seemed like a pretty poor measure - and was googling around to see if I could find any better research. I found GiveWell, and was immediately excited about the group. It was clearly what I'd been wanting to find.

[To be continued in the next post.]