Scientific reproducibility, data management, and inspiration

“Science moves forward by corroboration–when researchers verify others’ results,” the journal Nature states in its July special edition on Challenges in Irreproducible Research.  “There is a growing alarm about results that cannot be reproduced. . . . Journals, scientists, institutions and funders all have a part in tackling reproducibility.”

Librarian Elisabeth Long (left) discusses a data management plan with Professor Stefano Allesina. (Photo by Joel Wintermantle)

Science faculty across the disciplines are increasingly taking up the challenge to publish their research in ways that are more easily reproduced, and librarians are collaborating with these researchers to ensure that rigorously collected data, metadata, and algorithms are preserved and made accessible to the research community.

“Many of these efforts revolve around teaching, planning, and practicing excellent data management throughout the research life cycle, from grant writing to publication,” said Elisabeth Long, Associate University Librarian for Information Technology and Digital Scholarship.  “The University of Chicago Library is offering a growing set of data management research and teaching services that help UChicago scientists win grants and produce and publish reproducible results that will shape the future of their fields.”

Teaching good data management from the beginning

The UChicago Biological Sciences Division recently played a leading part in improving graduate education in its discipline by developing a National Science Foundation-funded course called Responsible, Rigorous, and Reproducible Conduct of Research: R3CR.  All UChicago first-year BSD graduate students are required to take the course, learning how to use current methods in computational biology in an ethical and reproducible way.  Elisabeth Long has partnered with the course’s creators, Professors Victoria Prince, Stefano Allesina, and Stephanie Palmer, to provide a class session that introduces students to the principles of data management in the lab setting.

“Biology produces a lot of data, and we have seen the kind of mistakes that people can make that are terrifying,” Professor Allesina said. “Elisabeth talked a lot about how you make sure that you’re keeping your data safe throughout your thesis research: how you should name your files, where you should save your files, how you make sure they are saved for posterity, and where there are institutional repositories or online repositories where you can publish your data.”

The Library is partnering with researchers across campus to develop practices and tools that can facilitate the kind of recordkeeping and data curation that is currently demanded of scientists.  Librarians are offering workshops and training sessions that prepare University of Chicago students to graduate with exceptional data management and preservation skills.

Electronic lab notebooks and data management plans

This Autumn Quarter, the Library’s new Center for Digital Scholarship begins offering drop-in consultation hours and customized one-on-one sessions to help faculty with their data management plans, including choosing between the University’s Knowledge@UChicago research repository and disciplinary archives for preserving and sharing research outputs.

The Center will also offer advice on selecting and using research management tools such as electronic lab notebooks and the Open Science Framework.  Research management tools provide platforms where faculty can centralize all their research activities, enabling easy file management, version control, protocol sharing, analysis activities, email, and other interactions between members of a lab. “One challenge confronting researchers is choosing from among the many existing systems,” Long said. “The Center for Digital Scholarship’s consultation services can pair librarians with individual faculty members, or bring sessions to your labs to explore the best solution for your particular research scenario.”

When the data don’t stand alone

Complex research workflows that present particular challenges for reproducibility often occur in fields where data are processed multiple times before final analysis. “In such cases, preserving the data alone is insufficient to support reproducibility,” Long explained. “The computational code for processing the data must also be preserved along with its relation to the data at various stages of processing.”
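As a rough illustration of what preserving code "along with its relation to the data" can mean in practice, the following minimal Python sketch records, for each processing step, which script turned which input files into which output files, with a checksum for every file. The function names and the manifest layout are hypothetical, chosen only for this example; they are not a format used by the Library or by any particular tool.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_step(manifest, script, inputs, outputs):
    """Append one processing step to the manifest (a plain list of dicts).

    Each entry ties a script to the files it read and the files it wrote,
    so a later reader can check that they hold the same versions.
    """
    def fingerprint(paths):
        return [{"path": str(p), "sha256": sha256_of(p)} for p in paths]

    manifest.append({
        "step": len(manifest) + 1,
        "script": fingerprint([script])[0],
        "inputs": fingerprint(inputs),
        "outputs": fingerprint(outputs),
    })
    return manifest


def save_manifest(manifest, path):
    """Write the manifest as indented JSON so it can be archived with the data."""
    Path(path).write_text(json.dumps(manifest, indent=2))
```

A lab might call `record_step` once after each stage of processing and deposit the saved manifest alongside the data and scripts. Someone repeating the analysis could then recompute the checksums to confirm they are using the same files at each stage.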

Marco Govoni, a researcher at the Institute for Molecular Engineering and Argonne National Laboratory, has been developing a tool for mapping and documenting these relationships.  Qresp: Curation and Exploration of Reproducible Scientific Papers (at qresp.org) guides researchers through the process of documenting the relationships among the datasets, scripts, tools, and notebooks used in creating a scientific paper. Librarians are working with Govoni to explore ways in which the Library could support his work and potentially integrate it with the Library’s new institutional repository platform.

Data and inspiration

In consulting with librarians, faculty sometimes discover unexpected sources of data, inspiring new research projects.  When Long was talking to the R3CR class about data management and how students will submit their dissertations to ProQuest, a national dissertation repository, Professor Allesina began to consider the value its metadata could provide for the study of careers in science.  “There’s a lot of interest in trying to see if we can improve the situation in the sciences by increasing representations, for example, of women or minorities,” Allesina explained, “but one thing that we lack is some sort of longitudinal analysis, because once PhD students are out the door, it’s very difficult to find them again.”

Librarian Nora Mattern, Professor Stefano Allesina, and a sketch of a computational pipeline. (Photo by Joel Wintermantle)

At Allesina’s request, Long put him in touch with the Library’s Director of Technical Services, Kristin Martin, who worked with ProQuest to obtain the name, institution, and year of graduation for dissertation authors from the U.S. and Canada from 1993 to 2015.  Allesina is now planning to combine that metadata with publication data from Scopus to track the length and locations of scientists’ careers in academia.

Such a study raises specific reproducibility challenges.  In working on a grant proposal to the National Science Foundation to support this research, Allesina turned to Nora Mattern, Scholarly Communications Librarian, and Debra Werner, Director of Library Research in Medical Education, for advice on how to integrate proprietary data owned by ProQuest and Scopus into the data management plan.  “How much can you share with other scientists?” Allesina asked.  “Can you share some summary statistics of the data?  Can you share de-identified data? If you imagine that someone wants to repeat my analysis of PhD students, will they have sufficient data?” Mattern and Werner helped him to structure the data management plan and to consider the legal implications.

When Allesina came to the United States from Italy, he was surprised at the role he found librarians taking in the digital age.  “Here librarians are thinking forward,” he said.  “Nowadays we have this mass of information. How do we navigate that? How do we organize it? How do we make it searchable? I am always amazed that people can be so helpful. I was dreaming of this data about PhDs, and I talked to Elisabeth, and she said ‘let me look into that.’ After a few weeks, I got gigabytes of data.”

His advice to colleagues: “Run it by a librarian before giving up.”

To consult with a librarian on data management and scientific reproducibility, talk to your Library subject specialist or email data-help@lib.uchicago.edu.