Imagine a data platform that can help improve community resilience to natural disasters, avoid potential supply chain disruptions and accurately predict infectious disease outbreaks.
Those are among the goals of a new data platform being developed by the University of Michigan’s Institute for Social Research (ISR), which was awarded a $38 million investment from the National Science Foundation (NSF) earlier this year.
The new data platform will enable researchers in multiple fields to more effectively collect, store and secure vital information for their studies. In the past, many researchers have faced obstacles such as incompatible data standards, missing or error-filled information and technical difficulties in managing large datasets.
The $38 million investment by the NSF is enabling the Institute for Social Research to establish the Research Data Ecosystem: A National Resource for Reproducible, Robust and Transparent Social Science Research in the 21st Century. ISR will oversee the creation of new data archives and software that researchers can use to access, organize, analyze and contribute data.
“The Research Data Ecosystem (RDE) is a five-year project and is expected to be completed by the end of 2026,” explained Jeannette Jackson, managing director of the RDE.
The work on RDE began on January 17, 2022, and is now in the early stages of construction.
“The first products will be available in 2024,” Jackson noted. “The end result will be a flexible data management system with a user-friendly interface that will enable researchers to deposit, search for, make use of the cloud to work with their data and disseminate their data in a safe and secure environment. The ultimate goal is to make it easy for researchers to find data and create new knowledge.”
An urgent need for better quality research data
The Research Data Ecosystem infrastructure project was initiated because ISR recognized the need to provide better data management and analytics support for researchers engaged in cutting-edge social science, Jackson said. ISR is the largest academic social science survey and research organization in the world. The RDE work is situated within ISR at the Inter-university Consortium for Political and Social Research (ICPSR), the world’s largest social science archive specializing in curated data.
“RDE is a transformative infrastructure project that will modernize the ICPSR software platform and develop an integrated suite of software tools to advance research in the social and behavioral sciences with a focus on the democratization of data,” according to Margaret “Maggie” Levenstein, director of ICPSR and principal investigator for the RDE.
Per Levenstein, the RDE will enable:
- Interoperability: An integrated system for the entire research data lifecycle, so that work done early in the data lifecycle is useful at later stages, making it possible to integrate data from different sources.
- Reproducibility: Making it easier to reproduce and build on prior research results by being able to find and reuse data and code.
- Transparency: Providing information about provenance, including source, code and method of collection for research data.
- Efficiency of data sharing: Reducing burden on data producers in sharing data and ensuring that shared data are FAIR (findable, accessible, interoperable, reusable).
- Confidentiality protection: Protecting confidentiality while increasing research access.
To achieve these goals, the project will develop the Research Data Description Framework for describing different research data lifecycle events. This is a metadata specification similar to the Resource Description Framework, Levenstein said.
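The article does not publish the Research Data Description Framework itself, so the following is only a rough sketch of the general idea behind an RDF-style description: each lifecycle event is recorded as a subject, predicate, object triple. Every identifier and predicate name here (`dataset:example-survey`, `wasCollectedBy`, and so on) is a hypothetical placeholder, not part of any RDE specification.

```python
from collections import namedtuple

# A triple in the RDF sense: (subject, predicate, object).
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Hypothetical lifecycle events for one dataset; all names are illustrative
# assumptions, not taken from the RDE project.
triples = [
    Triple("dataset:example-survey", "wasCollectedBy", "method:phone-survey"),
    Triple("dataset:example-survey", "wasCuratedBy", "org:example-archive"),
    Triple("dataset:example-survey", "wasAnalyzedWith", "code:example-script"),
    Triple("code:example-script", "wasWrittenIn", "language:python"),
]


def describe(subject, triples):
    """Return the (predicate, object) pairs recorded for a subject."""
    return [(t.predicate, t.obj) for t in triples if t.subject == subject]


# List every lifecycle event recorded for the example dataset.
events = describe("dataset:example-survey", triples)
for predicate, obj in events:
    print(predicate, "->", obj)
```

Because every statement shares the same three-part shape, events recorded at collection time can be queried the same way as events recorded at analysis time, which is the kind of interoperability across the lifecycle that Levenstein describes.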
“RDE will include stand-alone functional components for each stage of the research lifecycle that will be interoperable with one another and with key existing global research infrastructure,” Levenstein said. “The platform will support social and behavioral science researchers using traditional (e.g., survey and experimental) and novel (e.g., digital trace, imaging) types of data over the entire research lifecycle, from data collection to analysis to sharing to rediscovery and re-analysis.”
This infrastructure will improve the quality, integrity and safety of data. It will also increase accessibility to data and collaboration between users across social science and behavioral science disciplines. It will do so with a user interface designed to make data more accessible across the board, Levenstein said.
Turning mountains of data into nuggets of insight
The new RDE platform seeks to solve a problem shared by virtually every industry: organizations collect mountains of data in systems that don’t always communicate with each other, which makes it difficult to find meaningful insights.
“ICPSR began constructing digital archives for social science data in the 1960s to preserve and disseminate the novel data that ISR researchers were creating,” Jackson said. “At that time, each dataset was created with its own bespoke framework, permissions, metadata, etc.”
Since then, advances in ISR’s ability to collect data have led to a massive influx of different data types and sizes. Once the ICPSR software platform is modernized, these datasets can be linked to inform research within the social sciences.
“Using bespoke environments is extremely expensive in terms of time and money for both researchers and data providers,” Jackson said. “The resulting data are not interoperable with other parts of the research ecosystem. This increases a researcher’s burden and reduces the quality, transparency and reproducibility of research. RDE will accomplish these efficiently, at scale and in a way that enhances the scientific standards of social science research.”
The RDE platform is being built upon a new infrastructure (OpenShift/Kubernetes) with updated cloud-native technologies. The platform consists of a set of shared services which cover functions including ingest, curation, search, dissemination, preservation, authentication and authorization.
“The platform will improve the quality of data-driven social and behavioral science research over the entire data lifecycle,” Levenstein said. “This, in combination with a human-centered design interface, will enable researchers across disciplines to conduct their work more efficiently and to create, organize, archive, access and analyze data in ways that they cannot with existing infrastructure. The new infrastructure will also facilitate interactions between other parts of the research ecosystem through a system of APIs.”
The broader goals of social research
The NSF has invested in the new data platform to help advance social science research capabilities, with the aim of benefiting all citizens.
“Research in the social, behavioral and economic sciences aims to improve understanding of human behavior: how we create, respond to and are shaped by the natural and social worlds,” Jackson said. “Progress in the social sciences enables effective, high-quality decision-making – by individuals, parents and families, civic participants and civil society organizations, businesses and evidence-based policymakers.”
An empirical renaissance across the social sciences – in which scientists are using new computational methods, new experimental approaches and new data sources – has transformed our understanding of human society, from the determinants of inequality to how children learn to read, Jackson stressed.
“These innovations in knowledge were enabled by researchers who gained access to large, novel data – digital traces of human activity – which they plumbed for new insights. NSF has recognized that data abundance creates enormous opportunities: harnessing the Data Revolution is one of its priorities,” Jackson said.
NSF has made considerable investments in ICPSR throughout its history, including facilitating the move from tape drives to the internet.
“We believe that in addition to bolstering the investments they have already made in the social science archives at ICPSR that NSF now recognizes the need to invest in the ability to work with bigger, more connected data in the cloud,” Jackson said.
To understand the significance of the investment, Jackson shared an example.
“Imagine you would like to study a particular ZIP code that is known to have specific adverse health conditions. You could come to ICPSR and safely and securely identify all sorts of studies and data from this ZIP code (EEG data, survey data, video data, geospatial data, criminal justice data, educational data, etc.),” she said. “You could then conduct research in the cloud in a way that has never been possible before. RDE, once built, and in conjunction with the work being done at ICPSR to curate data, will enable the research community at all levels to do just that.”
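Jackson’s ZIP code scenario can be pictured with a small sketch, assuming a catalog in which every study carries a ZIP code and a data type. The catalog entries, field names and the `studies_by_type` helper below are invented for illustration; the article does not describe the actual RDE search interface.

```python
from collections import defaultdict

# Hypothetical catalog entries; titles, ZIP codes and types are made up.
catalog = [
    {"title": "Household health survey", "zip": "00001", "type": "survey"},
    {"title": "School outcomes file", "zip": "00001", "type": "educational"},
    {"title": "Neighborhood map layers", "zip": "00002", "type": "geospatial"},
    {"title": "Clinic EEG recordings", "zip": "00001", "type": "EEG"},
]


def studies_by_type(catalog, zip_code):
    """Group the titles of studies covering zip_code by their data type."""
    grouped = defaultdict(list)
    for entry in catalog:
        if entry["zip"] == zip_code:
            grouped[entry["type"]].append(entry["title"])
    return dict(grouped)


# Find every kind of data available for one ZIP code.
result = studies_by_type(catalog, "00001")
for data_type, titles in result.items():
    print(data_type, titles)
```

The lookup only works because all entries share the same field names, which is the point of replacing bespoke per-dataset frameworks with common metadata.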