The Data-Intensive Social Science Center (DISSC) serves as a hub for social scientists at Yale who work with big data and secure datasets. The center provides support throughout the research lifecycle—assisting researchers in identifying appropriate data sources, establishing secure computing environments, and navigating applications for restricted data access.
A core aspect of DISSC’s mission is to ensure that large datasets are structured in ways that enable researchers to work with them efficiently. One notable example is the L2 voter database, which arrives as over 7 terabytes of raw data.
DISSC collaborated with the Yale Center for Research Computing (YCRC) to develop automated Nextflow pipelines that transform this raw dataset into optimized columnar formats, which significantly improves query performance. An Apache Spark cluster was also set up so that processing power can scale horizontally across machines, which is essential when handling datasets of this magnitude.
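To illustrate why a columnar layout helps query performance, here is a minimal Python sketch. It is not DISSC's pipeline (which is built with Nextflow, and whose exact output format is not named here); it only shows the underlying idea using the standard library. The sample data and the field names `voter_id`, `state`, `birth_year`, and `party` are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a raw, row-oriented voter file.
RAW = """voter_id,state,birth_year,party
1,CT,1980,D
2,NY,1975,R
3,CT,1992,U
4,CT,1968,D
"""

def to_columns(text):
    """Pivot row-oriented CSV text into a dict of column name -> list of values."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns

def count_by(columns, name):
    """Aggregate a single column without reading any of the others."""
    return Counter(columns[name])

columns = to_columns(RAW)
state_counts = count_by(columns, "state")
print(dict(state_counts))  # {'CT': 3, 'NY': 1}
```

Once the data is pivoted, a query that needs only `state` reads one list and ignores the rest. On-disk columnar formats apply the same principle at scale, which is why a query touching a few fields of a multi-terabyte file can skip most of the data.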
In another pilot initiative, DISSC explored cloud-based querying for large-scale datasets. Working again with YCRC, the team established an ODBC connection that enables researchers to run queries against cloud-hosted datasets and retrieve results for local analysis at YCRC.
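The query-then-retrieve workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual DISSC setup: an in-memory SQLite database stands in for the cloud-hosted dataset so the example runs anywhere, and the `voters` table and the helper `run_remote_query` are hypothetical. With a real ODBC connection, the connection object would instead come from a driver such as `pyodbc`, which exposes the same Python DB-API cursor interface.

```python
import sqlite3

def run_remote_query(conn, sql, params=()):
    """Run a query on a DB-API connection and return (column_names, rows)."""
    cur = conn.cursor()
    cur.execute(sql, params)
    names = [d[0] for d in cur.description]
    rows = cur.fetchall()
    cur.close()
    return names, rows

# Stand-in for the cloud-hosted dataset (assumption: a hypothetical voters table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE voters (voter_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO voters VALUES (?, ?)",
    [(1, "CT"), (2, "NY"), (3, "CT")],
)

# The aggregation runs where the data lives; only the small result comes back.
names, rows = run_remote_query(
    conn,
    "SELECT state, COUNT(*) AS n FROM voters GROUP BY state ORDER BY state",
)
print(names, rows)  # ['state', 'n'] [('CT', 2), ('NY', 1)]
```

The design point is that the heavy computation happens on the remote side, so only a small summary table has to travel to the researcher's local environment for further analysis.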
DISSC’s overarching goal is to let researchers concentrate on their research rather than on the complexities of data infrastructure.