Data Sciences for the Life Sciences


From ResearchNext, the Research Digest of UMass Amherst

Statistical, life science, and social science researchers gathered for a workshop on “Data Sciences for the Life Sciences in a High Performance Computing Environment” in February: the first formal opportunity for the researchers to learn how to effectively utilize the MGHPCC facility.

A group of more than 40 statistical, life science, and social science researchers braved the cold and dark of an early February morning to be part of two firsts in regional high performance computing efforts in the life sciences. Their participation in the workshop “Data Sciences for the Life Sciences in a High Performance Computing Environment” was the first formal opportunity for life sciences researchers to learn how to use the Massachusetts Green High Performance Computing Center (MGHPCC) facility in Holyoke effectively for their research. The workshop was also the first offering of the new Biostatistics in Practice series, jointly sponsored by UMass Amherst and the MGHPCC.

These firsts reflect the increasing role of statistical, mathematical, and computational research methods in life science research as well as the increasing need for researchers to work with large-scale data from diverse sources. Campus sponsors of the series are the Graduate Program in Biostatistics and the UMass Institute for Computational Biology, Biostatistics, and Bioinformatics (ICB3).

Workshop participants included faculty and graduate students from UMass Amherst as well as other colleges, universities, and healthcare organizations, all eager to access the MGHPCC and learn state-of-the-art tools from a teaching team drawn from UMass Amherst Biostatistics faculty. Assistant professor and workshop director Nicholas Reich; ICB3 director and head of biostatistics Andrea Foulkes; and lecturer Gregory Matthews each delivered modules that together provided a foundational curriculum on statistical computing in a high performance computing environment using R, an open-source and freely available statistical programming language, along with practical experience in using it on the MGHPCC platform. In addition to the instructors, teaching assistants provided a high level of individual support for workshop participants.

R is rapidly being adopted as the programming language of choice for researchers at the intersection of life and statistical sciences. As an open-source language, it is affordable to researchers regardless of budget and facilitates sharing of the code for new techniques and tools. This flexibility seems to be especially valuable to researchers in biostatistics, bioinformatics, and computational biology, who are being challenged to invent new approaches and methods with the analytical power to draw insights from “big data,” which can be too voluminous or complex to be understood using conventional methods alone.

On a practical level, high-performance computing is another key to drawing insight from data that is measured in terabytes or petabytes and can far exceed both the computing and storage capacities of even the most powerful desktop computers. MGHPCC director John Goodhue explains, “Moving and processing big data with conventional methods is a bit like asking a single person to move and read every book in the Library of Congress with a hand truck and reading lamp. The MGHPCC is equipped for high bandwidth communications to allow data to be moved in or out expeditiously and to connect many computing ‘cores’ so they can work ‘in parallel’ to handle large and/or complex datasets, thus making it possible for researchers to run programs that analyze the data in a reasonable period of time: hours or days instead of weeks.” Participants were able to tour the MGHPCC facility to gain a better appreciation of its thoughtful design as an environmentally responsible high-performance computing resource as well as its research capacity.

Speaking about the changing role of statistical and computational methodologies in life science research, Andrea Foulkes notes, “Biomedical researchers are able to generate large quantities of data providing in-depth coverage within and across individuals. The MGHPCC offers state-of-the-art computational resources for data management and processing which, coupled with powerful R tools, enable researchers to turn data into knowledge.”

Looking forward to future offerings, Reich sees many opportunities. “We were pleased that the workshop was fully subscribed well in advance of the meeting. Given the level of interest we will consider holding it again soon, perhaps as early as this summer. We have other topics in mind as well, and with the start-up of the UMass Institute for Applied Life Sciences we expect the list to grow.”

All images courtesy of the School of Public Health and Health Sciences, UMass Amherst.

Story by Karen Lauter-Utgoff