The Trickiness of Talking to Computers


by Helen Hill for MGHPCC

James Glass is a senior research scientist at the Massachusetts Institute of Technology. Glass leads the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL). His research focuses on automatic speech recognition, unsupervised speech processing, and spoken language understanding. This past spring, assisted by graduate student David Harwath, Glass taught MIT’s 6.345/HST.728 Automatic Speech Recognition class. This year, for the first time, students had the option of using high performance computing resources at the MGHPCC to facilitate their work.

6.345/HST.728 is a graduate-level course that introduces students to the rapidly evolving field of speech recognition and spoken language processing. The first half of the class covers concepts and computational techniques through traditional lectures, labs, and problem sets, with the accompanying computation readily accommodated on MIT’s public Athena computing network. The second half comprises a final term project that is typically much more computationally demanding.

As part of the course curriculum, the students chose a current research topic to explore. Some students chose to write programs capable of automatically recognizing the language that a person was speaking, while other students created systems that were able to infer the emotional state or personality traits of a speaker. Because most of the projects relied on data-hungry and computationally intensive statistical machine learning algorithms, the MGHPCC was key in enabling the students to complete their term projects.

“Having access to the MGHPCC allowed the students to use more sophisticated models involving more complicated elements. Many of the projects draw on machine learning techniques reliant on leveraging large quantities of data to train the models. In the past we let students use our group’s facilities but having recently redesigned the course to accommodate a curriculum shift more towards deep neural network models we realized, going forward, students really needed something bigger,” says Glass.

Fortunately, a timely comment from one of his students, who mentioned her great experience using the MGHPCC on a different project, led Glass to contact Christopher Hill, the Director of MIT’s Research Computing Project, whose team then worked with Harwath to give class members access to the resources they needed.

“Some of the students in the class had access to their own computing resources, but for those who didn’t, the availability of a facility internal to MIT, with lots of pre-installed libraries they could leverage, was terrific. Of course we had other options. For example, some other classes have used Amazon Cloud, but for our purposes this seemed like a much more natural set-up and one we are eager to repeat.”

“Siri. Alexa. Voice recognition software has reached a tipping point,” Glass tells me. “Nonetheless, there is still plenty more room for improvement.”

“The ability to speak and use language is a critical skill for machines to master,” he says, “and it’s a very hard problem because speech is a signal that gets contaminated by noise. The physics of everybody’s vocal tract is different, your linguistic background, your dialect. The sound of your voice changes with the situation you are in, your emotional state, whether you are inside or outside. The speech signal when you say the exact same thing, it’s never ever the same. Interpreting context, deconstructing dialogue: giving the students serious HPC tools to work with takes what we can teach them to a new level.”

About the Researchers

David Harwath (left) and James Glass – image credit: Helen Hill

James Glass is a senior research scientist at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), where he heads the Spoken Language Systems Group. He is also a lecturer in the Harvard-MIT Division of Health Sciences and Technology. His primary research interests are in speech communication and human-computer interaction, centered on automatic speech recognition and spoken language understanding.

David Harwath is a graduate student in Glass’s group whose research combines speech and visual perception, based on collected data of people talking about pictures.


James Glass

Spoken Language Systems Group, MIT