Speech Synthesis from Brain Activity


Existing technology that assists people with speech disabilities relies on brain-computer interfaces that translate eye and facial muscle movements into words. However, this translation is slow: approximately 10 words per minute, considerably below the rate of natural speech (about 150 words per minute). Spelling out thoughts letter by letter is understandably laborious, often error-prone and always slower than speaking. Recent work from Edward Chang’s laboratory at the University of California, San Francisco (UCSF) is a step towards restoring speech in patients with ALS, Parkinson’s disease and other neurodegenerative diseases that take away the ability to speak. Chang’s technology works, or at least promises to work, at close to the rate of normal speech.

Prior to this study, Dr. Nima Mesgarani’s team at Columbia University, New York, had attempted to shed light on restoring speech in a study published earlier this year in Scientific Reports. The team recorded neural activity from the auditory cortex of patients as they listened to recordings of short spoken sentences. These data were then used to ‘train’ a computer program which, once trained, could produce spoken output when given neural activity as input. The resulting sounds (reconstructed digits) were comprehensible to a group of listeners. “We found that people could understand and repeat the sounds about 75% of the time, which is well above and beyond any previous attempts,” Dr. Mesgarani said in a Columbia University statement. He added, “In this scenario, if the wearer thinks ‘I need a glass of water,’ our system could take the brain signals generated by that thought, and turn them into synthesized, verbal speech. This would be a game changer. It would give anyone who has lost their ability to speak, whether through injury or disease, the renewed chance to connect to the world around them.”

Tanja Schultz’s team at the University of Bremen in Germany also managed to produce spoken output from a computer program, in this case one that was made to listen to single words. Here, listeners could identify what was being spoken 30-50% of the time.

Despite sharing the goal of the two groups above, Edward Chang’s laboratory took a distinct approach. Dr. Gopala Anumanchipalli and Josh Chartier, two scientists from Chang’s group and authors of a revolutionary speech-science paper published in Nature last month, put the emphasis on the motor movements of the vocal organs, the step that lies between the brain’s speech centers and sound. After all, the brain does not directly produce audible sound. It directs the movements of vocal organs such as the tongue, larynx, lips and jaw, and these precise movements produce distinctive sounds. Chang’s group attempted to model this natural process in order to build a prosthetic speech machine, instead of trying to convert brain activity directly into sound, as earlier work from, for example, Dr. Nima Mesgarani’s group had done.

Edward Chang, UCSF

The Chang lab’s approach took hints from their own previous work, in which they described how the brain centers ‘choreograph’ the movements of several vocal organs to produce speech. “Very few of us have any real idea, actually, of what’s going on in our mouth when we speak. The brain translates those thoughts of what you want to say into movements of the vocal tract, and that’s what we’re trying to decode,” Dr. Chang told Reuters. Acknowledging the limitations of previous studies that attempted to decode speech directly from brain signals, Chang’s group took a two-pronged approach. The first stage went from the brain to the vocal organs, and the second from the vocal organs to audible speech. “The relationship between the movements of the vocal tract and the speech sounds that are produced is a complicated one. We reasoned that if these speech centers in the brain are encoding movements rather than sounds, we should try to do the same in decoding those signals,” Dr. Anumanchipalli told Science Daily.

Courtesy: UCSF Neurosurgery, YouTube

In this two-pronged approach, the first step was to collect an extensive amount of cortical activity data from five human subjects as they spoke several hundred sentences aloud. The data were recorded from brain regions known to control the motor movements of the mouth and throat that produce sound. By matching the neural activity to the movements of the lips, larynx, tongue and other vocal organs, the researchers developed a computer-based system that could decode the neural signal into sound-producing movements. The second step was to generate speech from these decoded movements. The synthesized speech was then played to third-party listeners to test whether the machine-produced sounds were indeed comprehensible. The success rate varied from 31% to 53%. Although this is far from perfect, it puts forward a new method of simulating speech that could be improved. A 50% success rate would, in any case, be a significant improvement when the starting point is complete speechlessness. “We were shocked when we first heard the results – we couldn’t believe our ears. It was incredibly exciting that a lot of the aspects of real speech were present in the output from the synthesizer. Clearly, there is more work to get this to be more natural and intelligible but we were very impressed by how much can be decoded from brain activity,” Josh Chartier, one of the authors, told Reuters.

Courtesy: UCSF Neurosurgery, YouTube
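
For readers who think in code, the two-step idea described above (brain activity to vocal-organ movement, then movement to sound) can be sketched as a chain of two fitted models. This is only a toy illustration, not the study’s method: the numbers are invented, each signal is reduced to a single value, and a simple least-squares line fit stands in for whatever models the researchers actually used. The names fit_line, predict and decode are hypothetical.

```python
# Toy illustration only: invented numbers, one-dimensional signals, and a
# least-squares line fit standing in for the study's actual decoders.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) model to one input value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training recordings made while a participant speaks aloud.
neural = [0.0, 1.0, 2.0, 3.0, 4.0]             # brain activity
movement = [2.0 * x + 1.0 for x in neural]     # vocal-organ movement
sound = [3.0 - 0.5 * m for m in movement]      # resulting acoustic feature

# Step 1: learn brain activity -> movement.
brain_to_movement = fit_line(neural, movement)
# Step 2: learn movement -> sound.
movement_to_sound = fit_line(movement, sound)


def decode(neural_value):
    """Chain both steps: brain activity -> movement -> sound."""
    decoded_movement = predict(brain_to_movement, neural_value)
    return predict(movement_to_sound, decoded_movement)
```

Because the second model only ever sees movements, it does not need to know whose brain produced them; in this toy setting, that separation is what the intermediate step buys.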

Since the participants in this study were able to vocalize, whereas the primary goal is to restore speech in people who cannot vocalize at all, the UCSF team also attempted to synthesize mimed speech. The participants were instructed to mime the words and sentences without making any audible sound.

Although the technology is understandably far from clinical trials, Dr. Anumanchipalli is hopeful. As he told Physics World, “It was really remarkable that we could still generate audio signals from an act that did not create audio at all. If someone can’t speak, then we don’t have a speech synthesizer for that person. We have used a speech synthesizer trained on one subject and driven that by the neural activity of another subject. We have shown that this may be possible.”