We introduce StethoSpeech, a Silent Speech Interface that transforms flesh-conducted vibrations behind the ear into speech. The innovation aims to improve social interactions for people with voice disorders and to enable discreet communication in public. Unlike prior efforts, StethoSpeech does not require paired speech data for the recorded vibrations. Furthermore, it does not need a specialized recording device and works with an off-the-shelf clinical stethoscope. The novelty of the framework lies in its overall design, its simulation of ground-truth speech, and a sequence-to-sequence translation network that operates in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and the newly proposed StethoText dataset. Our results show that StethoSpeech produces natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. We also demonstrate that it works in extremely noisy scenarios.

| Ground-truth text | Input NAM vibrations | DiscoGAN | MSpec-Net | StethoSpeech (paired) | StethoSpeech (unpaired) |
|---|---|---|---|---|---|
| It's the whole season. | | | | | |
| It is a terrible loss. | | | | | |

| Dataset | Ground-truth text | Input NAM vibrations | DiscoGAN | MSpec-Net | StethoSpeech (unpaired) |
|---|---|---|---|---|---|
| CSTR NAM TIMIT Plus | Please call Stella. | | | | |
| CSTR NAM TIMIT Plus | Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. | | | | |
| s1 (StethoText) | and the next year gunther zeiner at augsburg followed suit; | | | | |
| s1 (StethoText) | and was used there with very little variation all through the sixteenth and seventeenth centuries, and indeed into the eighteenth. | | | | |

| Ground-truth text | Input NAM vibrations | StethoSpeech (paired) | StethoSpeech (unpaired) |
|---|---|---|---|
| I am not retiring. | | | |
| I hated the word. | | | |
| That was a month ago. | | | |
| I think we're going to make it. | | | |
| I now know that from memory. | | | |
| The decision was welcomed by downing street. | | | |

| Ground-truth text | Input NAM vibrations | Generated ground truth (paired) | Generated ground truth (unpaired) |
|---|---|---|---|
| I think we're going to make it. | | | |
| I now know that from memory. | | | |
| They have no other children. | | | |
| That was a month ago. | | | |
| It's the whole season. | | | |

| Ground-truth text | Input NAM vibrations | Generated ground truth | Generated speech in voice 1 | Generated speech in voice 2 |
|---|---|---|---|---|
| It is growing, every day, every hour. | | | | |
| This is the essence of our philosophy. | | | | |
| The lion followed him and overtook the camel. | | | | |
| Lion demanded to know the story. | | | | |
| It has become a way of life. | | | | |
| The crow said that the camel was a domestic animal fit to be killed and eaten. | | | | |

| Ground-truth text | Input NAM vibrations | Generated ground truth | Generated speech in voice 1 | Generated speech in voice 2 |
|---|---|---|---|---|
| I suggest you must offer yourself to the lion. | | | | |
| The jury is still out. | | | | |
| This will help our confidence. | | | | |
| And there was a dog that barked. | | | | |
| He was eager to show his mother, how brave he was. | | | | |
| He kept repeating it, all the way. | | | | |

| Ground-truth text | Unseen speaker | Input NAM vibrations | Generated speech in voice 1 | Generated speech in voice 2 |
|---|---|---|---|---|
| Instead, I must be careful in finding out the source of this noise. | s13 | | | |
| hit the ground and turn into gold. | s12 | | | |
| it is too early to say. | s1 | | | |
| i was not to cry out in the face of fear. | s11 | | | |

| Ground-truth text | Noisy speech | ASR transcript of noisy speech | Noisy NAM vibrations (recorded with a stethoscope) | Generated speech (StethoSpeech, unpaired) | ASR transcript of generated speech |
|---|---|---|---|---|---|
| Please continue your journey. | | good afternoon boys and girls | | | please continue your journey |
| My husband is one example. | | life of baby girl | | | my husband is one example |
| This is no place for you. | | i will give it a shot now | | | this is no place for you |
| Thus speaks our religion. | | that is a good idea | | | thus speaks our religion |
| On this they started looking at each other. | | i | | | on this they started looking at each other |
| But his problem remained. | | but the good stuff can be made | | | but his problem remained |
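
The ASR transcripts in the noisy-scenario table above can be scored with word error rate (WER). The following is a minimal sketch, assuming standard word-level Levenshtein WER on lowercased, punctuation-stripped text; it is not necessarily the scoring behind the reported quantitative results, and the helper names (`normalize`, `edit_distance`, `wer`, `corpus_wer`, `ROWS`) are hypothetical.

```python
import re


def normalize(text: str) -> list[str]:
    # Lowercase, replace everything except letters, apostrophes and spaces, then split into words.
    return re.sub(r"[^a-z' ]", " ", text.lower()).split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # Word-level Levenshtein distance (substitutions + deletions + insertions), one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution (0 if the words match)
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref = normalize(reference)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, normalize(hypothesis)) / len(ref)


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    # Total word errors divided by total reference words across all utterances.
    errors = sum(edit_distance(normalize(r), normalize(h)) for r, h in pairs)
    words = sum(len(normalize(r)) for r, _ in pairs)
    return errors / words


# (ground-truth text, ASR on noisy speech, ASR on generated speech), copied from the table above.
ROWS = [
    ("Please continue your journey.", "good afternoon boys and girls", "please continue your journey"),
    ("My husband is one example.", "life of baby girl", "my husband is one example"),
    ("This is no place for you.", "i will give it a shot now", "this is no place for you"),
    ("Thus speaks our religion.", "that is a good idea", "thus speaks our religion"),
    ("On this they started looking at each other.", "i", "on this they started looking at each other"),
    ("But his problem remained.", "but the good stuff can be made", "but his problem remained"),
]

noisy = corpus_wer([(ref, asr) for ref, asr, _ in ROWS])
generated = corpus_wer([(ref, asr) for ref, _, asr in ROWS])
print(f"noisy speech WER: {noisy:.2f}, generated speech WER: {generated:.2f}")
```

Under this sketch, every ASR transcript of the generated speech matches its reference exactly (WER 0), while the noisy-speech transcripts score above 1.0 overall. WER can exceed 1.0 because insertions count as errors, and several noisy transcripts contain more words than their references.
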

| Ground-truth text | Input NAM vibrations | Generated speech (StethoSpeech, unpaired) |
|---|---|---|
| When you are in a difficult situation. | | |
| Once there was a naughty boy. | | |
| There was once a cowardly fox. | | |
| He decided to teach them a lesson. | | |
| leaving nothing for the poor mice. | | |