StethoSpeech: Speech Generation Through Stethoscopic Microphone Attached To The Skin

Neil Shah¹,², Neha Sahipjohn¹, Vishal Tambrahalli¹, Ramanathan Subramanian³, Vineet Gandhi¹

¹ International Institute of Information Technology, Hyderabad, India

² TCS Research, Pune, India

³ University of Canberra, Australia


We introduce StethoSpeech, a Silent Speech Interface that transforms flesh-conducted vibrations recorded behind the ear into speech. The innovation aims to improve social interactions for those with voice disorders and to enable discreet communication in public. Unlike prior efforts, StethoSpeech does not require paired speech data for the recorded vibrations. Furthermore, it does not need a specialized recording device and can work with an off-the-shelf clinical stethoscope. The novelty of the framework lies in its overall design, its simulation of ground-truth speech, and a sequence-to-sequence translation network that operates in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and the newly proposed StethoText dataset. Our results show that StethoSpeech produces natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. We also demonstrate its ability to work in extremely noisy scenarios.

StethoSpeech's teaser diagram

(a) StethoSpeech converts flesh-conducted vibrations into intelligible speech. (b) StethoSpeech is effective even when extremely loud music is playing in the background. Here, we show the Mel-spectrogram representations of the recorded audio, the stethoscopic vibrations, and the speech converted by StethoSpeech. The output of an Automatic Speech Recognition (ASR) engine is shown on top. The ASR engine fails completely on the noisy audio and on the stethoscopic vibrations, but it correctly transcribes the speech converted by the StethoSpeech framework.

Proposed Architecture

StethoSpeech converts vibrations, also known as Non-Audible Murmur (NAM), into speech. It comprises four parts: a data preparation step that generates ground-truth speech corresponding to the NAM vibrations; a shared speech encoder (pre-trained and frozen) that extracts self-supervised embeddings; a sequence-to-sequence network that maps the self-supervised embeddings of the vibrations to those of speech; and a speech decoder that synthesizes speech from the self-supervised speech embeddings.
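The data flow between these components can be sketched in a few lines of Python. This is only an illustration of how the stages connect, not the actual models: the class name `StethoSpeechSketch`, the toy functions, and the frame length `FRAME` are all hypothetical stand-ins introduced here, and the real encoder, sequence-to-sequence network, and decoder are neural networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Waveform = List[float]          # raw samples (NAM vibrations or speech)
Embeddings = List[List[float]]  # one feature vector per frame


@dataclass
class StethoSpeechSketch:
    """Hypothetical wiring of the stages; every component is a stand-in."""
    encoder: Callable[[Waveform], Embeddings]    # shared, pre-trained, frozen
    seq2seq: Callable[[Embeddings], Embeddings]  # NAM embeddings -> speech embeddings
    decoder: Callable[[Embeddings], Waveform]    # speech embeddings -> waveform

    def training_pair(self, nam: Waveform, target: Waveform) -> Tuple[Embeddings, Embeddings]:
        # The same frozen encoder embeds the NAM input and the generated
        # ground-truth speech; only the seq2seq network would be trained.
        return self.encoder(nam), self.encoder(target)

    def convert(self, nam: Waveform) -> Waveform:
        # Inference path: vibrations -> embeddings -> mapped embeddings -> speech.
        return self.decoder(self.seq2seq(self.encoder(nam)))


FRAME = 4  # toy frame length in samples (assumption, for illustration only)


def toy_encoder(wave: Waveform) -> Embeddings:
    # Stand-in features: mean and peak magnitude of each frame.
    frames = [wave[i:i + FRAME] for i in range(0, len(wave), FRAME)]
    return [[sum(f) / len(f), max(abs(x) for x in f)] for f in frames]


def toy_seq2seq(emb: Embeddings) -> Embeddings:
    # Stand-in mapping: scale every feature by a constant.
    return [[2.0 * v for v in frame] for frame in emb]


def toy_decoder(emb: Embeddings) -> Waveform:
    # Stand-in vocoder: hold each frame's first feature for FRAME samples.
    return [frame[0] for frame in emb for _ in range(FRAME)]


model = StethoSpeechSketch(toy_encoder, toy_seq2seq, toy_decoder)
nam = [0.0, 0.1, 0.2, 0.1, -0.1, -0.2, -0.1, 0.0]
speech = model.convert(nam)
source_emb, target_emb = model.training_pair(nam, nam)
```

Because the encoder is shared and frozen, the sequence-to-sequence network sees inputs and targets in the same embedding space, which is what lets the mapping operate in the latent space rather than on waveforms.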

R1: Comparison of StethoSpeech with the DiscoGAN and MSpec-Net methods, using web-page samples.

Ground-truth text | Input NAM vibrations | DiscoGAN | MSpec-Net | StethoSpeech (paired) | StethoSpeech (unpaired)
It's the whole season.
It is a terrible loss.

R2: Comparison of StethoSpeech with the DiscoGAN and MSpec-Net methods, implemented from open-source code.

Dataset | Ground-truth text | Input NAM vibrations | DiscoGAN | MSpec-Net | StethoSpeech (unpaired)
CSTR NAM TIMIT Plus | Please call stella
CSTR NAM TIMIT Plus | Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
s1 (StethoText) | and the next year gunther zeiner at augsburg followed suit;
s1 (StethoText) | and was used there with very little variation all through the sixteenth and seventeenth centuries, and indeed into the eighteenth.

R3: More samples from our proposed methods on the CSTR NAM TIMIT Plus corpus.

Ground-truth text | Input NAM vibrations | StethoSpeech (paired) | StethoSpeech (unpaired)
I am not retiring.
I hated the word.
That was a month ago.
I think we're going to make it.
I now know that from memory.
The decision was welcomed by downing street.

R4: Ground-truth speech generated on the CSTR NAM TIMIT Plus corpus with the paired method, which uses whisper audio, and with the unpaired method.

Ground-truth text | Input NAM vibrations | Generated ground-truth (paired) | Generated ground-truth (unpaired)
I think we're going to make it.
I now know that from memory.
They have no other children.
That was a month ago.
It's the whole season.

R5: Speech generated for speaker s1 in the StethoText corpus using the StethoSpeech (unpaired) method.

Ground-truth text | Input NAM vibrations | Generated ground-truth | Generated speech in voice 1 | Generated speech in voice 2
It is growing, every day, every hour.
This is the essence of our philosophy.
The lion followed him and overtook the camel.
Lion demanded to know the story.
It has become a way of life.
The crow said that the camel was a domestic animal fit to be killed and eaten.

R6: Speech generated for speaker s2 in the StethoText corpus using the StethoSpeech (unpaired) method.

Ground-truth text | Input NAM vibrations | Generated ground-truth | Generated speech in voice 1 | Generated speech in voice 2
I suggest you must offer yourself to the lion.
The jury is still out.
This will help our confidence.
And there was a dog that barked.
He was eager to show his mother, how brave he was.
He kept repeating it, all the way.

R7: Evaluation of StethoSpeech's generalizability in zero-shot speech synthesis, using the StethoSpeech (unpaired) method.

Ground-truth text | Unseen speaker | Input NAM vibrations | Generated speech in voice 1 | Generated speech in voice 2
Instead, I must be careful in finding out the source of this noise. | s13
hit the ground and turn into gold. | s12
it is too early to say. | s1
i was not to cry out in the face of fear. | s11

R8: Assessing StethoSpeech's robustness: speech synthesis from NAM vibrations recorded with loud music playing in the background.

Ground-truth text | Noisy speech | Predicted text on noisy speech using ASR | Noisy NAM vibrations recorded using stethoscope | Generated speech using StethoSpeech (unpaired) | Predicted text on generated speech
Please continue your journey. | [audio] | good afternoon boys and girls | [audio] | [audio] | please continue your journey
My husband is one example. | [audio] | life of baby girl | [audio] | [audio] | my husband is one example
This is no place for you. | [audio] | i will give it a shot now | [audio] | [audio] | this is no place for you
Thus speaks our religion. | [audio] | that is a good idea | [audio] | [audio] | thus speaks our religion
On this they started looking at each other. | [audio] | i | [audio] | [audio] | on this they started looking at each other
But his problem remained. | [audio] | but the good stuff can be made | [audio] | [audio] | but his problem remained
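
One common way to put a number on the gap between the two ASR columns in R8 is the word error rate (WER): the word-level edit distance between the ASR output and the ground-truth text, divided by the number of ground-truth words. The sketch below is illustrative only. WER is assumed here as the metric rather than taken from this page, the helper names `words` and `wer` are hypothetical, and the split of each row into ground truth, ASR on noisy speech, and ASR on generated speech is inferred from the table.

```python
import re
from typing import List


def words(text: str) -> List[str]:
    """Lowercase and keep letter/apostrophe tokens, so case and punctuation are ignored."""
    return re.findall(r"[a-z']+", text.lower())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference word
                curr[j - 1] + 1,         # insert a hypothesis word
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


# Two rows from R8: (ground truth, ASR on noisy speech, ASR on generated speech).
rows = [
    ("Please continue your journey.",
     "good afternoon boys and girls",
     "please continue your journey"),
    ("On this they started looking at each other.",
     "i",
     "on this they started looking at each other"),
]

for truth, asr_noisy, asr_generated in rows:
    print(f"noisy={wer(truth, asr_noisy):.2f}  "
          f"generated={wer(truth, asr_generated):.2f}  {truth}")
```

WER can exceed 1 when the hypothesis is longer than the reference, as happens when the ASR engine hallucinates extra words on the noisy recording.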

R9: Robustness of the StethoSpeech framework to user motion: speech synthesis from NAM vibrations recorded while walking, for speaker s1 in the StethoText corpus.

Ground-truth text | Input NAM vibrations | Generated speech using StethoSpeech (unpaired)
When you are in a difficult situation.
Once there was a naughty boy.
There was once a cowardly fox.
He decided to teach them a lesson.
leaving nothing for the poor mice.