StethoSpeech: Speech Generation Through Stethoscopic Microphone Attached To The Skin

Neil Shah¹,², Neha Sahipjohn¹, Vishal Tambrahalli¹, Ramanathan Subramanian³, Vineet Gandhi¹

¹International Institute of Information Technology, Hyderabad, India

²TCS Research, Pune, India

³University of Canberra, Australia

Accepted to the Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies Vol. 8, No. 3

View Keynote Presentation

Download StethoText dataset

Access Official Code

Abstract:

We introduce StethoSpeech, a Silent Speech Interface that transforms flesh-conducted vibrations recorded behind the ear into speech. It aims to improve social interactions for people with voice disorders and to enable discreet communication in public. Unlike prior efforts, StethoSpeech does not require paired speech data for the recorded vibrations. Furthermore, it does not need a specialized device for recording the vibrations and can work with an off-the-shelf clinical stethoscope. The novelty of the framework lies in its overall design, its simulation of ground-truth speech, and its sequence-to-sequence translation network, which operates in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and the newly proposed StethoText dataset. Our results show that StethoSpeech produces natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. We also demonstrate that it works in extremely noisy scenarios.

StethoSpeech's teaser diagram

(a) StethoSpeech converts flesh-conducted vibrations into intelligible speech. (b) StethoSpeech remains effective even when extremely loud music is playing in the background. Here, we show Mel-spectrograms of the recorded audio, the stethoscopic vibrations, and the speech converted by StethoSpeech. The output of an Automatic Speech Recognition (ASR) engine is shown on top of each. The ASR engine fails completely on both the noisy audio and the stethoscopic vibrations, but it correctly transcribes the speech converted by the StethoSpeech framework.

Proposed Architecture

StethoSpeech converts vibrations, also known as Non-Audible Murmur (NAM), into speech. It comprises four parts: a data preparation step that generates ground-truth speech corresponding to the NAM vibrations, a shared speech encoder (pre-trained and frozen) that extracts self-supervised embeddings, a sequence-to-sequence network that maps the self-supervised embeddings of vibrations to those of speech, and a speech decoder that synthesizes speech from the self-supervised speech embeddings.
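The conversion path described above (encoder, then sequence-to-sequence network, then decoder) can be pictured as a simple composition of three stages. The following is only a minimal sketch of that data flow, not the authors' implementation: the class name `StethoSpeechSketch` and the three `toy_*` functions are hypothetical stand-ins, the real components are neural networks, and the data preparation step is left out because the sketch covers only the conversion path.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]      # raw samples (NAM vibrations or speech)
Frame = List[float]         # one embedding vector per time frame


@dataclass
class StethoSpeechSketch:
    """Hypothetical wiring of the three conversion stages."""
    encoder: Callable[[Waveform], List[Frame]]     # shared, pre-trained, frozen
    seq2seq: Callable[[List[Frame]], List[Frame]]  # NAM embeddings -> speech embeddings
    decoder: Callable[[List[Frame]], Waveform]     # speech embeddings -> waveform

    def convert(self, nam: Waveform) -> Waveform:
        nam_embeddings = self.encoder(nam)
        speech_embeddings = self.seq2seq(nam_embeddings)
        return self.decoder(speech_embeddings)


# Toy stand-ins (assumptions) so that the sketch runs end to end.
def toy_encoder(wave: Waveform, hop: int = 4) -> List[Frame]:
    """Cut the waveform into frames of `hop` samples, dropping a partial tail."""
    return [wave[i:i + hop] for i in range(0, len(wave) - hop + 1, hop)]


def toy_seq2seq(frames: List[Frame]) -> List[Frame]:
    """Placeholder mapping between embedding spaces: scale every value by 2."""
    return [[2.0 * x for x in frame] for frame in frames]


def toy_decoder(frames: List[Frame]) -> Waveform:
    """Placeholder synthesis: concatenate the frames back into one sequence."""
    return [x for frame in frames for x in frame]


model = StethoSpeechSketch(toy_encoder, toy_seq2seq, toy_decoder)
speech = model.convert([float(i) for i in range(10)])
print(speech)
```

Because the encoder is shared and frozen, only the middle stage would need to be trained in such a design; the stand-ins here simply make the flow of data between stages visible.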

R1: Comparison of StethoSpeech with DiscoGAN and MSpec-Net, using the web-page samples of those methods.

Ground-truth text	Input NAM vibrations	DiscoGAN	MSpec-Net	StethoSpeech (paired)	StethoSpeech (unpaired)
It's the whole season.
It is a terrible loss.

R2: Comparison of StethoSpeech with DiscoGAN and MSpec-Net, implemented from open-source code.

Dataset	Ground-truth text	Input NAM vibrations	DiscoGAN	MSpec-Net	StethoSpeech (unpaired)
CSTR NAM TIMIT Plus	Please call stella
CSTR NAM TIMIT Plus	Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
s1 (StethoText)	and the next year gunther zeiner at augsburg followed suit;
s1 (StethoText)	and was used there with very little variation all through the sixteenth and seventeenth centuries, and indeed into the eighteenth.

R3: More samples from our proposed methods on the CSTR NAM TIMIT Plus corpus.

Ground-truth text	Input NAM vibrations	StethoSpeech (paired)	StethoSpeech (unpaired)
I am not retiring.
I hated the word.
That was a month ago.
I think we're going to make it.
I now know that from memory.
The decision was welcomed by downing street.

R4: Ground-truth speech generated on the CSTR NAM TIMIT Plus corpus with the paired method, which uses whisper audio, and with the unpaired method.

Ground-truth text	Input NAM vibrations	Generated Ground-truth (paired)	Generated Ground-truth (unpaired)
I think we're going to make it.
I now know that from memory.
They have no other children.
That was a month ago.
It's the whole season.

R5: Generated speech from speaker s1 in the StethoText corpus using the StethoSpeech (unpaired) method.

Ground-truth text	Input NAM vibrations	Generated Ground-truth	Generated Speech in voice 1	Generated Speech in voice 2
It is growing, every day, every hour.
This is the essence of our philosophy.
The lion followed him and overtook the camel.
Lion demanded to know the story.
It has become a way of life.
The crow said that the camel was a domestic animal fit to be killed and eaten.

R6: Generated speech from speaker s2 in the StethoText corpus using the StethoSpeech (unpaired) method.

Ground-truth text	Input NAM vibrations	Generated Ground-truth	Generated Speech in voice 1	Generated Speech in voice 2
I suggest you must offer yourself to the lion.
The jury is still out.
This will help our confidence.
And there was a dog that barked.
He was eager to show his mother, how brave he was.
He kept repeating it, all the way.

R7: Generalizability of StethoSpeech to unseen speakers: zero-shot speech synthesis using the StethoSpeech (unpaired) method.

Ground-truth text	Unseen speaker	Input NAM vibrations	Generated speech in voice 1	Generated speech in voice 2
Instead, I must be careful in finding out the source of this noise.	s13
hit the ground and turn into gold.	s12
it is too early to say.	s1
i was not to cry out in the face of fear.	s11

R8: Robustness of StethoSpeech to noise: speech synthesis from NAM vibrations recorded in a loud background-music environment.

Ground-truth text	Noisy speech	Predicted text on noisy speech using ASR	Noisy NAM vibrations recorded using stethoscope	Generated speech using StethoSpeech (unpaired)	Predicted text on generated speech
Please continue your journey.		good afternoon boys and girls			please continue your journey
My husband is one example.		life of baby girl			my husband is one example
This is no place for you.		i will give it a shot now			this is no place for you
Thus speaks our religion.		that is a good idea			thus speaks our religion
On this they started looking at each other.		i			on this they started looking at each other
But his problem remained.		but the good stuff can be made			but his problem remained
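
The gap between the two ASR columns above can be put into numbers with a word error rate (WER): the word-level edit distance between the ground-truth text and the ASR prediction, divided by the number of ground-truth words. The snippet below is an illustrative sketch only; the function names and the normalization (lowercasing and stripping ASCII punctuation) are assumptions and not necessarily the metric or preprocessing used in the paper.

```python
import string
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split into words (assumed normalization)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


truth = "Please continue your journey."
wer_noisy = word_error_rate(truth, "good afternoon boys and girls")
wer_generated = word_error_rate(truth, "please continue your journey")
print(wer_noisy, wer_generated)  # 1.25 0.0
```

A WER above 1.0, as in the first row, is possible because the prediction contains more words than the reference and none of them match.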

R9: Robustness of the StethoSpeech framework to user motion: speech synthesis from NAM vibrations recorded while walking, for speaker s1 in the StethoText corpus.

Ground-truth text	Input NAM vibrations	Generated Speech using StethoSpeech (unpaired)
When you are in a difficult situation.
Once there was a naughty boy.
There was once a cowardly fox.
He decided to teach them a lesson.
leaving nothing for the poor mice.