SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

Authors

Mohammad Soleymanpour (University of Kentucky) m.soleymanpour@uky.edu
Michael T. Johnson (University of Kentucky) mike.johnson@uky.edu
Rahim Soleymanpour (University of Connecticut) rahim.soleymanpour@uconn.edu
Jeffrey Berry (Marquette University) jeffrey.berry@marquette.edu

Abstract:

Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. Robust dysarthria-specific ASR requires a sufficient amount of training speech, which is not readily available. Recent advances in multi-speaker end-to-end text-to-speech (TTS) synthesis suggest the possibility of using synthesis for data augmentation. In this paper, we aim to improve multi-speaker end-to-end TTS systems to synthesize dysarthric speech for improved training of a dysarthria-specific DNN-HMM ASR. To the synthesis controls we add dysarthria severity level and pause insertion mechanisms, alongside other control parameters such as pitch, energy, and duration. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that adding the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of these parameters.
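
For readers unfamiliar with the metric, word error rate (WER) is conventionally computed as the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in the paper; the function name `word_error_rate` is chosen here for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("this is the pad", "this is a pad"))
```

Because the count of errors is normalized by the reference length only, WER can exceed 1.0 when the hypothesis contains many inserted words.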

Dysarthria Severity Level

Abbreviations: Pitch Control: PC, Energy Control: EC, Duration Control: DC, Dysarthria Severity Control: SC


Speaker M02:
	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 0.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 1.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 2.0]
		Text: "bad sad dad bat bit bet pad"

Speaker M04:
	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 0.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 1.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 2.0]
		Text: "bad sad dad bat bit bet pad"

Speaker MC02:
	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 0.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 1.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 2.0]
		Text: "This is the pad"

Speaker MC04:
	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 0.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 1.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 2.0]
		Text: "We like to play volleyball"

Pause Insertion


Speaker M02:
	Number of pauses = None	Number of pauses = 1	Number of pauses = 2
		Text: "How we can synthesize better dysarthric speech?"

Speaker M05:
	Number of pauses = None	Number of pauses = 1	Number of pauses = 2
		Text: "How we can synthesize better dysarthric speech?"
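
As a rough illustration of what a pause insertion control could look like at the text level, the sketch below places a requested number of pause markers at word boundaries of the input sentence. Everything here is an assumption made for illustration: the `<pause>` token, the `insert_pauses` function, and the random choice of positions are hypothetical, and this page does not describe how the actual system selects or realizes pauses.

```python
import random

PAUSE_TOKEN = "<pause>"  # hypothetical marker, not taken from the source


def insert_pauses(text, num_pauses, seed=None):
    """Insert `num_pauses` pause markers at distinct word boundaries.

    Positions are drawn at random (an assumption); pauses are never placed
    before the first word or after the last word.
    """
    words = text.split()
    boundaries = list(range(1, len(words)))  # index of the word that follows the pause
    if num_pauses < 0 or num_pauses > len(boundaries):
        raise ValueError("num_pauses must fit the available word boundaries")

    rng = random.Random(seed)
    chosen = set(rng.sample(boundaries, num_pauses))

    out = []
    for index, word in enumerate(words):
        if index in chosen:
            out.append(PAUSE_TOKEN)
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    sentence = "How we can synthesize better dysarthric speech?"
    for n in (0, 1, 2):
        print(n, "->", insert_pauses(sentence, n, seed=0))
```

Passing a fixed `seed` makes the placement reproducible, which is convenient when comparing samples that differ only in the number of pauses.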

Duration, Energy, and Pitch controls at a fixed severity level


Duration(Speaker M05):
	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 2.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.3, SC: 2.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.6, SC: 2.0]
		Text: "Bad and good"

Energy(Speaker M05):
	Coeff = [PC: 1.0, EC: 0.5, DC: 1.0, SC: 2.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 2.0]	Coeff = [PC: 1.0, EC: 2.0, DC: 1.0, SC: 2.0]
		Text: "Bad and good"

Pitch(Speaker M05):
	Coeff = [PC: 0.5, EC: 1.0, DC: 1.0, SC: 2.0]	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 2.0]	Coeff = [PC: 2.0, EC: 1.0, DC: 1.0, SC: 2.0]
		Text: "Bad and good"
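
One plausible reading of the coefficients listed above is that PC, EC, and DC act as multiplicative scale factors on per-phoneme pitch, energy, and duration values, so that 1.0 leaves a value unchanged. The sketch below illustrates that reading only; it is an assumption, and the names `ControlCoeffs` and `apply_controls` are hypothetical. SC is carried along but not applied, because this page does not say how the severity coefficient enters the model.

```python
from dataclasses import dataclass


@dataclass
class ControlCoeffs:
    pc: float = 1.0  # pitch control
    ec: float = 1.0  # energy control
    dc: float = 1.0  # duration control
    sc: float = 0.0  # severity control (not applied in this sketch)


def apply_controls(pitch, energy, durations, coeffs):
    """Scale per-phoneme values, assuming multiplicative coefficients."""
    scaled_pitch = [p * coeffs.pc for p in pitch]
    scaled_energy = [e * coeffs.ec for e in energy]
    # Durations are frame counts, so round and keep at least one frame.
    scaled_durations = [max(1, round(d * coeffs.dc)) for d in durations]
    return scaled_pitch, scaled_energy, scaled_durations


if __name__ == "__main__":
    # Made-up values for three phonemes, for illustration only.
    pitch = [100.0, 200.0, 150.0]
    energy = [0.5, 1.0, 0.75]
    durations = [3, 5, 2]
    coeffs = ControlCoeffs(pc=1.0, ec=1.0, dc=1.6, sc=2.0)
    print(apply_controls(pitch, energy, durations, coeffs))
```

Under this assumption, raising DC from 1.0 to 1.6 lengthens every phoneme by roughly 60 percent, which matches the intuition of slower speech in the Duration samples.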

Other Observation

In the following synthesized sample, we noticed a stutter: Speaker M05 repeats the phoneme /b/ before "best", which may be one of the characteristics of dysarthric speech.


Speaker M05:
	Coeff = [PC: 1.0, EC: 1.0, DC: 1.0, SC: 2.0]
	Text: "He is one of the best basketball players"