Using noisy data for speaker adaptation in HMM-based Speech Synthesis

Training a good-quality synthetic voice requires several hours of speech recorded in good conditions.
With a speaker-adaptive framework, we can first build a good-quality average voice from hours of speech and then adapt it to sound like a target speaker using only a few minutes of that speaker's speech.
In our work, we investigated the noise tolerance of HMM adaptation techniques: how does background noise in the adaptation data affect the adaptation of the mel-cepstral (MCEP) components? These samples were produced with standard CSMAPLR adaptation, with parameters optimised for noisy source data; no specific noise-robust techniques were used.

Training data samples * →	Clean speech F: M:	Babble-corrupted speech F: M:	Factory-corrupted speech F: M:

Average voices F: M:	↓	↓	↓
↳	CSMAPLR adaptation with 100 sentences	CSMAPLR adaptation with 100 sentences	CSMAPLR adaptation with 100 sentences
	↓	↓	↓
Synthesised samples → (Synthesised MCEP, original F0)	F: M:	F: M:	F: M:

*) This particular sample is from the test set and was not used in adaptation.
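
As an illustration of how noise-corrupted data such as the babble and factory samples above could be prepared, the following Python sketch mixes a noise recording into clean speech at a chosen signal-to-noise ratio (SNR). It is not taken from the experiments described here: the additive mixing, the helper names rms and mix_at_snr, and the 10 dB value in the usage note are all assumptions made for the example, and the referenced papers should be consulted for the actual setup.

```python
import math


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech so that the mixture has the requested SNR (dB).

    Assumption: simple additive corruption. The noise is looped or
    truncated so that it covers the whole clean utterance.
    """
    if not clean or not noise:
        raise ValueError("clean and noise must be non-empty")
    # Loop the noise recording until it is at least as long as the speech.
    repeats = -(-len(clean) // len(noise))  # ceiling division
    noise = (list(noise) * repeats)[:len(clean)]
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        raise ValueError("noise is silent")
    # Gain that places the noise snr_db below the speech level.
    gain = rms(clean) / (noise_rms * 10 ** (snr_db / 20.0))
    return [s + gain * n for s, n in zip(clean, noise)]
```

With hypothetical sample lists clean and babble, a call such as mix_at_snr(clean, babble, 10.0) returns a corrupted signal of the same length as clean, in which the added noise sits 10 dB below the speech level.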

For details see:

R. Karhila, U. Remes, and M. Kurimo, "HMM-based speech synthesis adaptation using noisy data: Analysis and evaluation methods," in Proc. ICASSP, 2013.
R. Karhila, U. Remes, and M. Kurimo, "Noise in HMM-based speech synthesis adaptation: Analysis, evaluation methods and experiments," IEEE Journal of Selected Topics in Signal Processing, published online 15 August 2013, doi: 10.1109/JSTSP.2013.2278492.