Mixed-language synthesis for Indian languages with dual acoustic models and probabilistic cross-lingual speaker adaptation

R. Karhila1, D. Gowda1, M. Gibson2, O. Watts3, A. Suni4, M. Kurimo1
1 Aalto University, 2 Nuance Communications Inc., 3 University of Edinburgh, 4 University of Helsinki

The basic task of the annual Blizzard Challenge is to take the released speech data, build synthetic voices, and synthesise a prescribed set of test sentences. The 2014 challenge was to build synthetic voices for six Indian languages. An additional "spoke" task was to synthesise bilingual sentences, with English words or phrases embedded in the Indian-language utterances. Here we present samples from the spoke-task systems we submitted for Hindi, Rajasthani and Telugu. The synthesis is based on statistical modelling of vocoder parameters in a hidden Markov model (HMM) framework. HMM-based parametric speech synthesis still suffers from some voice-quality issues, but it generally produces very smooth prosody and allows easy manipulation of speaking style and speaker characteristics.
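As a rough illustration of the trajectory-generation step at the heart of this framework, the following minimal Python sketch solves the maximum-likelihood parameter generation (MLPG) problem described in references 5 and 6 below for a single one-dimensional vocoder stream: each frame carries a Gaussian over static and delta features, and a smooth static trajectory c is obtained from the normal equations (W' S^-1 W) c = W' S^-1 mu. All values are toy numbers; this is not the code of the submitted systems.

    # Minimal MLPG sketch for one 1-D parameter stream (toy values only).
    import numpy as np

    def mlpg_1d(means, variances):
        """Recover a smooth 1-D static trajectory from per-frame
        [static, delta] Gaussian means and variances (each shaped (T, 2))."""
        T = len(means)
        # W maps the static trajectory c (length T) to the stacked
        # static + delta observation vector (length 2T).
        W = np.zeros((2 * T, T))
        W[:T, :] = np.eye(T)                       # static window
        for t in range(T):                         # delta window 0.5 * (c[t+1] - c[t-1])
            if t > 0:
                W[T + t, t - 1] = -0.5
            if t < T - 1:
                W[T + t, t + 1] = 0.5
        mu = np.concatenate([means[:, 0], means[:, 1]])
        prec = 1.0 / np.concatenate([variances[:, 0], variances[:, 1]])
        A = W.T @ (prec[:, None] * W)              # W' S^-1 W
        b = W.T @ (prec * mu)                      # W' S^-1 mu
        return np.linalg.solve(A, b)               # smooth static trajectory

    # Toy example: a step change in the static mean (two "states"), deltas ~ 0.
    T = 20
    means = np.zeros((T, 2))
    means[:10, 0], means[10:, 0] = 1.0, 3.0
    variances = np.full((T, 2), 0.1)
    print(np.round(mlpg_1d(means, variances), 2))  # smooth transition around frame 10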


1. Natural speech samples: Several hundred utterances from each speaker are used as training material for the synthetic voices.

Indian-accented English speaker: Author of the danger trail, Philip Steels, etc.

Hindi speaker: इसे कई बार, मंचित भी किया गया है

Rajasthani speaker: पण आज नुंवै रो हाको घणखरो

Telugu speaker: అప్పుల ఊబిలో కూరుకుపోయిన రైతన్నలకు పెట్టుబడులే కనాకష్టం.



2. Synthetic speech samples: Samples of synthetic speech generated from parametric models trained on the speech data of (1).

Indian-accented English speaker: Quince seed gum is the main ingredient in wave-setting lotions.

Hindi speaker: लेकिन, डॉक्टर की भंगिमा मज़ाक की कतई नहीं थी

Rajasthani speaker: अठै प्रस्तुत है इण परिचरचा रा कीं अंश

Telugu speaker: మాజీ మండల టిడిపి అధ్యక్షుడు కంచర్ల హరినాయుడు అధ్యక్షతన జరిగిన సభలో, ఎమ్మెల్యే కొమ్మి మాట్లాడారు



3. Cross-lingually adapted synthetic speech samples: These samples are generated by adapting the English model set from (2) with the Hindi, Rajasthani and Telugu training data from (1), using a probabilistic mapping between the models of the source language and the sound segments of the target language. This adaptation does not require any modelling of the target language. A simplified sketch of the mapping idea follows the samples below.

Hindi speaker: Soon the office work claimed all her time.

Rajasthani speaker: We produce peanut oil, but to a much greater extent we eat the entire seed.

Telugu speaker: Oh, he'll be a plumber, came the answer.
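The Python sketch below is a deliberately simplified, hypothetical illustration of the mapping idea: frames of the target-language adaptation data are soft-assigned to the source-language (English) HMM states by acoustic likelihood, and these soft counts weight the adaptation statistics collected for each English state. The submitted systems estimate proper adaptation transforms via two-pass decision tree construction as in references 1 and 2 below; here a per-state weighted mean merely stands in for the transform, and all model parameters and frames are random toy values.

    # Hypothetical sketch of probabilistic mapping for cross-lingual adaptation.
    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_states = 4, 5

    # Source-language (English) state output Gaussians, diagonal covariance (toy values).
    state_means = rng.normal(size=(n_states, dim))
    state_vars = np.full((n_states, dim), 0.5)

    # Target-language adaptation frames (e.g. Hindi vocoder features, toy values).
    frames = rng.normal(size=(200, dim))

    # Log-likelihood of every frame under every source state's Gaussian.
    diff = frames[:, None, :] - state_means[None, :, :]           # (frames, states, dim)
    log_liks = -0.5 * np.sum(diff ** 2 / state_vars
                             + np.log(2 * np.pi * state_vars), axis=2)

    # Soft mapping: posterior over source states per frame (uniform state prior).
    post = np.exp(log_liks - log_liks.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                       # (frames, states)

    # Soft-count adaptation statistics: occupancy and weighted mean per source state.
    occupancy = post.sum(axis=0)                                  # (states,)
    adapted_means = (post.T @ frames) / occupancy[:, None]        # (states, dim)
    print("occupancy per English state:", np.round(occupancy, 1))
    print("mean shift per state:",
          np.round(np.linalg.norm(adapted_means - state_means, axis=1), 2))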



4. Mixed-language synthetic speech samples: The native synthesis models from (2) and the adapted models from (3) are combined to generate bilingual speech with uniform speaker characteristics and smooth transitions between languages. A sketch of how a code-mixed sentence can be routed to the two model sets follows the samples below.

Hindi speaker: शादी के side effects का first part कोई बहुत ज़्यादा खास कमाल नहीं दिखा पाया था

Rajasthani speaker: State government राज्य री जनता नै जिका promise करिया हा उणांनै पूरा करण री full मंसा राखै

Telugu speaker: వివిధ cell phone కంపెనీలు, యువతను ఆకట్టుకోడానికి, talk time offers ను ప్రకటిస్తున్నాయి
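The following Python sketch is a hypothetical illustration of the routing step only: the code-mixed input is split by script, Devanagari chunks go to a placeholder for the native Hindi models and Latin-script chunks to a placeholder for the cross-lingually adapted English models, and the per-chunk parameter streams are concatenated into one track. The generate_* functions are invented placeholders for HMM parameter generation followed by vocoding; how the submitted systems smooth durations and parameters across the language switch is not shown here.

    # Hypothetical sketch of routing a code-mixed sentence to two model sets.
    import re
    import numpy as np

    def generate_hindi(text, frames_per_char=5):
        """Placeholder for parameter generation from the native Hindi models."""
        return np.full((frames_per_char * len(text), 3), 1.0)

    def generate_english_adapted(text, frames_per_char=5):
        """Placeholder for parameter generation from the adapted English models."""
        return np.full((frames_per_char * len(text), 3), 1.1)

    def synthesize_mixed(sentence):
        """Split a mixed Devanagari/Latin sentence and concatenate the
        parameter trajectories produced by the matching model set."""
        chunks = re.findall(r"[\u0900-\u097F\s]+|[A-Za-z\s]+", sentence)
        streams = []
        for chunk in chunks:
            chunk = chunk.strip()
            if not chunk:
                continue
            if re.search(r"[\u0900-\u097F]", chunk):
                streams.append(generate_hindi(chunk))
            else:
                streams.append(generate_english_adapted(chunk))
        return np.concatenate(streams, axis=0)   # one continuous parameter track

    params = synthesize_mixed("शादी के side effects का first part")
    print(params.shape)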


Presentation on the techniques used (PDF)

References:

  1. M. Gibson and W. Byrne, "Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 895-904, May 2011.
  2. M. Gibson, T. Hirsimäki, R. Karhila, M. Kurimo, and W. Byrne, "Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2010, pp. 4642-4645.
  3. Simple4All Ossian front-end, http://homepages.inf.ed.ac.uk/owatts/ossian/html/index.html
  4. T. Raitio, A. Suni, J. Yamagishi, H. Pulakka, J. Nurminen, M. Vainio, and P. Alku, "HMM-based speech synthesis utilizing glottal inverse filtering," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 153-165, 2011.
  5. K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, "Speech synthesis based on hidden Markov models," Proceedings of the IEEE, vol. 101, no. 5, pp. 1234-1252, May 2013.
  6. H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039-1064, November 2009.
  7. K. Tokuda, H. Zen, J. Yamagishi, A. Black, T. Masuko, S. Sako, T. Toda, T. Nose, and K. Oura, "The HMM-based speech synthesis system (HTS)," 2008, http://hts.sp.nitech.ac.jp