Improving intelligibility of synthesized speech in noise with emphasized prosody
DOI:
https://doi.org/10.36505/ExLing-2010/03/0041/000161
Keywords:
TTS, speech synthesis, linear prediction, prosody, noisy speech
Abstract
The performance of current high-quality concatenative text-to-speech (TTS) systems is limited in noisy environments. This paper investigates whether the intelligibility of synthesized speech in noise can be improved by emphasizing the prosody. Additionally, the paper presents a method that can effectively emphasize the prosody of units in existing TTS databases. The circular linear prediction (CLP) model is combined with the constant-pitch transform (CPT) to perform pitch and duration modifications on concatenative TTS units with little impact on subjective quality. Test utterances are generated using the method and compared to reference utterances synthesized by a high-quality TTS engine. The subjective test results demonstrate a preference for emphasized prosody in the majority of the test cases.
License
Copyright (c) 2010 Sunil R. Shukla (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
Articles are published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.