Deep learning and intonation in Text to Speech systems
DOI:
https://doi.org/10.36505/ExLing-2019/10/0035/000397
Keywords:
Deep learning, artificial intelligence, text-to-speech, prosodic structure
Abstract
Although Recurrent Neural Networks deliver excellent results in Speech-to-Text and Text-to-Speech (TTS) applications, unsatisfactory synthetic sentence prosody remains one of the main causes of the quality gap between human and synthetic speech. The gap involves not only emotions or attitudes but also the prosodic structure, which determines how the listener processes the speech flow. This paper explores the theoretical and technical reasons for these difficulties and proposes an improved feature engineering approach for deep learning, based on an alternative model of sentence intonation and applied to French.
License
Articles are published under the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.