Adversarial methods for speech synthesis


Jennifer Williams


There is a need for highly customizable text-to-speech (TTS) technology to suit a variety of consumer audiences. Our approach to TTS development aims to make synthetic speech highly customizable while retaining model robustness. It is based on a type of machine learning called adversarial learning, in which a model is trained with feedback from a discriminator, or interrogator. One such deep learning framework is the Generative Adversarial Network (GAN). We propose to use GANs with speech-related discriminator adversaries to influence the quality and flexibility of our TTS output. With GANs, we are free to choose the representation used by the model as well as the representation used by the discriminator. This is important for our work because we seek to couple two or more target speech technologies, such as speaker identification and TTS, or speaking style and TTS.
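As a rough illustration of the feedback loop described above, the following minimal sketch computes the standard GAN losses from a discriminator's output probabilities. The function names and the example probabilities are assumptions made for illustration, and the non-saturating generator loss is one common choice rather than necessarily the objective used in this project.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for a single predicted probability."""
    eps = 1e-12  # guard against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def discriminator_loss(p_real, p_fake):
    """The discriminator wants real speech scored 1 and generated speech scored 0."""
    return bce(p_real, 1.0) + bce(p_fake, 0.0)

def generator_loss(p_fake):
    """Non-saturating loss: the generator wants its output scored as real."""
    return bce(p_fake, 1.0)

# Hypothetical discriminator outputs (probability that the input is real speech).
print(discriminator_loss(p_real=0.9, p_fake=0.2))
print(generator_loss(p_fake=0.2))
```

The generator's loss falls as the discriminator becomes more convinced that the generated speech is real, which is the feedback signal that drives training.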

We will learn how to detect whether or not a speech signal has been modified in some way. We will explore how to train and optimize our GAN-based system as a TTS system that generates either waveforms directly or components of the speech signal that can be passed on to a high-quality vocoder. In particular, we want to determine which types of adversaries (i.e., interrogators) have the greatest impact on TTS output. The adversaries we are considering include speaker identification, accent, age, gender, pitch, duration, and other high-level speech signal features.

The final phase of our work is to explore coupling multiple adversaries together in a GAN system for TTS generation. This phase involves more machine learning design and optimization than our earlier tasks, which rely more on signal processing features. To our knowledge, this type of advanced machine learning architecture has not yet been explored for speech generation. We are interested in using a multi-adversarial approach to optimize our speech generation across various targets. For example, we may train a multi-adversary TTS system that simultaneously learns speaker identification while also monitoring pitch. We can explore ablation groups of adversaries and look for optimal ways to initialize and train such a system for speech generation. We believe that outcomes from this phase could contribute novel ideas to both machine learning and TTS.
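One simple way to picture the multi-adversary setup is sketched below. It assumes that the per-adversary losses are combined as a weighted sum and that ablation groups are the non-empty subsets of the available adversaries. The weighting scheme, the adversary names, and the numeric values are all hypothetical and are not taken from the project itself.

```python
from itertools import combinations

def combined_generator_loss(adversary_losses, weights):
    """Weighted sum of the per-adversary losses seen by the generator."""
    return sum(weights[name] * loss for name, loss in adversary_losses.items())

def ablation_groups(adversaries):
    """Every non-empty subset of adversaries, smallest groups first."""
    groups = []
    for size in range(1, len(adversaries) + 1):
        groups.extend(combinations(adversaries, size))
    return groups

# Hypothetical loss values and weights, for illustration only.
losses = {"speaker_id": 0.8, "pitch": 0.3}
weights = {"speaker_id": 1.0, "pitch": 0.5}
total = combined_generator_loss(losses, weights)

groups = ablation_groups(["speaker_id", "pitch", "accent"])
print(total)
print(len(groups))
```

Each group returned by ablation_groups could correspond to one training run, so that the contribution of an individual adversary can be estimated by comparing runs with and without it.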

Our final TTS system should deliver quality at or above that of commercially available TTS products, while running in real time and producing highly customized synthetic speech. Our exploration of multiple interrogators could potentially lead to new theoretical contributions to the field. Our work, and what we learn in this project, has the potential to improve quality of life for individuals who rely on TTS for accessibility. TTS for accessibility purposes must allow a user to control many different aspects of the speech output, ranging from expressing sincerity to calling for help. Our work aims at this type of flexibility.

Supervisors: Simon King and Steve Renals