Speech is a one-to-many task: a given sentence can be articulated in many ways. However, current speech synthesis techniques always produce the same speech for the same text.
Existing statistical parametric speech synthesis (SPSS) techniques do make use of generative models (DNNs can be seen as providing mean predictions, with the variance computed over the training data); however, these models are not sampled from. Instead, maximum likelihood parameter generation (MLPG) is used to find the most likely sequence of predictions. This method is preferred because samples drawn from the models consistently sound less natural.
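The contrast between MLPG and sampling can be illustrated with a toy example. The following is a minimal sketch, not the implementation of any particular SPSS system: it assumes a single one-dimensional feature, a diagonal Gaussian over static and delta features, and a simple central-difference delta window. All function names (`delta_matrix`, `mlpg`, `sample_static`, `roughness`) and the numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Toy illustration only: names, window and values are assumptions, not a real SPSS system.

def delta_matrix(T):
    """Stack an identity (static features) and a central-difference
    delta window into a (2T x T) matrix W, so that o = W c."""
    I = np.eye(T)
    D = np.zeros((T, T))
    for t in range(T):
        D[t, min(t + 1, T - 1)] += 0.5   # clamp at the final frame
        D[t, max(t - 1, 0)] -= 0.5       # clamp at the first frame
    return np.vstack([I, D])

def mlpg(mu, var, W):
    """Most likely static trajectory c under N(W c; mu, diag(var)):
    solve (W^T P W) c = W^T P mu, where P = diag(1 / var)."""
    P = 1.0 / var
    A = W.T @ (P[:, None] * W)
    b = W.T @ (P * mu)
    return np.linalg.solve(A, b)

def sample_static(mu_static, var_static, rng):
    """Draw each frame independently from its static Gaussian."""
    return rng.normal(mu_static, np.sqrt(var_static))

def roughness(c):
    """Sum of squared frame-to-frame differences."""
    return float(np.sum(np.diff(c) ** 2))

T = 50
W = delta_matrix(T)
target = np.sin(np.linspace(0.0, 2.0 * np.pi, T))  # stand-in for predicted means
mu = W @ target                # static and delta means, consistent by construction
var = np.full(2 * T, 0.1)      # assumed constant variances

smooth = mlpg(mu, var, W)      # deterministic: same input, same output
rng = np.random.default_rng(0)
sample_a = sample_static(mu[:T], var[:T], rng)
sample_b = sample_static(mu[:T], var[:T], rng)

print("roughness (MLPG):  ", roughness(smooth))
print("roughness (sample):", roughness(sample_a))
```

In this sketch the MLPG trajectory is identical on every call, whereas the two samples differ from one another and are considerably more jagged from frame to frame. This mirrors, in a very simplified form, why naive sampling tends to sound less natural than the most likely sequence.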
Samples from these systems are worse than the state of the art. One approach to improving the production of appropriate-sounding prosody is to separate the process of choosing how to say the sentence from the waveform generation stage. This separation is possible due to the existence of conditional generative models such as WaveNet. Given the context of a sentence, we can determine the performance and then use it to drive waveform production, be it through a statistical system or a template-based system.