WaveNet is a generative neural network that produces audio sample by sample. It's aimed primarily at speech, but it can also generate arbitrary audio such as musical instruments.
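To make the sample-by-sample idea concrete, here's a minimal sketch (my own simplification, not DeepMind's implementation) of a dilated causal convolution stack in PyTorch that predicts a distribution over the next 8-bit sample class and feeds its own predictions back in during generation. The `TinyWaveNet` name, layer sizes, and the plain ReLU residual blocks are illustrative assumptions; the real model uses gated activations, skip connections, and mu-law companded audio.

```python
# Sketch of the WaveNet idea: a stack of dilated causal 1-D convolutions
# predicts a categorical distribution over the next audio sample, and
# generation proceeds one sample at a time by feeding predictions back in.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveNet(nn.Module):
    def __init__(self, channels=32, levels=8, classes=256):
        super().__init__()
        self.classes = classes
        self.embed = nn.Conv1d(classes, channels, kernel_size=1)
        # Dilations 1, 2, 4, ... double the receptive field at each layer.
        self.dilations = [2 ** i for i in range(levels)]
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in self.dilations
        )
        self.out = nn.Conv1d(channels, classes, kernel_size=1)
        self.receptive_field = sum(self.dilations) + 1

    def forward(self, x):
        # x: (batch, time) of integer sample classes (e.g. 8-bit quantized audio).
        h = self.embed(F.one_hot(x, self.classes).float().transpose(1, 2))
        for conv, d in zip(self.convs, self.dilations):
            # Left-pad so the convolution is causal: no future samples leak in.
            h = h + torch.relu(conv(F.pad(h, (d, 0))))
        return self.out(h)  # (batch, classes, time) logits for the next sample

@torch.no_grad()
def generate(model, n_samples, seed=None):
    """Autoregressive, sample-by-sample generation."""
    audio = seed if seed is not None else torch.full((1, 1), 128, dtype=torch.long)
    for _ in range(n_samples):
        context = audio[:, -model.receptive_field:]
        logits = model(context)[:, :, -1]          # distribution over next sample
        nxt = torch.multinomial(F.softmax(logits, dim=-1), 1)
        audio = torch.cat([audio, nxt], dim=1)     # feed the prediction back in
    return audio

model = TinyWaveNet()
print(generate(model, 100).shape)  # torch.Size([1, 101])
```

An untrained model just produces noise, of course; the point is that each new sample is drawn from a distribution conditioned only on the samples that came before it.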
One interesting thing I've noticed about WaveNet is that the training data doesn't need to be aligned at the phoneme level the way HTS-style statistical parametric systems require [1]. The training data consists of small WAV files with transcribed text, and the transcriptions carry no timing information or annotations beyond the words themselves. There may be a pre-processing step that converts the text into phonemes or some other intermediate representation, but I also wouldn't be entirely surprised if that isn't the case. In a demo they showed, the model is capable of learning different pronunciations based on a word's position and context in the sentence.
Here's an official NVIDIA publication that goes over the architectures of Tacotron 2 and WaveGlow, shows how to set them up, and compares their performance: https://ngc.nvidia.com/catalog/resources/nvidia:tacotron_2_and_waveglow_for_pytorch/performance
[1] https://wiki.aalto.fi/display/ITSP/Statistical+parametric+speech+synthesis