xvasynth

Technical post - explainer on the models used

Added 2021-02-20 15:40:00 +0000 UTC

I thought I'd quickly write up a semi-technical post, to explain the models used, in more detail. This should be useful to understand the current plans, as well as future plans for how voices are released.

There are four models being used in the xVA process (for now). The models are Tacotron2, FastPitch, HiFi-GAN, and WaveGlow, and they can be categorised as follows:

Pre-processing: Tacotron2
Sentence generation: FastPitch
Audio generation (vocoder): WaveGlow, HiFi-GAN

There are two main factors of the audio quality which we are concerned with:

1) The sentence construction quality

This is how well the app is able to form words and sentences. If the quality of this factor is not good, generating something good may take a bit of trial-and-error, tweaking the spelling and the punctuation of the input text.

2) The... audio quality?

So basically how clear the audio is, and if there is any reverb, etc.

The pre-processing step helps with 1), and the vocoder quality helps with 2).

FastPitch:

This model is the star of the show. This is the component which puts data together to form words and sentences. It also is the model which I hijack intermediate values from, to build the pitch and durations editor.

This model doesn't actually produce any audio, however. To make it easier/possible for it to get trained, the model is trained to go just from text to mel spectrograms. These spectrograms are a simpler way to represent audio data internally. This is an example I found online of what a mel spectrogram looks like:

During training, the model is told to re-create these representations of audio from text/spectrogram pairs. To get a mel spectrogram to use as reference during training, a non-ML algorithm is used, which simply projects the audio into this smaller representation.
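To give a rough feel for that non-ML projection, here is a minimal pure-Python sketch for a single frame of audio: window it, take the power spectrum, then pool the frequency bins into a few mel-spaced bands. The function names, the frame size and the band count are illustrative assumptions, not the settings used for xVA training, and a real pipeline would use an optimised audio library rather than this slow loop.

```python
import cmath
import math

def hz_to_mel(hz):
    # Common (HTK-style) formula for the mel scale.
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    n = len(frame)
    # Hann window to reduce leakage at the frame edges.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        # Naive DFT, fine for a demo but far too slow for real use.
        acc = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                  for i, w in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def mel_band_edges(n_mels, sample_rate):
    # n_mels + 2 points, evenly spaced on the mel scale up to Nyquist.
    top = hz_to_mel(sample_rate / 2)
    return [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]

def mel_frame(frame, sample_rate, n_mels):
    """Project one frame of samples onto n_mels log-energy values."""
    power = power_spectrum(frame)
    n = len(frame)
    edges = mel_band_edges(n_mels, sample_rate)
    bands = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        energy = 0.0
        for k, p in enumerate(power):
            f = k * sample_rate / n  # centre frequency of DFT bin k
            if lo < f <= mid:        # rising side of the triangular filter
                energy += p * (f - lo) / (mid - lo)
            elif mid < f < hi:       # falling side
                energy += p * (hi - f) / (hi - mid)
        bands.append(math.log(energy + 1e-9))
    return bands
```

Running this over every frame of a clip and stacking the results gives the 2D picture that gets called a mel spectrogram: here 256 samples are squashed down to 10 numbers per frame.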

WaveGlow:

The next part of the process is the vocoder. A vocoder is a model which looks at these mel spectrograms and creates the actual .wav audio file. There are a few different vocoders out there, but FastPitch was researched with WaveGlow as the vocoder, so by default, that is also the model I used.

While FastPitch needs to be trained on a per-voice basis, the WaveGlow vocoder does fairly well (though not always) on most voices you throw at it, as it only cares about turning the spectrogram data it was given into audio.

So to generate audio from text, an input text sequence is given to FastPitch, which gives us a mel spectrogram. We then give this mel spectrogram to WaveGlow, which creates the .wav file from it.
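As a rough sketch of that two-stage flow (not the actual xVA code), the two models can be thought of as plain functions chained together. The stand-in models below are fake placeholders, and the 80 mel bands, 256 samples per frame and 22050 Hz sample rate are assumed values for illustration only.

```python
import io
import math
import struct
import wave
from typing import Callable, List

Mel = List[List[float]]                  # one list of mel band values per frame
AcousticModel = Callable[[str], Mel]     # e.g. FastPitch: text -> mel spectrogram
Vocoder = Callable[[Mel], List[float]]   # e.g. WaveGlow: mel spectrogram -> samples

def synthesize(text: str, acoustic_model: AcousticModel, vocoder: Vocoder) -> List[float]:
    mel = acoustic_model(text)   # stage 1: text -> mel spectrogram
    return vocoder(mel)          # stage 2: mel spectrogram -> waveform samples

def to_wav_bytes(samples: List[float], sample_rate: int = 22050) -> bytes:
    """Pack float samples in [-1, 1] into 16-bit mono WAV data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
        )
        wav.writeframes(pcm)
    return buffer.getvalue()

# Hypothetical stand-ins so the sketch runs without any trained models.
def fake_fastpitch(text: str) -> Mel:
    return [[0.0] * 80 for _ in text]    # pretend: one 80-band frame per letter

def fake_vocoder(mel: Mel) -> List[float]:
    n_samples = len(mel) * 256           # pretend: 256 audio samples per frame
    return [math.sin(2 * math.pi * 220 * i / 22050) for i in range(n_samples)]

audio = synthesize("hello", fake_fastpitch, fake_vocoder)
wav_data = to_wav_bytes(audio)
```

The useful part of this split is that the vocoder slot can be swapped (WaveGlow, HiFi-GAN) without touching the per-voice model in front of it.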

I included two versions of this WaveGlow model with xVA - the normal model we are advised by FastPitch to use, and a bigger version of this model, which sometimes helps push the quality further (though not always, hence the option).

Tacotron2:

While I said that FastPitch is trained on text/spectrogram pairs, that is not the complete picture. In addition to these two bits of data, each training sample also contains a list of values for the pitch and the duration of each of the letters. Unlike the text transcript, which is available from the game files, this extra (quite detailed) information is not available.
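To make the shape of that data concrete, here is a hypothetical sketch of what one training sample could look like. The class and field names are made up for illustration and are not taken from the FastPitch code; the checks just encode the "one value per letter" idea described above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    text: str                  # transcript (from the game files)
    mel: List[List[float]]     # mel spectrogram, one list of band values per frame
    durations: List[int]       # number of mel frames each letter lasts
    pitch: List[float]         # average pitch (Hz) of each letter

    def validate(self) -> None:
        if len(self.durations) != len(self.text):
            raise ValueError("need exactly one duration per letter")
        if len(self.pitch) != len(self.text):
            raise ValueError("need exactly one pitch value per letter")
        if sum(self.durations) != len(self.mel):
            raise ValueError("durations must add up to the number of mel frames")

sample = TrainingSample(
    text="hi",
    mel=[[0.0] * 80 for _ in range(5)],   # 5 frames, 80 bands (assumed size)
    durations=[2, 3],                     # 'h' lasts 2 frames, 'i' lasts 3
    pitch=[0.0, 180.0],
)
sample.validate()  # passes silently when everything lines up
```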

To get this, we have to make use of another, older (but still quite good) text-to-speech model, Tacotron 2. It doesn't have the same features as FastPitch, but it CAN be trained to build these spectrograms without the need for the pitch and duration values.

Once one of these T2 models is trained on a voice, you can pass the data through it and get the duration information from intermediate values in the model. So as a pre-processing step for FastPitch data, a Tacotron 2 model is first trained, and then the duration data is extracted for the FastPitch training data.
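One common way to do this kind of extraction, assuming the "intermediate values" are the attention alignment between mel frames and letters, is to give each frame to the letter it attends to most and then count frames per letter. This is a simplified sketch with a hypothetical function name, not the exact xVA pre-processing code.

```python
from typing import List

def durations_from_alignment(alignment: List[List[float]], n_chars: int) -> List[int]:
    """alignment[frame][char] = attention weight of that letter for that frame.

    Returns the number of frames assigned to each letter.
    """
    durations = [0] * n_chars
    for frame_weights in alignment:
        # Assign this frame to the letter with the highest attention weight.
        best_char = max(range(n_chars), key=lambda c: frame_weights[c])
        durations[best_char] += 1
    return durations

# Toy alignment: 5 mel frames attending over 3 letters.
toy_alignment = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.6, 0.2],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
toy_durations = durations_from_alignment(toy_alignment, 3)  # [2, 1, 2]
```

Because every frame is assigned to exactly one letter, the durations always add up to the total number of frames, which is what the training data needs.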

The pitch data can be calculated without an ML model, using a tool called parselmouth. Once we have durations data, we can split the audio into its individual letters, and this algorithm can calculate the average pitch across these fragments of audio.
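A minimal sketch of that averaging step, assuming we already have a per-frame pitch track (for example from parselmouth, with 0 meaning "no pitch detected") sampled at the same frame rate as the durations. The function name is hypothetical, and skipping unvoiced frames is an assumption about how the averaging is done.

```python
from typing import List

def average_pitch_per_letter(frame_pitch: List[float], durations: List[int]) -> List[float]:
    """Average the per-frame pitch (Hz) over the frames belonging to each letter."""
    averages = []
    start = 0
    for n_frames in durations:
        segment = frame_pitch[start:start + n_frames]
        voiced = [p for p in segment if p > 0]   # ignore unvoiced frames
        averages.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += n_frames
    return averages

# 5 frames of pitch, split over 3 letters lasting 2, 2 and 1 frames.
letter_pitch = average_pitch_per_letter([100.0, 120.0, 0.0, 0.0, 200.0], [2, 2, 1])
# -> [110.0, 0.0, 200.0]
```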

Prior to the new GPU upgrade, I wasn't able to train one of these models myself. I was forced to just use a provided one-size-fits-all checkpoint pre-trained on a female voice. While this was fine for a few female voices which sounded similar to the voice that the checkpoint was trained on, the more different a voice sounded from it, the more inaccurate these duration values were. This led to bad sentence construction quality, which meant a lot of tweaking and playing with spelling/punctuation to get things right.

However, since the upgrade I can train these models, and I have been training/fine-tuning a Tacotron 2 model for every voice before doing the pre-processing step. This has completely solved the problem with sentence construction for most, if not all, voices.

HiFi-GAN:

I've had my eyes on this model for some time (I think since it was initially announced). This is another vocoder model, but it has a couple of benefits over the older WaveGlow model:

It is MUCH smaller in file size (at about 50MB instead of the 600MB/1GB of WaveGlow/WaveGlow BIG)
It is MUCH faster than WaveGlow

However, just like Tacotron2 and WaveGlow, training such a model myself from scratch was unachievable on the old hardware. So for the v1.0 release, I included this model as a "quick-and-dirty" option: while the one-size-fits-all pre-trained model wasn't as good as the one-size-fits-all WaveGlow model, it was definitely much faster.

As you may have guessed, following the new hardware upgrade, training such a HiFi-GAN model myself became possible. By fine-tuning one of these models on specific voices, the audio quality (factor 2) improves a good amount - and a drastic amount for voices where the WaveGlow model didn't do so well.

So after fine-tuning these bespoke HiFi-GAN models, the benefits are:

much faster generation time than WaveGlow
much smaller file size than WaveGlow
much better quality than WaveGlow

It almost makes having a GPU version of xVA unnecessary. I can imagine a future version of xVA where there is just one type (no CPU-only/GPU+CPU split), and WaveGlow models are abandoned completely.

The downside is that this model is veery slow to train. To reduce the waiting time to get one of these out for every voice, the approach I am going to take is to wait until I have a very large number of voices trained, train a model on all of them together, and then do (hopefully) shorter training on each voice separately.

---

So going forward, whenever you see these Tacotron2 / HiFi-GAN updates, this is what I am referring to. When bespoke vocoders are available, you will be able to select them from a dropdown, and there will be a yellow 🗲 symbol next to the dropdown when you click the voice.