xvasynth

AI source separation - live action and echo-y voices now possible

Added 2021-06-22 10:09:36 +0000 UTC

If you recall the first release of the C&C: Kane voice, I mentioned plans to research some additional data pre-processing techniques to help improve the training data for voice models. This was motivated by unclean audio, such as the live action cutscenes from which I pulled data for Kane. The background humming and computer beeping in the training data made it into the trained model, meaning the quality wasn't amazing.

So this is where AI source separation comes in: an AI model is given some very noisy audio, and it aims to separate the speech from any non-speech sounds.

This new data pre-processing step works wonders for unclean data, such as Kane's, and the updated voice model released in the last interim batch speaks for itself (heh) - the quality of the voice is much, much better. With Kane's voice isolated, the voice model now synthesizes very clean audio, without any of the unrelated sounds present in the cutscene training data.

Here's a before/after comparison: https://soundcloud.com/user-965312292/kane-comparison

...

In fact, the audio source separation works so well that we can now even use audio with very loud music playing in the background. Recall also the GTA San Andreas K-DST host voice from the last batch. This voice was more of an experiment around this new technique, to see how far it can be pushed.

Here's a line from the game data for the radio host, before/after the AI source separation: https://soundcloud.com/user-965312292/kdst-comparison

...

Finally, an unexpected other use for this is echo removal. Voices with a heavy amount of echo are impossible to train a voice model on, because each letter's audio "bleeds" into the neighboring letters (temporal distortion). The model learns speech data at the letter level, so if a letter's speech data is spread out across other letters, the model has a hard time differentiating between them, and thus usually fails.
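
To make the "bleeding" a bit more concrete, here is a toy sketch in plain Python. It is not how the voice model or the source separation model works; the two 4-sample "letters", the `apply_echo` helper and the delay/decay values are all made up for illustration. It only shows that a simple echo pushes energy from one letter's time slot into the next one.

```python
def apply_echo(signal, delay, decay):
    """Toy feed-forward echo (hypothetical helper): every sample is
    repeated `delay` steps later, scaled by `decay`."""
    out = list(signal)
    for i, s in enumerate(signal):
        j = i + delay
        if j < len(out):
            out[j] += decay * s
    return out


def energy(samples):
    """Sum of squared samples, a rough measure of how much sound is present."""
    return sum(s * s for s in samples)


# Two made-up "letters" of 4 samples each: the first is loud, the second is silent.
letter_len = 4
dry = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
wet = apply_echo(dry, delay=3, decay=0.5)

dry_second = energy(dry[letter_len:])  # energy in the second letter's slot, no echo
wet_second = energy(wet[letter_len:])  # same slot, after the echo is applied

print("second letter energy without echo:", dry_second)
print("second letter energy with echo:   ", wet_second)
```

Without the echo the second letter's slot is silent; with it, that slot contains a quieter copy of the first letter, which is the kind of overlap that makes letters hard to tell apart.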

Here's a final comparison, with a VERY echo-y voice from Skyrim (MaleGhost) which I ruled out as impossible to train. Before/after the echo removal via the AI source separation model: https://soundcloud.com/user-965312292/ghost-comparison

...

This technique opens the door to quite a lot of possibilities. I've historically advised against trying to use data from anything live action, as background sounds would break the training. With this, I am changing my stance, as the model seems to do a really good job of cleaning the data. The de-echoing further enables training for voices that I previously deemed impossible (although the generated voices won't have an echo - that will need to be added back in post). The cleaning isn't perfect, as you can maybe tell from these comparisons, but the imperfections normally "come out in the wash" when enough training data is used, as the speech model averages out letter data.
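
As for adding the echo back in post, any audio editor will do, but here is a minimal sketch of the idea in plain Python. Everything in it is an assumption for illustration: the `add_echo` name, the default delay/decay/repeat values, and the input being a list of float samples in the range -1.0 to 1.0. It is not part of xVASynth.

```python
def add_echo(samples, sample_rate, delay_s=0.25, decay=0.5, repeats=3):
    """Mix delayed, progressively quieter copies of `samples` back into it.

    samples:     list of floats, assumed to be in the range [-1.0, 1.0]
    sample_rate: samples per second
    delay_s:     gap between successive echoes, in seconds
    decay:       volume multiplier applied once per echo
    repeats:     how many echoes to add
    """
    delay = int(round(delay_s * sample_rate))
    # Extend the output so the last echo has room to ring out.
    out = list(samples) + [0.0] * (delay * repeats)
    for k in range(1, repeats + 1):
        gain = decay ** k
        offset = k * delay
        for i, s in enumerate(samples):
            out[offset + i] += gain * s
    # Overlapping echoes can exceed full scale, so scale down if needed.
    peak = max((abs(x) for x in out), default=0.0)
    if peak > 1.0:
        out = [x / peak for x in out]
    return out


# A single click at a (deliberately tiny) sample rate of 4 Hz,
# echoed twice with half a second between echoes.
echoed = add_echo([1.0], sample_rate=4, delay_s=0.5, decay=0.5, repeats=2)
print(echoed)
```

A short delay with several repeats sounds closer to a room or cave, while a longer delay with one or two repeats sounds like a distinct echo, so the values are worth tuning by ear for the voice in question.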