xvasynth

Technical post #? - New AI dataset pre-processing tools

Added 2021-11-11 18:01:25 +0000 UTC

I recently spent some time putting together and finalizing a series of new tools for data preparation/pre-processing. I've already finished integrating them into xVATrainer (which is how I use them now!), so I thought I'd quickly cover them here.

Some of these build heavily on the previous technical post, so check that out if you haven't already.

Speaker clustering

As alluded to in that previous post, I've now finalized a speaker clustering script, making use of speaker embeddings, averaged out across dataset samples. Before this, I had to clean extreme samples out of the data manually, or through roundabout methods based on empirical observations of the patterns in the data.

This is a thing of the past. With the speaker clustering, I can now run the whole dataset through the embedding extraction and clustering, and only pull out the data samples from the clusters of "normal" speech - leaving out clusters of random vocalization, battle cries, laughter, whispers, etc. This results in a much cleaner dataset of only voice lines similar to what you'd expect to hear when synthesizing a new line.
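To make the idea concrete, here is a minimal sketch of clustering per-file speaker embeddings. It is not the xVATrainer script: it assumes the embeddings have already been extracted into a plain dict of file name to vector, and it swaps in a small cosine-similarity k-means with farthest-point initialization as a stand-in for whatever clustering the real tool uses. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_embeddings(embeddings, k, iterations=10):
    """Group file names into k clusters by cosine similarity of their embeddings.

    embeddings: dict of file name -> embedding vector (assumed pre-extracted).
    Returns a list of k lists of file names.
    """
    if k < 1 or not embeddings:
        raise ValueError("need at least one cluster and one embedding")
    names = sorted(embeddings)
    # Farthest-point initialization: start from the first file, then keep
    # adding the file that is least similar to every centroid chosen so far.
    centroids = [embeddings[names[0]]]
    while len(centroids) < k:
        far = min(names, key=lambda n: max(cosine(embeddings[n], c) for c in centroids))
        centroids.append(embeddings[far])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for name in names:
            best = max(range(k), key=lambda i: cosine(embeddings[name], centroids[i]))
            clusters[best].append(name)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        centroids = [
            mean_vector([embeddings[n] for n in members]) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters


# Toy 3-dimensional "embeddings"; real speaker embeddings are much larger.
toy = {
    "line_01.wav": [0.90, 0.10, 0.00],
    "line_02.wav": [0.80, 0.20, 0.10],
    "line_03.wav": [0.85, 0.15, 0.05],
    "laugh_01.wav": [0.00, 0.10, 0.90],
    "laugh_02.wav": [0.10, 0.00, 0.95],
}
for index, members in enumerate(cluster_embeddings(toy, k=2)):
    print(index, members)
```

Which clusters count as "normal" speech still needs a listen-through; the sketch only does the grouping.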

The clustering also leaves room for some future experiments in grouping up data samples of different emotions for a speaker - but this is for later.

Another thing this helps with, as shown in that previous post, is separating the different speakers in a very large, mixed dataset. This clustering script was instrumental in getting the Tyreen and Troy voices from Borderlands done. The audio files from that game are all mixed into the same folder, with no pattern or indication of which lines belong to which voice.

Speaker similarity search

Another task mentioned in the previous post is nearest neighbour search. This is another tool now finalized and ready for use. Built to accommodate either a single query data sample or several, this allows one to pull out similar voice lines from a large corpus of audio files, based on some that you already have.
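As a rough illustration of the search step, the sketch below ranks a corpus by cosine similarity to the query. It assumes embeddings are already available as plain lists of floats, and it handles several query samples by averaging them into one query vector, which is only one plausible way to combine them and may not be what the actual tool does. The function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query_embeddings, corpus, top_n=5):
    """Return the top_n (file name, similarity) pairs, most similar first.

    query_embeddings: list of one or more query vectors.
    corpus: dict of file name -> embedding vector.
    """
    if not query_embeddings:
        raise ValueError("need at least one query embedding")
    # Assumption: multiple queries are combined by simple averaging.
    query = [sum(col) / len(query_embeddings) for col in zip(*query_embeddings)]
    scored = [(name, cosine(query, emb)) for name, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy corpus with 2-dimensional vectors, purely for illustration.
corpus = {"a.wav": [1.0, 0.0], "b.wav": [0.0, 1.0], "c.wav": [0.9, 0.1]}
for name, score in similarity_search([[1.0, 0.0]], corpus, top_n=2):
    print(name, round(score, 3))
```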

Speaker cluster similarity search

Ok, now for the best of both worlds - doing speaker similarity search, but using clusters of audio as data points, rather than specific audio files. In theory, individual file similarity search should return all the audio lines for the speaker in the query audio files. But data variance can mean that you still get a few lines from other speakers, or that you lose some good samples because they sit a little further from the query than lines from other speakers do.

So once you have performed speaker clustering on your large data corpus, you can run cluster similarity search with your query sample(s) to return the clusters sorted by similarity. Then you pick the top cluster, or the top several, which should belong to the same speaker as the query.
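The cluster-level search can be sketched the same way. This version assumes each cluster is represented by the mean of its members' embeddings and that the query samples are averaged into a single vector; both are illustrative choices rather than a description of the xVATrainer implementation, and the names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def rank_clusters(query_embeddings, clusters):
    """Return (cluster id, similarity) pairs sorted from most to least similar.

    query_embeddings: list of one or more query vectors.
    clusters: dict of cluster id -> list of member embedding vectors.
    Empty clusters are skipped.
    """
    query = mean_vector(query_embeddings)
    scored = [
        (cluster_id, cosine(query, mean_vector(members)))
        for cluster_id, members in clusters.items()
        if members
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy clusters; in practice these come out of the clustering step.
clusters = {
    "cluster_0": [[0.9, 0.1], [0.8, 0.2]],
    "cluster_1": [[0.1, 0.9], [0.2, 0.8]],
    "cluster_2": [[0.6, 0.4]],
}
for cluster_id, score in rank_clusters([[1.0, 0.0]], clusters):
    print(cluster_id, round(score, 3))
```

Comparing against cluster means rather than individual files is what smooths over the per-file variance described above.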

This is the full approach I took for the Borderlands voices, where I was already given a few of the annotated data samples, but we knew there to be more available in the game files. I extracted speaker embeddings for all audio lines in the game, performed clustering on them, then used the few reference audio lines for each of the two characters to run cluster similarity search for the two voices, separately. Examining the top few clusters, I pulled the remaining voice lines from the game for each character, for their complete datasets.

If you ever need to do something similar, and you have a quote of what a character says in a game, you can first run the Auto-Transcription tool on the entire game audio data, and Ctrl+F the quote transcript to find the audio file. Then proceed to either run the speaker similarity search on all the data, or cluster the dataset, and run the cluster similarity search.

Audio normalization

Not an AI tool, but definitely an important one. This tool is super useful in data pre-processing, especially when you have data from multiple sources - something I previously advised against. It won't always be enough, but this tool can help bring all the lines to a similar standard of "loudness", to make them more consistent with each other.
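For a sense of what bringing lines to a similar loudness involves, here is a minimal sketch using simple RMS normalization on samples already decoded to floats in the range -1.0 to 1.0. The post does not say which loudness measure the tool uses, so RMS, the -20 dBFS target and the function names are all assumptions made for illustration; a perceptual measure such as LUFS would be a more faithful notion of loudness.

```python
import math


def rms_dbfs(samples):
    """RMS level of float samples in dB relative to full scale (-inf for silence)."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def normalize_rms(samples, target_dbfs=-20.0, peak_limit=1.0):
    """Scale samples so their RMS level matches target_dbfs.

    The gain is capped so that no sample exceeds peak_limit, which means a
    very peaky file may end up quieter than the target instead of clipping.
    Silent or empty input is returned unchanged.
    """
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    peak = max(abs(s) for s in samples)
    if peak * gain > peak_limit:
        gain = peak_limit / peak
    return [s * gain for s in samples]


# Two synthetic sine "recordings" at very different levels.
quiet = [0.01 * math.sin(2 * math.pi * i / 50) for i in range(1000)]
loud = [0.50 * math.sin(2 * math.pi * i / 50) for i in range(1000)]
for label, clip in (("quiet", quiet), ("loud", loud)):
    print(label, round(rms_dbfs(clip), 2), "->", round(rms_dbfs(normalize_rms(clip)), 2))
```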

I would advise against running this if you're operating on voice lines that came straight from a game, and you plan on putting generated lines back into the game via a mod. In that case you'd usually rather not change what the voice sounds like, so that it stays as close as possible to the other game files and doesn't stick out - even if the normalized audio is better.

This tool was extremely helpful when I was preparing the dataset for the 80s Trailer Guy voice (the first voice in the v2.0 showcase video). That voice was incredibly difficult to get data for, but this tool at least made it not impossible.

These and a couple of other things are all now in xVATrainer, which I'm aiming to make the only tool you'll need for everything involved in getting a voice model trained.