Technical post - new data cleaning/filtering
Added 2021-01-16 11:38:10 +0000 UTC
The quantity of data used during training is very important for getting a model to converge well, but the quality of that data matters even more. For that reason, I use a number of data cleaning/filtering scripts to make sure the quality of the data is as high as possible. Some scripts adjust the audio, and others filter out lines.
Having given it some more thought, I came up with an additional filtering step which I hope will help clean out the data further, which I'll describe in this post.
If you're unfamiliar with the inner workings of the model, have a look at the brief overview in this article I wrote about it:
https://becominghuman.ai/generating-neural-speech-synthesis-voice-acting-using-xvasynth-fc978fdf24c1
The parts we are interested in here are the pitch and duration vectors, created during the pre-processing steps. The new filtering script iterates through all lines and calculates statistics on these duration and pitch values, which can then be used downstream for filtering and anomaly detection.
There are more sophisticated approaches I could take, but for now I have used these statistics for a simple ordering of the lines, based on the average distance between each line's per-letter pitch/duration values and the averages for those letters. So, given a line saying "Hello", and a record of the average pitch/duration values for each letter for a specific voice, I can calculate a number for how well this audio line matches the average pitch/pacing of the voice.
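As a rough sketch of the scoring idea (not the actual filtering script), the per-letter averages and a per-line distance could be computed along these lines. The function names and the input layout (one pitch and one duration value per letter) are assumptions made for illustration, and mean absolute difference is only one possible choice of distance.

```python
from collections import defaultdict
from statistics import mean


def letter_averages(lines):
    """Average pitch and duration per letter for one voice.

    `lines` is an iterable of (text, pitches, durations) tuples, where
    pitches and durations hold one value per letter of text. This layout
    is an assumption for the sketch.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # letter -> [pitch sum, duration sum, count]
    for text, pitches, durations in lines:
        for ch, p, d in zip(text, pitches, durations):
            acc = sums[ch]
            acc[0] += p
            acc[1] += d
            acc[2] += 1
    return {ch: (ps / n, ds / n) for ch, (ps, ds, n) in sums.items()}


def line_distance(text, pitches, durations, averages):
    """Mean absolute distance of a line's per-letter values to the voice averages.

    Returns (pitch_distance, duration_distance); lower means more typical.
    """
    pitch_dist = mean(abs(p - averages[ch][0]) for ch, p in zip(text, pitches))
    dur_dist = mean(abs(d - averages[ch][1]) for ch, d in zip(text, durations))
    return pitch_dist, dur_dist
```

Keeping pitch and duration as two separate numbers makes it possible to sort or filter on either one; they could also be normalised and combined into a single score.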
Most letters are fairly consistent, though there are letters where this doesn't work as well:

Calculating such a number for every line of audio allows me to sort them by this likeness score, as you can see in the final plot in the figure below:

The x axis here is the index of the audio file after sorting. It's good to see that the majority of the audio files are similar in likeness, but the tail end of this plot reveals a number of samples which are very dissimilar to the voice. Sure enough, listening to these samples, they are all (un-labelled) strange coughs, loud screams, very unusual voice acting styles, and all manner of other deviations from the norm that may throw off the quality of the model in some places.
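The sorting and tail inspection can be sketched roughly as below. The helper names and the 5% cut-off are hypothetical, chosen only for illustration; in practice the cut-off would be picked by looking at the plot and listening to the flagged files.

```python
def rank_by_distance(scores):
    """Sort files from most typical to least typical.

    `scores` maps a file name to its distance from the voice averages
    (hypothetical input format for this sketch).
    """
    return sorted(scores.items(), key=lambda item: item[1])


def tail_candidates(ranked, fraction=0.05):
    """Return the file names in the most dissimilar `fraction` of the ranking.

    These are candidates for manual review, not automatic deletion.
    The default fraction is an arbitrary illustrative value.
    """
    n_tail = max(1, int(len(ranked) * fraction))
    return [name for name, _ in ranked[-n_tail:]]
```

Only flagging a small tail, rather than everything above the average, leaves the moderately unusual lines in place.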
Re-running this visualisation after cleaning out this extreme data, the plot looks a bit more sensible:

There are still some dissimilar values, but we don't want to remove all of these, as we still want variation in the voice for lines containing questions, surprise, and such. Removing these would lead to a very monotone voice.
--
I've added this pre-processing step to the list, and I'll try it soon to see if it ends up making a noticeable difference. I'm expecting it to be especially useful for the male voices, where the range of pitch values is larger than for the female voices. The duration filter should in theory be equally useful for both.