xvasynth

(very) Technical post - Representation learning of speakers

Added 2021-07-02 10:17:32 +0000 UTC

Ok, this post will go a little deeper than the others. I've been working on something else in the background, called representation learning. This aligns very closely with what I do day-to-day as part of my PhD, so it's been quite a bit of fun.

What representation learning is

Representation learning is a whole massive field of deep learning. It encompasses the work we do to make neural networks condense information about something into a very small, compressed set of numbers. We call this small space an "embedding" - as in, we embed an image (or something) into this small space. Embeddings are typically just a vector (list) of numbers. A typical embedding size might be 256 numbers, and the embedding is usually just the output of a layer in the neural network - either from the middle somewhere, or the final output layer.

Representation learning is extremely useful because with it, a neural network is able to pull the most important information from much higher dimensional data, and represent the unique parts in a (comparatively) very small set of numbers. In this specific case, we consider audio data, which as a .wav file can have tens of thousands of numbers for every second of audio - not something that we can work with very easily. The compression process, when done well, also discards any irrelevant data we don't care about ("noise" - not the same meaning as audio noise).

Moreover, when trained in the correct way, embeddings learnt under a representation learning setting are metric - what this means is that distances between vectors are meaningful, and as you interpolate from one vector to another, the intermediate values have a meaningful transition.

Each of these 256 dimensional vectors occupies a point in a 256 dimensional space - the same way an xyz coordinate occupies a point in a 3 dimensional space, or an xy coordinate occupies a point in a 2 dimensional space (like on a map or something). Spaces with more than 3 dimensions are referred to as high dimensional spaces. So the metric properties of these embeddings mean that you can meaningfully move between these points, with the "meaning" of the intermediate locations still making sense.

If you've ever seen some of these GAN interpolation .gifs, where images smoothly interpolate between animals or people, this is the same principle - points in the GAN's representation space are interpolated between (eg from cat to tiger, then from tiger to lion), and all points in between still produce sensible results, with gradually changing meaning.
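As a rough sketch of what "interpolating between points" means in practice, here is a minimal linear interpolation between two embeddings. The vectors are made-up 4-dimensional toy values (real embeddings might have 256 numbers), and the names `cat`, `tiger` and `interpolate` are just for illustration. In a GAN, each intermediate vector would then be passed through the generator to produce an image.

```python
def interpolate(start, end, t):
    """Linearly interpolate between two embeddings.

    t=0 returns `start`, t=1 returns `end`, values in between
    return points on the straight line joining them.
    """
    if len(start) != len(end):
        raise ValueError("embeddings must have the same size")
    return [(1.0 - t) * a + t * b for a, b in zip(start, end)]


# Toy embeddings with made-up values (assumption: not real model outputs).
cat = [0.0, 1.0, 0.0, 2.0]
tiger = [1.0, 0.0, 2.0, 2.0]

# Five evenly spaced points on the path from `cat` to `tiger`.
steps = [interpolate(cat, tiger, i / 4) for i in range(5)]
for point in steps:
    print(point)
```

The middle step here is `[0.5, 0.5, 1.0, 2.0]`, exactly half way between the two toy vectors. Whether that half-way point "means" something sensible depends entirely on how well the embedding space was trained.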

These metric spaces are typically learnt using different training methods from the ones you may know about for neural networks. Instead of telling a model explicitly what to output at the end, for each data sample, we instead use comparative losses such as Triplet or Contrastive losses. These take pairs (or triplets) of data. With triplets, they tell the model to bring two similar data points closer together in this embedding space, while pushing a third, dissimilar point away. Repeat this for many data points, and many combinations, and you end up with a space where similar data points are close together in the embedding space, while larger groups of data are separate from each other.
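To make the triplet idea a bit more concrete, here's a minimal sketch of one common form of the triplet loss, written in plain Python. The Euclidean distance and the margin of 1.0 are assumptions for illustration, not necessarily what I use in my own training code, and the 2D points are made up.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is at least `margin` further
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]

# Badly arranged: the similar point is far away (distance 5),
# the dissimilar point is close (distance 2).
bad = triplet_loss(anchor, positive=[3.0, 4.0], negative=[0.0, 2.0])

# Well arranged: the similar point is close (distance 1),
# the dissimilar point is far away (distance 10).
good = triplet_loss(anchor, positive=[0.0, 1.0], negative=[6.0, 8.0])

print(bad, good)
```

The badly arranged triplet gets a loss of 4 (5 - 2 + 1), which training would try to reduce by moving the points, while the well arranged one gets a loss of 0 and is left alone.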

t-SNE

Having a point in a 256-dimensional space is nice and all, but there's no way we can picture what such a high dimensional space might look like. Even 4D is something our mushy brains can't do. So if we're to be able to visualise our embedding space, we need to somehow show this space in a lower dimension (eg 2D or 3D), while still keeping the same meaning of the data.

There are several methods to do this, the main ones being PCA and t-SNE. I won't go into the details of how they work here, but PCA is good for keeping the global structure of the data in the visualisation (useful if you then want to use the projected data for more calculations). t-SNE doesn't keep that information, but does keep local information (clusters of points are meaningful, but the distance between clusters doesn't mean anything).

Both of these algorithms project high dimensional data into a lower number of dimensions defined by the user, which means we can actually visualise the data in a meaningful way. For an example, here's a figure from one of my research papers (https://arxiv.org/pdf/2103.09776.pdf), where I projected 256-dimensional embeddings of learnt artistic style of images, into 3D:

You can learn a lot more about everything I've said so far in that paper, if you're up for it.

Nearest neighbour search

The next concept is a simpler one - a method for use in searching through these embeddings. If you've ever wondered how someone can search with an image, and get images back, this post is for you! Behind the scenes, an image can get passed through a neural network, and one of these embeddings is inferred, mapping the image into a point in the high dimensional space. Given a larger database of other such points (corresponding to other images), it is easy to perform search.

We'll use a 2 dimensional space analogy of something like the map of Europe, for explaining nearest neighbour search. Given a query point in this space, eg London, and many other points in this space (eg every other capital city in Europe), we first compute a distance from the query point to every other point in this space. If you remember Pythagoras, this distance in xy space is just the length of the diagonal (the hypotenuse) between the two points. So once a value is assigned to the distance from the query point to every other point, it's trivial to just sort them by distance and return the k closest results (where k is whatever number of results you'd like to retrieve).

This is one algorithm you can use for this, and it scales up to any number of data dimensions, including 256 - so we can use this algorithm to perform search in our embedding space, given a new (or existing) data point as a query.
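Here's a minimal sketch of that brute-force search, using only the Python standard library. The points and their names are made-up toy coordinates on a flat 2D "map" (not real city locations), and `nearest_neighbours` is a name I've picked for illustration. The same function works unchanged for 256-dimensional points.

```python
import math


def nearest_neighbours(query, database, k):
    """Return the k (name, distance) pairs closest to `query`, nearest first.

    `database` maps a name to a point; every point must have the same
    number of dimensions as `query`.
    """
    scored = [(name, math.dist(query, point)) for name, point in database.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy database with made-up coordinates.
database = {
    "A": (3.0, 4.0),
    "B": (6.0, 8.0),
    "C": (0.0, 1.0),
    "D": (-5.0, 12.0),
}

query = (0.0, 0.0)
results = nearest_neighbours(query, database, k=2)
print([name for name, _ in results])  # ['C', 'A']
```

This computes a distance to every point in the database for every query, which is fine for a few thousand points. For very large databases, approximate search structures are usually used instead, but the idea stays the same.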

Three-Dog cluster

Ok, so moving on to some actual uses of this. I've been using images to talk about this so far, but I've of course been working with audio data. This works the exact same way - the modality is not important. I've first extracted some features from the audio, and performed representation learning using each speaker identity as a label for training the separation in this space.

Given a model that can embed audio data in this way, there are several things we can do with it. One thing we can use it for is separating mixed data in a dataset of audio where there are groups of different sounding audio. A very clear example of this is Three-Dog from Fallout 3. The NPC audio files contain both a normal, and a radio version of the NPC dialogue. In this particular case, it's easy to tell from the audio file names which is which, but this is not always the case - and there are typically hundreds of lines for a voice. So without an automatic way to distinguish between the different types, one would have to spend hours listening through each line, and manually sort the audio into different folders - something I've had to do several times so far...

Enter representation learning and search. Here's a t-SNE diagram of the embedding space of the Three-Dog audio data from Fallout 3. The model is able to decisively separate the normal and radio files. Knowing the labels for the audio data beforehand, I was able to annotate the data points with the correct colour.

k-means clustering and silhouette score

Another concept to get out of the way is clustering. Given some data, as visualised above, we need some way to figure out what the groups are. Remember, t-SNE diagrams are good for visualisations, but the distances between clusters don't actually mean anything - the diagram is just good for showing if there are indeed any clusters in the data. The actual clustering needs to be done differently, on the original embeddings.

A great clustering algorithm as old as time is k-means. This is an iterative unsupervised algorithm, meaning we just give it the data, and a little bit of time, and it'll give us a set of k centroids, where k is a number of clusters we pre-define, and "centroid" is the term for the middle point of a cluster (not necessarily an actual data point, just the "centre of gravity" of the cluster). It also gives us, for every point in the data, the centroid that point has been assigned to.
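For the curious, the core loop of k-means is short enough to sketch in plain Python. This is a simplified version for illustration, not the implementation I actually use: it starts from the first k points as centroids (real implementations pick smarter starting points), and the toy 2D data is made up.

```python
import math


def kmeans(points, k, max_iterations=100):
    """Very small k-means: returns (centroids, assignments).

    Simplification: the first k points are used as the starting centroids.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Step 1: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the algorithm has converged
        assignments = new_assignments
        # Step 2: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Two made-up, well separated groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Even though both starting centroids fall inside the first group here, the two steps pull them apart after a couple of iterations, and each centroid ends up at the centre of gravity of one group.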

I said that k is a pre-defined number of clusters you say you want from the data. This is good for the Three-Dog example, where I know beforehand that there are two main clusters of data (normal, and radio). But if you don't know how many groups of data there are, you have to use an additional step: run the clustering for each k from 2 up to some higher number, calculate a score for the quality of the clustering for each k, and plot a graph of this score against k. You then pick the k value with the best score, and use the centroids and assignments from that run. The scoring can be done using the silhouette score (you can look this up for the inner workings if you wish), and the whole process can be automated.
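As a sketch of how that scoring works, here is the standard silhouette score written in plain Python, assuming Euclidean distance. To keep the example self-contained, the cluster labels for k=2 and k=3 are hard-coded stand-ins for what a clustering run might return, and the 2D points are made up.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, in the range [-1, 1] (higher is better)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a point alone in its cluster scores 0 by convention
        # a: mean distance to the other points in the same cluster
        # (the distance from the point to itself adds 0 to the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        if denominator > 0:
            total += (b - a) / denominator
    return total / len(points)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Hypothetical assignments for two candidate values of k.
candidate_labels = {
    2: [0, 0, 0, 1, 1, 1],  # one cluster per visible group
    3: [0, 0, 2, 1, 1, 1],  # one group needlessly split in two
}

scores = {k: silhouette_score(points, labels) for k, labels in candidate_labels.items()}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

The k=2 labelling scores much higher than the k=3 one, because splitting a tight group puts points right next to a "different" cluster. In a real run you'd produce the labels for each k with k-means rather than typing them in.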

But anyway, all this aside, given the Three-Dog data, with embeddings extracted through a model trained for representation learning on extracted speech metadata, we can use the cluster assignments to pick all the points from one cluster, get the associated audio files, and copy them to a folder. Do the same for the audio files associated with the other cluster, and we end up with two folders, one with normal audio, and one with radio audio.
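The final copying step could look something like the sketch below. The function name, the `cluster_<id>` folder naming and the file names are all hypothetical, and the demo uses empty placeholder files in a temporary directory rather than real game audio.

```python
import shutil
import tempfile
from pathlib import Path


def sort_into_folders(file_paths, assignments, output_dir):
    """Copy each file into output_dir/cluster_<id>/ based on its cluster id."""
    output_dir = Path(output_dir)
    copied = []
    for path, cluster in zip(file_paths, assignments):
        target_dir = output_dir / f"cluster_{cluster}"
        target_dir.mkdir(parents=True, exist_ok=True)
        copied.append(Path(shutil.copy2(path, target_dir)))
    return copied


# Demo with empty stand-in files (hypothetical names, not real audio).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    source.mkdir()
    files = []
    for name in ["line_a.wav", "line_b.wav", "line_c.wav"]:
        path = source / name
        path.write_bytes(b"")
        files.append(path)

    # Pretend the clustering put lines a and c in cluster 0, line b in cluster 1.
    sort_into_folders(files, [0, 1, 0], Path(tmp) / "sorted")

    layout = {
        folder.name: sorted(f.name for f in folder.iterdir())
        for folder in sorted((Path(tmp) / "sorted").iterdir())
    }

print(layout)
# {'cluster_0': ['line_a.wav', 'line_c.wav'], 'cluster_1': ['line_b.wav']}
```

The cluster ids themselves carry no meaning, so you'd still listen to one file from each folder to find out which one is the radio version.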

Other uses

The main reason I started developing this model is the second iteration of the xVASpeech speech-to-speech model design. A problem with the initial design is that it's very speaker dependent, so I need a way to condition a model to perform well on any speaker. This has typically been done by explicitly defining an embedding corresponding to a speaker identity (eg a 100 dimensional one-hot vector, where all values are zero except for a single one at the index of the speaker, in a list of 100 speakers). This is fine when you have a pre-defined set of speakers, but as soon as you have more (eg 101), you have to re-train the model, because the embedding size has changed. So we need a fixed size embedding, regardless of the number of speakers.

Moreover, the hard assignment of speaker information to a label (such as one-hot) is not very useful, as you have to sound very similar to one of the speakers that the model was trained for, and you'd have to choose at run-time which of those 100+ speakers sounds like you. It would be better to avoid speaker identity in the input, and instead just give the model a summary of what the speaker sounds like, in a fixed-size embedding - even if there are differences in this embedding for within-speaker data at training time. That should allow a neural network to learn to adapt itself to an unlimited number of speakers, even ones it wasn't trained for (zero-shot learning), and you wouldn't need any arduous model pre-configuration.
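To illustrate the one-hot problem described above, here is a tiny sketch. The `one_hot` helper is just for illustration, and the 256 is the example embedding size from earlier in this post, not a confirmed detail of the final model.

```python
def one_hot(speaker_index, num_speakers):
    """All zeros, except a single 1.0 at the speaker's index."""
    vector = [0.0] * num_speakers
    vector[speaker_index] = 1.0
    return vector


# With 100 known speakers, the model's speaker input has 100 numbers.
print(len(one_hot(3, 100)))  # 100

# Add one more speaker and the input size changes, so the model
# would need re-training to accept it.
print(len(one_hot(3, 101)))  # 101

# A learnt speaker embedding keeps the same size however many
# speakers exist (assumed size, for illustration only).
EMBEDDING_SIZE = 256
```

The one-hot vector also says nothing about what a speaker sounds like: speaker 3 and speaker 4 are exactly as "far apart" as any other pair, which is the other half of the problem.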

There are plenty of other things that can be done with embeddings, but I'll close off with the following visualisation I've made: a 3D t-SNE visualisation of the (work in progress) representations for the "This is what my voice sounds like" preview audio files from the app, labelled by speaker identity.