xvasynth

Introducing: xVAScribe

Added 2021-06-29 12:08:27 +0000 UTC

xVAScribe is the first companion app for xVASynth. It is a small tool for creating and improving the speech datasets used in model training.

There are two primary ways you can use the app:

1) For adjusting existing datasets, by changing the text transcript associated with audio data

2) For creating a brand new dataset of your own voice, via the microphone

To start, download the latest release (v1.0, linked at the bottom) and double-click the xVAScribe.exe file.

You can initialize a dataset in three ways:

1) Completely fresh, by clicking the + button at the top right

2) From existing transcript data only, by dragging and dropping a .csv or .txt file into the drop zone

3) From existing audio and/or text data, by creating a folder inside the "datasets" folder and placing in it a "wavs" folder with the audio files, plus a "metadata.csv" file if you have a transcript, following the usual speech dataset formatting. To see what this format looks like, create a couple of dummy lines in a fresh dataset first and look at the files the app writes.
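
As a rough illustration of option 3, here is a minimal Python sketch that creates such a folder. The post does not spell out the metadata.csv layout, so the one-row-per-line "file_id|transcript" format used here is an assumption (it is the LJSpeech-style convention); the function name and the example dataset name are hypothetical too. Check the result against a dummy dataset created in the app before relying on it.

```python
from pathlib import Path


def make_dataset_skeleton(datasets_dir, name, lines=None):
    """Create <datasets_dir>/<name>/wavs and, if lines are given, a metadata.csv.

    `lines` is an optional list of (file_id, transcript) pairs.
    """
    root = Path(datasets_dir) / name
    # The audio files go in a "wavs" sub-folder of the dataset folder.
    (root / "wavs").mkdir(parents=True, exist_ok=True)
    if lines:
        # ASSUMPTION: pipe-separated "file_id|transcript" rows (LJSpeech-style).
        # The exact format xVAScribe expects is not stated in the post.
        rows = [f"{file_id}|{text}" for file_id, text in lines]
        (root / "metadata.csv").write_text("\n".join(rows) + "\n", encoding="utf-8")
    return root
```

For example, make_dataset_skeleton("datasets", "my_voice", [("000001", "Hello there.")]) would create datasets/my_voice/wavs and a one-line metadata.csv next to it.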

The .txt file must contain one transcript line per line, separated by \n newlines.

Recording a new dataset via the microphone

You can record new audio for lines in your dataset, be it new lines you'll enter through the app, or existing lines in an imported dataset transcript (recommended).

You can select which microphone to use in the settings menu (accessible via the settings cog at the top right). The same menu offers automatic background noise removal with an adjustable strength, which requires first recording some noisy silence. Experiment with turning it on and off, and with the strength, to see what works best for your setup.

The app names the audio files for new lines with incrementing, zero-padded numbers.
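
To show what such a naming scheme can look like, here is a small Python sketch. It is not the app's actual code: the padding width of 6, the starting number of 1, and the function name are all assumptions made for illustration.

```python
def next_file_name(existing_names, width=6, ext=".wav"):
    """Return the next zero-padded numeric file name after the existing ones."""
    numbers = []
    for name in existing_names:
        stem = name[:-len(ext)] if name.endswith(ext) else name
        # Only purely numeric names take part in the numbering.
        if stem.isdecimal():
            numbers.append(int(stem))
    # ASSUMPTION: numbering starts at 1 when no numbered files exist yet.
    next_number = max(numbers, default=0) + 1
    return f"{next_number:0{width}d}{ext}"
```

Zero padding keeps the files in recording order when they are sorted by name, which plain numbers would not (10.wav would sort before 2.wav).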

I recommend taking a transcript from an existing speech dataset, such as M-AILABS or LJSpeech, to avoid having to write transcript content of your own. But feel free to add your own lines if there is some unique vocabulary you'd like the model to be good at.
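
If you go the LJSpeech route, a sketch like the following could turn its metadata.csv into a newline-separated .txt file suitable for the drop zone. LJSpeech rows are pipe-separated as "ID|transcription|normalized transcription"; using the last (normalized) field is a choice made here, and the function name is hypothetical. Other datasets may lay out their transcripts differently.

```python
from pathlib import Path


def metadata_to_txt(metadata_path, txt_path):
    """Write one transcript per line to txt_path; return the number of lines."""
    lines = []
    for row in Path(metadata_path).read_text(encoding="utf-8").splitlines():
        # Split on "|" by hand: the transcripts contain quote characters
        # that a generic CSV parser could misinterpret.
        fields = row.split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        # Prefer the last field (normalized text), fall back to the second.
        text = fields[-1].strip() or fields[1].strip()
        if text:
            lines.append(text)
    Path(txt_path).write_text("\n".join(lines), encoding="utf-8")
    return len(lines)
```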

Adjusting/writing new transcripts for existing audio data

All buttons are mapped to keyboard shortcuts so you can keep up a good pace while working through a dataset. To speed things up further, the app has a built-in automatic speech recognition model that can optionally produce a noisy first-pass transcript for all of the audio. Correct its output wherever it gets things wrong. The model needs the audio to be mono and 22050 Hz.
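
Before running the speech recognition pass, it may help to check that your files meet the mono, 22050 Hz requirement. The sketch below uses Python's standard wave module, which only reads uncompressed PCM .wav files; the function name is hypothetical, and the script only reports problems, it does not convert anything.

```python
import wave


def check_wav(path, expected_rate=22050, expected_channels=1):
    """Return a list of problems found in a PCM .wav file (empty if none)."""
    with wave.open(str(path), "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    problems = []
    if channels != expected_channels:
        problems.append(f"{channels} channels, expected {expected_channels}")
    if rate != expected_rate:
        problems.append(f"{rate} Hz, expected {expected_rate} Hz")
    return problems
```

Any file that comes back with a non-empty list would need to be downmixed or resampled in an audio tool of your choice first.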

Check out the new channels on the Discord server to chat about support, suggestions, and future plans for this.

Download v1.0 here: https://drive.google.com/file/d/1ObNG74xLYU-mwEFpmdDC9MP-7ePUPOZN/view?usp=sharing