xvasynth

v1.4.0 pre-release: Speech-to-Speech; Plugins; Much more...

Added 2021-05-31 14:02:19 +0000 UTC

A long time coming, the next version of the app is nearing release. This is by far the largest single update to the app (except for v0.1 -> v1.0, of course). There are a couple of big new features, but let's get the changelog out of the way first.

Changelog:

- Added a speech-to-speech mode
- Added a third party Plugin system. Developer reference: https://github.com/DanRuta/xVA-Synth/wiki/Plugins
- Added numerical input field for pacing
- Added internationalization support for the app (other languages for the app UI now supported, via the plugins system)
- Added a settings option to disable .json output
- Added an option to name files numerically sequentially (filename_0001.wav, etc)
- Added a button on records to edit the audio files with a configurable external program (eg Audacity)
- Added a setting to keep the same editor state while switching voice (transfer speaking style from voice A to voice B)
- Added an alt key modifier to the editor for selecting a whole word at once
- Added hover tooltips to the records' utility buttons
- Added the community guide link to the info menu
- Stopped clearing editor selection on letter mis-click
- Improved app UI on first-time use before anything is configured
- Fixed UI bug where the chrome was flickering when changing voice models
- Removed final full-stops from the end of the file name when saving
- Made the editor letters more readable via better CSS
- Added setting for the fade opacity on the background image
- Made UI modal switching snappier
- Various other small UI tweaks and fixes
- Added user setting for font size in the text input
- Added game series support for: Dragon Age, Command and Conquer, Star Trek: Bridge Commander
- Fixed amplitude multiplier input not updating enabled/disabled on ffmpeg check
- Fixed issue where attempting to generate empty input broke future gens
- Shortened file names to alleviate Windows max file name issues
- Fixed the window looking squished at small window sizes
- Re-disabled editor controls when unselecting letters
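
As a rough illustration of the three file-naming changes above (sequential numbering, trailing full-stop removal, and shortening), here is a minimal Python sketch. It is not the app's actual code: the function names, the 60-character limit, and the "output" fallback are all assumptions made for the example.

```python
import re

# Assumed limit for illustration only; the app's real value is not stated.
MAX_STEM_LENGTH = 60


def clean_stem(text, max_length=MAX_STEM_LENGTH):
    """Build a file-name stem from the prompt text (hypothetical helper)."""
    # Drop characters that Windows does not allow in file names.
    stem = re.sub(r'[<>:"/\\|?*]', "", text).strip()
    # Shorten first, then strip trailing full stops and spaces, so the
    # stem never ends in "." even after truncation.
    stem = stem[:max_length].rstrip(". ")
    # Fall back to a placeholder if nothing usable is left.
    return stem or "output"


def numbered_name(stem, index, ext="wav"):
    """Return a sequential name such as filename_0001.wav."""
    return f"{stem}_{index:04d}.{ext}"


print(clean_stem("I'm so tired..."))
print(numbered_name("filename", 1))
```

Truncating before stripping matters here: cutting a long line in the middle of an ellipsis would otherwise leave a full stop at the end of the name again.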

Speech to Speech

So far, the app has supported a single method of speech synthesis: text-to-speech. This is where someone enters some text into the prompt, and the app generates the speech data from it. The FastPitch models that the app uses first predict pitch and duration values, which then get converted into speech data (and then turned into audio by one of the vocoder models).

This next update is a first step towards an alternative way to generate speech data. Rather than entering some text in a prompt, you click the new microphone icon and speak into the microphone (or drag+drop an audio file onto the icon).

FastPitch is great because of the aforementioned approach of first generating intermediate values for pitch and duration, for each of the letters. These values are what enable the editor in the app. However, Speech-to-Speech (S2S) research is nowhere near as well developed as Text-to-Speech (TTS) research. No existing S2S models allow the same degree of control that the TTS model has.

So, I had to make one. I've spent the last couple of months actively researching this, running experiments, and so on, to create a brand new neural network that can read in audio (from a microphone or from an audio file) and predict from it the text, its pitch, and its duration values. These values can then be injected into a FastPitch model downstream, thereby performing voice cloning (speech-to-speech). But the great thing with it is that you still have the same degree of control over every letter's pitch and duration values in the editor, for making adjustments as before.
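
To make the idea of per-letter values a bit more concrete, here is a small Python sketch of what "text plus one pitch and one duration per letter" could look like as data, and how an editor-style adjustment over a selection might work. This is purely illustrative: the `LetterValues` class, the function names, and the units are assumptions, not xVASpeech or FastPitch code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LetterValues:
    """One editable letter (hypothetical structure, units are arbitrary)."""
    letter: str
    pitch: float
    duration: float


def build_editor_state(text, pitches, durations):
    """Pair each predicted letter with its predicted pitch and duration."""
    if not (len(text) == len(pitches) == len(durations)):
        raise ValueError("need exactly one pitch and one duration per letter")
    return [LetterValues(c, p, d) for c, p, d in zip(text, pitches, durations)]


def adjust(state, start, end, pitch_delta=0.0, duration_scale=1.0):
    """Return a new state with letters in [start, end) adjusted."""
    return [
        replace(v, pitch=v.pitch + pitch_delta, duration=v.duration * duration_scale)
        if start <= i < end
        else v
        for i, v in enumerate(state)
    ]


# Values as they might come out of a prediction step (made-up numbers).
state = build_editor_state("hi", [1.0, 2.0], [0.1, 0.2])
# Raise the pitch of the first letter only, as a manual tweak in the editor.
edited = adjust(state, 0, 1, pitch_delta=0.5)
```

The point of the sketch is that the downstream model only sees the (possibly edited) per-letter values, so it does not matter whether they came from text or from recorded audio.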

I've decided to call this model xVASpeech, and the minimal inference code for this is now present in the pre-release. It took a very long time to get to a point where the architecture of the model has been finalized, but now that it's ready, I can get the next app version underway, while I finish off developing the actual models for this feature.

As a quick insight into the progress, at the moment, the predicted pitch and duration values are perfect by design. The main problem left for me to fix is the TEXT that comes out. Where the input voice is the same as the xVASpeech voice (eg loading an audio line, unseen during training, for an xVASpeech model trained on Nate data), the text mostly comes out pretty good. But the more different a voice is from the trained model, the worse the text that comes out. If anyone has any experience with NLP or automatic speech recognition (through deep learning) and they wish to help, do let me know!

---

I will make another showcase video showing how it's used, when it's ready. The way it works currently is that I train a separate xVASpeech model for each of the more "normal" voices from the app (eg Nate, Nora, FemaleShepard, etc). These will just be another model file per voice, like with the HiFi vocoder. When setting up the feature on your end, you need to pick from the available models whichever voice sounds most like you. Your voice needs to be very similar to the voice from the model, otherwise it won't work well. If you've given me data to train up voice models with your own voice, you will have your own xVASpeech model to use, which is what you need to select (can't get more similar than that).

(note: those models are test data)

Once the xVASpeech model has been selected, you can pick your target voice from any of the FastPitch models in the voices panel on the left, and click the microphone icon to record your voice and clone it. You can alternatively drag+drop an audio file onto the microphone icon, if you prefer to record a file outside the app.

There are a few settings built into the app to hopefully reduce the impact of voice differences upon the quality of the generated output, like pre-adjusting the average pitch of your voice to match the voice of the xVASpeech model, background noise removal, and such (with more maybe coming later).
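
To give a feel for what pre-adjusting the average pitch can mean, here is a minimal Python sketch that rescales a pitch contour so its average matches a target average. This is one simple way to do it, not necessarily what the app does: the function name, the Hz units, and the convention that unvoiced frames are stored as 0 are all assumptions for the example.

```python
def match_average_pitch(pitch_hz, target_mean_hz):
    """Scale a pitch contour so its voiced average equals target_mean_hz.

    Assumes unvoiced frames are stored as 0 and should stay untouched.
    """
    voiced = [p for p in pitch_hz if p > 0]
    if not voiced:
        # Nothing to adjust (silence or fully unvoiced input).
        return list(pitch_hz)
    ratio = target_mean_hz / (sum(voiced) / len(voiced))
    # Multiplying (rather than adding an offset) keeps the relative
    # ups and downs of the original line intact.
    return [p * ratio if p > 0 else 0.0 for p in pitch_hz]


# Made-up numbers: a speaker averaging 150 Hz moved to a 300 Hz average.
shifted = match_average_pitch([100.0, 0.0, 200.0], 300.0)
```

With this approach the shape of the intonation stays the same; only the overall register moves towards that of the model's voice.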

The focus for now remains on squeezing as much quality out of the training process design as I can, and then I will publish some models here. Just keep in mind that this is not yet ready for use, as it's still very much a work in progress.

To get an idea of what this whole thing will do, listen to this voice sample from Nate:

https://voca.ro/167KUAobGxRX

Plugging this into the Speech-to-Speech feature, with the aim of converting it to Cait's voice, with the SAME timing and pitch variations from Nate's line, we get the following:

https://voca.ro/1cqGRR6l3Ew0

Just Cait, for reference: https://voca.ro/157CnKdS6ywh

The predicted text for Cait here wasn't perfect, as you can tell ("tied", not "tired"), but this can be adjusted in the editor.

Third-party Plugins system + App internationalization

There was a previous pre-release just for this feature a while ago, though since then there have been a couple of small quality of life additions. Namely, the python boilerplate code is now basically non-existent, and new teardown functions have been added to allow a plugin to run some code when it's being uninstalled (for any cleaning up, or reverting any changes).

This handily brings me to the next change, app internationalization. All text from within the app has now been moved to a "strings" file, which anyone could very easily change via a plugin, to change the language of the app. I've included such an example language plugin, for Romanian (as I know the Romanian language). If you decide to explore this, place the contents of the .zip file into the "plugins" folder.
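
The general idea of a language plugin overriding a "strings" file can be sketched in a few lines of Python. This is only an illustration of the concept: the keys, the function name, and the merge rules below are made up for the example, and the real file format and plugin hooks are described in the plugins wiki.

```python
# Hypothetical default strings; the app's real keys are not shown here.
DEFAULT_STRINGS = {
    "GENERATE": "Generate",
    "SETTINGS": "Settings",
}


def apply_language(defaults, overrides):
    """Return the UI strings with a language plugin's translations applied.

    Keys missing from the plugin fall back to the default text, and keys
    the app does not know about are ignored.
    """
    merged = dict(defaults)
    merged.update({k: v for k, v in overrides.items() if k in defaults})
    return merged


# A partial Romanian translation, as a language plugin might supply it.
romanian = {"SETTINGS": "Setări"}
ui_strings = apply_language(DEFAULT_STRINGS, romanian)
```

Falling back to the default text means a translation does not need to be complete to be usable, and it keeps working when new strings are added to the app later.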

---

I think there's good potential for app customization via the plugin system, for various features that would be good to have, but that would be too game-specific, or too niche, to include in the base app. One example plugin that I'll make (unless someone beats me to it - and saves me the work 😂 ) is a plugin to automatically generate .lip and .fuz files for Skyrim/Fallout 4.

Check out the documentation here, if you're interested in creating a plugin for the app: https://github.com/DanRuta/xVA-Synth/wiki/Plugins . I think it's quite easy to follow (though I am biased), and there's minimal setup work to integrate code into the app. I will of course be available on Discord or wherever you prefer, to chat about this, and lend a hand if I can help.

Other changes

There are quite a few other changes, including a community guide for collecting tips/advice for how to get the best quality out of using the app over here: https://github.com/DanRuta/xvasynth-community-guide . This is linked to from within the info tab in the app.

You can download the v1.4.0 pre-release from here: https://drive.google.com/file/d/1M7Dk_KdEZpqL502aB-9ecLNImKBX0vfd/view?usp=sharing

Do let me know if there are any issues, before I update the Nexus.

Short term plan

I will have to take a few days off before the next poll, mainly because I'll be super busy using my PC to train xVASpeech models, and I also have a spike in requests to attend to. Things will be back to normal by the end of this week, hopefully with some xVASpeech models ready to play with!