xvasynth

v2.0 pre-release; Success with v2 FastPitch models

Added 2021-09-30 11:01:37 +0000 UTC

Here we go - v2.0 is here! And after a long time trying, so is the new training script for the v2.0 models, using the new and improved FastPitch architecture!

v2.0 is a huge update, with lots of things I've been wanting to add for quite some time, so I'm super excited to get this out.

I'll say a bit about the main features (and a new showcase video will be out when the app is out of the beta pre-release), but let's get the changelog out of the way:

Changelog:

Main app:
- Merged CPU and CPU+GPU - single installation for both, going forward
- Added support for FastPitch 1.1 models
- Added energy as another control vector for the sequence editor - affecting per-letter loudness
- Added Nexus integration, for downloading/endorsing voices from any nexus page
- Added 3D voice embeddings visualizer, for voice similarity search
- Added app support for ARPAbet phoneme based input/editing, to allow inputting exact pronunciation
- Added an ARPAbet dictionaries management menu, for automated explicit pronunciation insertion during synth
- Pitch/duration (and now energy) editor v2.0, re-implemented with a canvas rather than intricate, limited, and hacky css+html
- Audio player v2.0, now showing the audio waveform above the pitch/duration editor
- WaveGlow vocoder models are now optional
- Made WaveGlow vocoders' paths configurable
- Added setting+ability to play back only changed audio, on regenerations
- App start-up is now much faster; Huge app backend re-write/re-organization
- Added a tip-of-the-day menu to the app start-up, with plenty of tips for app usage
- Added a configurable, toggleable error sound for errors, and set the task icon to red on error
- Bundled ffmpeg installation into the app. Set "on" by default.
- Added option to pre-apply ffmpeg to the preview sample
- Improved front-end startup loading scalability, making startup much faster. There were too many voices to load at once!
- Re-implemented game asset file handling to use .json metadata files rather than filename metadata

Batch:
- Major batch mode speed gains from several tweaks, changes, and optimizations
- Added pagination to the batch screen records, to get rid of UI lag
- Added timing and ETA info display to batch mode
- Stability changes for batch mode, especially fixing Fast mode
- Fixed batch mode multi-processing not working correctly
- Added setting for the CSV delimiter used in batch mode preprocessing
- Added GPU VRAM usage meter to the batch menu
- Added scalability settings toggles for batch importing grouping tasks (for 100k+ line files)
- Added yellow paused colour to task icon when batch mode is paused
- Removed server.log logging from batch mode, apart from errors
- Fixes and improvements for batch edit action button

Plugins:
- Added Nexus endorsements integration to plugins with a nexus link
- Added .ini settings file registering to give plugins an easy way to have custom settings in the settings menu
- Fixed plugins not being detected in some cases
- Added a pluginsContext parameter to server calls, for custom front-end data access from back-end plugins
- Added mp-output-audio plugins event for python code
- Added plugins event for mid-synth to allow custom python control over the generated pitch/durs
- Added batch-stop front-end event for plugins
- Added error message to plugins init of JavaScript scripts, with error stack
- Hoisted front-end functions to window scope for easy use/replacement in front-end plugins

Other:
- Added voice info hover tooltip next to voice titles
- Added Discord status updating support, toggleable via settings
- Added a search bar and sort buttons for the main page output voice records
- Added windows filepicker buttons for all configurable file paths and directories
- Added Ctrl+Shift+I console to production build, to aid user support debugging
- Added support for other/expanded alphabets, starting with "english_french_hepburn"
- Added game support for: Overwatch, Team Fortress 2, Nier: Automata, The Outer Worlds, Humankind, Hyperdimension, Persona, Portal
- Added search bar for settings in the settings menu
- Added a 'Delete All' button to delete all the files generated for a voice
- Fixed first-time launch issue where output directory paths were not being correctly initialized
- Fixed first-time launch issue where models were not found and the app had to be restarted
- Cleaned up spammy error logging for when my server can't be reached for changelog/credits updates
- Moved "Auto regenerate" to the settings menu
- Added a regex whitelist filter for text input, to stop unusual characters from breaking the app
- Improved keyboard navigation for rapid Ctrl+S usage
- Added an option to also display the ffmpeg amplitude multiplier setting in the editor bar
- Optimized asset images, reducing total file size by over 50% (>12MB)
- Upgraded internal pytorch version from 1.7.0+cu10 to 1.9+cu111
- Added a /getAvailableVoices API endpoint for headless operations (this still needs the front-end to be running)
- Fixed some issues with automatically numbered output filenames
- Misc bug fixes
- Misc UI fixes and improvements

The first, most obvious change is that there are no longer two versions of the app (CPU-only and CPU+GPU). There is now just one single, unified app, and you can switch between the two modes in the settings.

The next most visually obvious change is the new editor and the new audio player, which together should greatly improve the ease of use of the app, and hopefully make things even more intuitive (e.g. dragging a letter's element sideways to change its length, rather than clicking a letter and then moving a slider elsewhere).

Also part of the new editor, and something that depends on the new models, is Energy - a new control vector, just like pitch and duration. This affects how loudly a letter is spoken (with how much energy), and is editable via the circles.

Another one of the big new features (and one I'm super excited for) is ARPAbet notation. This lets you write your text like normal, but you can also specify the exact pronunciation you want between { } brackets, rather than having to fiddle around with spelling and such (check the input text in the previous screenshot). You can find a reference of all the "pronunciation" notation in the new ARPAbet menu. You can also manage dictionaries here, where you can define words that you want the app to automatically convert to the correct pronunciation. I've included CMUdict with the app, a dictionary which already has 135k common words and their pronunciations (in American English!), and a dictionary for xVASynth terminology. You can also add new dictionaries from anyone - this will be super useful for any games with lots of unique words/names. ARPAbet support is another thing which is supported only by the v2.0 models.
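To make the dictionary idea a bit more concrete, here's a minimal Python sketch of automatic pronunciation insertion. It is not the app's actual code: the function name, the example dictionary entries (including the phoneme spellings), and the space-separated phoneme format inside the { } brackets are all assumptions made for illustration.

```python
import re

# Hypothetical mini-dictionary: word -> ARPAbet phonemes.
# These entries are illustrative guesses, not copied from CMUdict or any shipped dictionary.
EXAMPLE_DICT = {
    "skyrim": "S K AY1 R IH0 M",
    "dovahkiin": "D OW1 V AH0 K IY2 N",
}

# The capturing group makes re.split keep the {...} blocks in the result,
# so hand-written pronunciations always land at odd indices.
BRACE_BLOCK = re.compile(r"(\{[^{}]*\})")
WORD = re.compile(r"[A-Za-z']+")


def apply_dictionary(text, dictionary):
    """Wrap known words in {ARPAbet} blocks, leaving existing {...} blocks untouched."""

    def replace_word(match):
        word = match.group(0)
        phonemes = dictionary.get(word.lower())
        return "{" + phonemes + "}" if phonemes else word

    parts = BRACE_BLOCK.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Already explicit ARPAbet typed by the user: keep as-is.
            out.append(part)
        else:
            # Plain text: look each word up in the dictionary.
            out.append(WORD.sub(replace_word, part))
    return "".join(out)


print(apply_dictionary("Welcome to Skyrim, {D OW1 V AH0 K IY2 N}.", EXAMPLE_DICT))
```

The lookup is case-insensitive, and anything already written between { } brackets is passed through unchanged, so manual pronunciations always win over dictionary entries in this sketch.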

xVADict is a community project where I've collected all unique names/words from game series, ready for assigning ARPAbet notation to. There are about 4k words for "Elder Scrolls" and "Fallout", so this will have to be a community effort as it's too much for any one person to do alone. If you're up for it, I've started a new channel on Discord called "xVADict Project" with some instructions, if you'd like to contribute.

Another big new feature (and why I stopped doing early access voices on Patreon) is Nexus integration. If you click the "Get more voices" button at the bottom left, it will bring up a menu where (if you're logged in) you can check for all the downloadable voices uploaded to the Nexus, by myself or anyone else. On that note, to add Nexus pages to the list the app checks, go to the "Manage repos" sub-menu, where you can search the Nexus for anyone's page and add it to the list.

Next, if you really like the sound of a voice, and you want other similar voices, but there are too many to check, you can now use the new "Voice embeddings visualiser" menu, where an interactive 3D plot of all the voices is shown. Similar voices will be close together, and dissimilar ones will be far away from each other.
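As a rough illustration of what "close together" means, here's a small Python sketch that ranks voices by straight-line distance between their 3D points. This is not how the app computes or projects its embeddings; the voice names, the coordinates, and the `nearest_voices` helper are all made up for the example.

```python
import math

# Made-up 3D coordinates for made-up voice names (assumption, for illustration only).
voices = {
    "voice_a": [0.9, 0.1, 0.0],
    "voice_b": [0.8, 0.2, 0.1],
    "voice_c": [0.0, 0.1, 0.9],
}


def nearest_voices(query_name, embeddings, top_k=3):
    """Return up to top_k (name, distance) pairs, closest first, excluding the query voice."""
    query = embeddings[query_name]
    distances = [
        (name, math.dist(query, point))  # Euclidean distance (Python 3.8+)
        for name, point in embeddings.items()
        if name != query_name
    ]
    distances.sort(key=lambda item: item[1])
    return distances[:top_k]


for name, distance in nearest_voices("voice_a", voices, top_k=2):
    print(name, round(distance, 3))
```

In this toy data, voice_b sits right next to voice_a, while voice_c is far away, which is the same kind of relationship the 3D plot lets you see at a glance.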

Finally, a feature that has long been in research and development is here (or at least getting close)! Speech-to-speech v1.0 is now finished and implemented in the app. Again, this feature depends on the new v2.0 models, but only for the INPUT model. The way this works is as follows:

You have an INPUT speech model and an OUTPUT speech model. The output speech model is any one of the regular models used so far, from the panel on the left. The input model is also one of those models, but this one HAS to be one of the new v2.0 models. The plan is for me to go back and re-train all 400+ models with the new architecture, so eventually any voice will be usable as the input model. Any v2.0 input model works, really, but for the best quality, pick the one that best matches the input voice - be it recorded via a microphone, or an existing audio file dragged and dropped over the mic icon. You can also optionally pre-specify the text, to make sure it is perfect - otherwise (if the text input field is blank), a new state-of-the-art speech-to-text model is first used to infer the input text.

---

Those are the main new features, but check the changelog for everything else, eg WaveGlow models now being optional, batch mode v2.0, all the new plugins stuff, etc. I've got one more big exciting announcement about the eventual (October) v2.0 public release, but that's still being worked on - I'll post an update in the following days if it works out ok.

The new training scripts for the v2.0 models are now functional, and whirring away at top speed! Things are a bit slow now, at the start, like they were when I was starting out with the original FastPitch, but they will get faster as I continue optimizing the scripts. Although it's not yet finished, I have included the first of these models, so that you have something to use, when playing around with the v2.0-model exclusive features. The voice updates will soon start coming!

The full release for the app will come soon. For now, although I've tested the app a lot through use during the long development period, there are likely still a couple of bugs here and there. I would be super thankful for any reports on issues, especially around the new additions!

v2.0 pre-release: https://drive.google.com/file/d/1iIvDW9r4V6Wy3pZnCYlvk8rDhaxbmA-g/view?usp=sharing

The first v2.0 voice (Skyrim: FemaleEvenToned) - again, not fully finished, but usable: <attached to this post>

A second non-game voice that I trained on LJSpeech (named fp1_1_test): <also attached>