Transcript no longer required
Added 2021-03-24 09:38:09 +0000 UTC
As you may have seen from previous posts, and messages on Discord, I've been experimenting in the background with ways to get around unavailable transcripts for voices. This was spurred on by the Cyberpunk requests, for which the transcript is not (yet) available.
Given just the audio files, I've been working on automating the transcription process. At the core of the method is another AI model doing speech recognition. I simply run the audio lines through such a model and get an initial text line for each audio file. The lines aren't perfect, but the text mistakes would still sound similar when spoken aloud, so this doesn't actually matter.
However, this introduces a bit of an issue, in that currently (to my knowledge) automatic speech recognition models can't do punctuation. So this process by itself doesn't add the full stops, commas, question marks and such that are needed for faithful reconstruction in the speech synthesis part. Without them, the synthesized speech can sometimes have weird, fluctuating pacing.
Thankfully, AI research has many fields. Conveniently, another field is "punctuation restoration". This line of research does exactly what you would think - given a line of text without punctuation, it does its best to estimate some sensible punctuation. I've bolted this on, and most punctuation now also works.
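To make the two stages concrete, here is a minimal sketch of how they could be chained together. This is not my actual tooling: the function names (`transcribe_lines`, `fake_recognize`, `fake_punctuate`) are made up for illustration, and the two stand-in stages would need to be replaced by wrappers around whichever speech recognition and punctuation restoration models you choose.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Both stages are passed in as plain callables so any model can be plugged in.
Recognizer = Callable[[Path], str]  # audio file -> raw, unpunctuated text
Punctuator = Callable[[str], str]   # unpunctuated text -> punctuated text


def transcribe_lines(
    audio_files: Iterable[Path],
    recognize: Recognizer,
    punctuate: Punctuator,
) -> Dict[str, str]:
    """Return a mapping of audio file name -> estimated transcript line."""
    transcript = {}
    for audio_path in sorted(audio_files):
        # Stage 1: speech recognition gives text without punctuation.
        raw_text = recognize(audio_path).strip().lower()
        if not raw_text:
            # Skip lines where the recognizer produced nothing usable.
            continue
        # Stage 2: punctuation restoration estimates sensible punctuation.
        transcript[audio_path.name] = punctuate(raw_text)
    return transcript


# Stand-in stages for illustration only (assumptions, not real models).
def fake_recognize(audio_path: Path) -> str:
    canned = {"line_001.wav": "HELLO THERE HOW ARE YOU", "line_002.wav": ""}
    return canned.get(audio_path.name, "")


def fake_punctuate(text: str) -> str:
    return text.capitalize() + "."
```

Calling `transcribe_lines([Path("line_001.wav"), Path("line_002.wav")], fake_recognize, fake_punctuate)` with the stand-ins returns one entry for `line_001.wav` and skips the empty line. Keeping the stages as injected callables means either model can be swapped out without touching the loop.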
This is still not perfect of course, but after messing with this for a little while now, I've added an additional post-processing script to automatically correct quite a few common errors.
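As a rough idea of what such a post-processing pass can look like, here is a small sketch using regular expressions. The specific rules below are my illustrative assumptions about typical errors (stray spaces, doubled punctuation, missing capitals, missing final full stop), not the exact rules in the real script, and `clean_line` is a hypothetical name.

```python
import re


def clean_line(text: str) -> str:
    """Apply a few simple, assumed cleanup rules to one transcript line."""
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove spaces that ended up before a punctuation mark.
    text = re.sub(r"\s+([.,!?])", r"\1", text)
    # Collapse repeated identical marks ("??" -> "?"). Note that this
    # also flattens an ellipsis "..." into a single full stop.
    text = re.sub(r"([.,!?])\1+", r"\1", text)
    # Make sure a letter directly after punctuation gets a space first.
    text = re.sub(r"([.,!?])([A-Za-z])", r"\1 \2", text)
    # Capitalise the first letter of the line and of each new sentence.
    text = re.sub(
        r"(^|[.!?]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    # End the line with a full stop if it has no closing punctuation,
    # dropping a dangling comma first.
    if text and text[-1] not in ".!?":
        text = text.rstrip(", ") + "."
    return text
```

Rules like these are cheap to add one at a time as new recurring mistakes show up, which is why a simple ordered list of substitutions is often enough for this kind of cleanup.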
--
So this now means that going forward, transcripts are no longer a requirement for new voices. This is especially good for games that don't have transcript extraction capabilities, such as Cyberpunk or Fallout 76. HOWEVER, where possible I would still heavily encourage the use of a transcript, because the processes described above are imperfect. The automatic approach works, but a transcript still gives the best quality, though the difference is rapidly diminishing and is almost no longer noticeable.
Some of the (requested) voices trained with this automatic transcription process are:
- Cyberpunk 2077: V (Male)
- Cyberpunk 2077: V (Female)
- Civilization IV: Narrator
- Fallout 76: MODUS
- 1 private voice request
- 1 upcoming surprise voice