Authors: Matthew Baas, Benjamin van Niekerk, Herman Kamper
Arxiv: https://arxiv.org/abs/2305.18975
Abstract: Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity - making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods.
Code and pretrained models are available here.
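The matching step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `knn_convert`, the use of cosine similarity, and the option to average `k` neighbors are assumptions made here (the abstract only says each source frame is replaced by its nearest neighbor in the reference). Feature extraction and vocoding are left out, so the inputs are plain arrays of frame-level features with no all-zero rows.

```python
import numpy as np


def knn_convert(source, reference, k=4):
    """Replace each source frame with the mean of its k nearest reference frames.

    source:    (n_source_frames, dim) array of features
    reference: (n_reference_frames, dim) array of features (the matching set)
    Hypothetical helper; cosine similarity and k-averaging are assumptions.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    src = source / np.linalg.norm(source, axis=1, keepdims=True)
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    similarity = src @ ref.T  # (n_source_frames, n_reference_frames)

    # Indices of the k most similar reference frames for every source frame.
    k = min(k, reference.shape[0])
    nearest = np.argpartition(-similarity, k - 1, axis=1)[:, :k]

    # Average the selected reference frames (k=1 is plain nearest neighbor).
    return reference[nearest].mean(axis=1)
```

With `k=1` every output frame is copied directly from the reference, so the converted sequence consists only of frames spoken by the target speaker; a pretrained vocoder would then turn that sequence into audio.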
In this section, we apply kNN-VC to unseen source and target speakers from LibriSpeech.
source \ target | 2300 | 237 | 260 | 1284 | 1089 | 4507 |
---|---|---|---|---|---|---|
In this section, we compare kNN-VC against VQMIVC, YourTTS, and FreeVC.
source | target | kNN-VC | VQMIVC | YourTTS | FreeVC |
---|---|---|---|---|---|
In this section, we investigate the effect of target data size and prematched training. We convert to a single target speaker for all examples, varying the amount of data for the matching set.
target: |
---|
prematched | source | 5 secs | 10 secs | 30 secs | 1 min | 5 mins | 8 mins |
---|---|---|---|---|---|---|---|
✗ | |||||||
✗ | |||||||
✓ | |||||||
✓ |
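Varying the amount of target data, as in the columns above, amounts to truncating the matching set before the nearest-neighbor search. The sketch below shows one way to do that, assuming features arrive at 50 frames per second (one frame every 20 ms); that rate, the constant `FRAME_RATE_HZ`, and the helper `limit_matching_set` are illustrative assumptions rather than details taken from this page.

```python
import numpy as np

FRAME_RATE_HZ = 50  # assumed: one feature frame every 20 ms


def limit_matching_set(reference, seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Keep only the first `seconds` worth of reference frames.

    reference: (n_reference_frames, dim) array of target-speaker features.
    Hypothetical helper; returns all frames if less audio is available.
    """
    n_frames = int(round(seconds * frame_rate_hz))
    return reference[:n_frames]


# Example: build matching sets for the 5 s, 10 s, and 30 s conditions
# from placeholder features standing in for one minute of target speech.
reference = np.zeros((60 * FRAME_RATE_HZ, 16))
matching_sets = {s: limit_matching_set(reference, s) for s in (5, 10, 30)}
```

A smaller matching set gives each source frame fewer candidate neighbors to choose from, which is the effect the table's columns let you hear.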
In this section, we apply kNN-VC to unseen languages, whispered speech, and even non-speech sounds. We hope to explore these areas more in future work.
For the cross-lingual examples, we use data from CSS10.
source | target | output | |
---|---|---|---|
es-de | |||
de-ja | |||
zh-es |
For the whispered examples, we use data from CHAINS.
source \ target | irm12 | irf06 | irm06 | irf04 |
---|---|---|---|---|
We also convert a song to a whisper (including the drum beat):
source | output |
---|---|
To see how kNN-VC handles non-human sounds, we apply it to an audio clip of a barking dog.
person | dog | |
---|---|---|
source | ||