Authors: Matthew Baas, Benjamin van Niekerk, Herman Kamper
Arxiv: https://arxiv.org/abs/2305.18975
Abstract: Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity - making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods.
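The matching step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function name `knn_convert`, the use of cosine similarity, and averaging the `k = 4` closest reference frames are assumptions made for the example (the abstract itself only says each frame is replaced by its nearest neighbor, which corresponds to `k = 1`). Feature extraction and vocoding are outside the sketch; the arrays stand in for per-frame self-supervised features.

```python
import numpy as np

def knn_convert(source, reference, k=4):
    """Replace each source frame with the mean of its k nearest reference frames.

    source:    (n_src, dim) array of frame features from the source utterance
    reference: (n_ref, dim) array of frame features from the target speaker
    k and the cosine metric are assumptions for this sketch.
    """
    eps = 1e-8  # guards against division by zero for all-zero frames
    # Normalise rows so that a dot product gives cosine similarity.
    src = source / (np.linalg.norm(source, axis=1, keepdims=True) + eps)
    ref = reference / (np.linalg.norm(reference, axis=1, keepdims=True) + eps)
    similarity = src @ ref.T  # shape (n_src, n_ref)
    # Never ask for more neighbours than there are reference frames.
    k = min(k, reference.shape[0])
    # Indices of the k most similar reference frames for every source frame.
    nearest = np.argpartition(-similarity, k - 1, axis=1)[:, :k]
    # Average the selected (unnormalised) reference frames.
    return reference[nearest].mean(axis=1)

# Usage with random stand-in features (5 source frames, 50 reference frames).
rng = np.random.default_rng(0)
converted = knn_convert(rng.normal(size=(5, 8)), rng.normal(size=(50, 8)))
print(converted.shape)  # (5, 8)
```

Because every output frame is built only from reference frames, the converted sequence keeps the source's frame-by-frame content while drawing its acoustic detail from the target speaker.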
Abstract read by Sir David Attenborough:
Code and pretrained models are available here.
In this section, we apply kNN-VC to unseen source and target speakers from LibriSpeech.
targets | 2300 | 237 | 260 | 1284 | 1089 | 4507 |
---|---|---|---|---|---|---|
source | | | | | | |
In this section, we compare kNN-VC against VQMIVC, YourTTS, and FreeVC.
source | target | kNN-VC | VQMIVC | YourTTS | FreeVC |
---|---|---|---|---|---|
In this section, we investigate the effect of target data size and prematched training. We convert to a single target speaker for all examples, varying the amount of data for the matching set.
target: |
---|
prematched | source | 5 secs | 10 secs | 30 secs | 1 min | 5 mins | 8 mins |
---|---|---|---|---|---|---|---|
✗ | |||||||
✗ | |||||||
✓ | |||||||
✓ | |||||||
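Varying the amount of target data, as in the table above, amounts to shrinking the pool of reference frames available for matching. The helper below is a hypothetical sketch of that step: the name `truncate_matching_set` and the 20 ms frame hop are assumptions for illustration and are not stated on this page.

```python
import numpy as np

FRAME_HOP_SECONDS = 0.02  # assumed hop between feature frames (hypothetical value)

def truncate_matching_set(features, seconds, hop=FRAME_HOP_SECONDS):
    """Keep only the first `seconds` worth of reference frames.

    features: (n_frames, dim) array of target-speaker frame features
    Returns all frames if fewer than `seconds` of data are available.
    """
    n_frames = int(round(seconds / hop))
    return features[:n_frames]

# Stand-in for 20 seconds of target features at the assumed hop (1000 frames).
matching_set = np.zeros((1000, 4))
for seconds in (5, 10, 30):
    subset = truncate_matching_set(matching_set, seconds)
    print(seconds, subset.shape[0])
```

A smaller matching set gives each source frame fewer candidate neighbors to choose from, which is the effect the durations in the table are meant to expose.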
In this section, we apply kNN-VC to unseen languages, whispered speech, and even non-speech sounds. We hope to explore these areas more in future work.
For the cross-lingual examples, we use data from CSS10.
languages | source | target | output |
---|---|---|---|
es-de | | | |
de-ja | | | |
zh-es | | | |
For the whispered examples, we use data from CHAINS.
targets | irm12 | irf06 | irm06 | irf04 |
---|---|---|---|---|
source | | | | |
We also convert a song to a whisper (including the drum beat):
source | output |
---|---|
To see how kNN-VC handles non-human sounds, we apply it to an audio clip of a barking dog.
 | person | dog |
---|---|---|
source | | |