Voice Conversion With Just Nearest Neighbors

Authors: Matthew Baas, Benjamin van Niekerk, Herman Kamper

arXiv: https://arxiv.org/abs/2305.18975

Abstract: Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity - making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods.

Code and pretrained models are available here.
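
The matching step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function name `knn_convert`, the use of cosine similarity, the averaging over k neighbors, and the default k=4 are assumptions made for the example, and random arrays stand in for the self-supervised features. Feature extraction and vocoding are omitted.

```python
import numpy as np


def knn_convert(source, matching_set, k=4):
    """Replace each source frame with the mean of its k nearest reference frames.

    source:       (T_src, D) array of source-utterance features.
    matching_set: (T_ref, D) array of features from the target speaker.
    k:            number of neighbors to average (illustrative default).
    """
    eps = 1e-8  # guards against division by zero for all-zero frames
    # Normalise rows so that a dot product equals cosine similarity
    # (cosine similarity is an assumption of this sketch).
    src_unit = source / (np.linalg.norm(source, axis=1, keepdims=True) + eps)
    ref_unit = matching_set / (np.linalg.norm(matching_set, axis=1, keepdims=True) + eps)
    similarity = src_unit @ ref_unit.T  # shape (T_src, T_ref)

    # Indices of the k most similar reference frames for every source frame.
    top_k = np.argpartition(-similarity, k - 1, axis=1)[:, :k]

    # Gather the selected reference frames and average them: (T_src, k, D) -> (T_src, D).
    return matching_set[top_k].mean(axis=1)


# Random stand-ins for real features; the dimensions are arbitrary.
rng = np.random.default_rng(0)
source = rng.standard_normal((120, 1024))
matching_set = rng.standard_normal((3000, 1024))
converted = knn_convert(source, matching_set)
print(converted.shape)  # (120, 1024)
```

Setting `k=1` gives the plain nearest-neighbor replacement described in the abstract. The output has one frame per source frame, so the converted sequence keeps the timing of the source while every frame is built from the reference speaker's features.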



Samples from kNN-VC

In this section, we apply kNN-VC to unseen source and target speakers from LibriSpeech.

[Audio sample grid: a source column followed by one column per target speaker (2300, 237, 260, 1284, 1089, 4507).]


Comparison to other methods

In this section, we compare kNN-VC against VQMIVC, YourTTS, and FreeVC.

[Audio sample table with columns: source, target, kNN-VC, VQMIVC, YourTTS, FreeVC.]


Ablations

In this section, we investigate the effect of target data size and prematched training. We convert to a single target speaker for all examples, varying the amount of data for the matching set.

[Audio sample table for a single target speaker, with columns: prematched, source, 5 secs, 10 secs, 30 secs, 1 min, 5 mins, 8 mins.]
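
Varying the amount of target data amounts to limiting how many reference frames are available for matching. The sketch below shows one way to cut a matching set down to a given duration. The helper name `truncate_matching_set` and the rate of 50 feature frames per second (one frame per 20 ms) are assumptions made for this example and are not stated on this page.

```python
import numpy as np

FRAME_RATE_HZ = 50  # assumption: one feature frame every 20 ms


def truncate_matching_set(matching_set, seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Keep only the first `seconds` worth of frames from a (T, D) matching set."""
    n_frames = int(round(seconds * frame_rate_hz))
    # Slicing past the end simply returns every available frame.
    return matching_set[:n_frames]


# Zero array standing in for 8 minutes of target-speaker features.
full_set = np.zeros((8 * 60 * FRAME_RATE_HZ, 8))

# Durations matching the column labels of the ablation table, in seconds.
for seconds in (5, 10, 30, 60, 300, 480):
    subset = truncate_matching_set(full_set, seconds)
    print(seconds, subset.shape[0])
```

A smaller matching set leaves fewer candidate frames for each source frame, which is the effect the samples in this section let you hear.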


Bonus stuff

In this section, we apply kNN-VC to unseen languages, whispered speech, and even non-speech sounds. We hope to explore these areas more in future work.

Cross-lingual voice conversion

For the cross-lingual examples, we use data from CSS10.

[Audio sample table with columns: source, target, output. Rows (language pairs): es-de, de-ja, zh-es.]

Whispered speech conversion

For the whispered examples, we use data from CHAINS.

[Audio sample grid: a source column followed by one column per target speaker (irm12, irf06, irm06, irf04).]

We also convert a song to a whisper (including the drum beat):

[Audio samples: source, output.]

Dog-person conversion

To see how kNN-VC handles non-human sounds, we apply it to an audio clip of a barking dog.

[Audio sample grid: a source column followed by columns labeled person and dog.]