Voice Conversion With Just Nearest Neighbors

Authors: Matthew Baas, Benjamin van Niekerk, Herman Kamper

arXiv: https://arxiv.org/abs/2305.18975

Abstract: Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity - making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods.
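The matching step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the function name `knn_convert`, the use of cosine similarity, and the averaging of `k=4` neighbors are assumptions made for this sketch, and the feature extractor and vocoder are left out entirely (random arrays stand in for the self-supervised features).

```python
import numpy as np


def knn_convert(source, matching_set, k=4):
    """Replace each source frame with the mean of its k nearest matching-set frames.

    source:       (n_source_frames, dim) features of the source utterance
    matching_set: (n_reference_frames, dim) features of the target speaker
    k:            neighbors to average (assumed default; k=1 is plain nearest neighbor)
    """
    source = np.asarray(source, dtype=np.float64)
    matching_set = np.asarray(matching_set, dtype=np.float64)
    k = min(k, len(matching_set))

    # Cosine similarity between every source frame and every reference frame.
    # Assumption: no frame is an all-zero vector.
    src_unit = source / np.linalg.norm(source, axis=1, keepdims=True)
    ref_unit = matching_set / np.linalg.norm(matching_set, axis=1, keepdims=True)
    similarity = src_unit @ ref_unit.T  # (n_source_frames, n_reference_frames)

    # Indices of the k most similar reference frames for each source frame.
    nearest = np.argpartition(-similarity, k - 1, axis=1)[:, :k]

    # Average the selected (unnormalized) reference frames per source frame.
    return matching_set[nearest].mean(axis=1)


# Stand-in features; real inputs would come from a self-supervised speech model.
rng = np.random.default_rng(0)
source_feats = rng.normal(size=(120, 16))
reference_feats = rng.normal(size=(500, 16))
converted = knn_convert(source_feats, reference_feats, k=4)
print(converted.shape)  # (120, 16)
```

The output has one frame per source frame, so timing and content follow the source, while every output vector is built only from frames of the reference speaker. A vocoder would then turn `converted` back into audio.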

Abstract read by Sir David Attenborough:

Code and pretrained models are available here.

Samples from kNN-VC

In this section, we apply kNN-VC to unseen source and target speakers from LibriSpeech.

2300 237 260 1284 1089 4507

Comparison to other methods

In this section, we compare kNN-VC against VQMIVC, YourTTS, and FreeVC.

source target kNN-VC VQMIVC YourTTS FreeVC


Target data size and prematched training

In this section, we investigate the effect of target data size and prematched training. We convert to a single target speaker for all examples, varying the amount of data for the matching set.

prematched source 5 secs 10 secs 30 secs 1 min 5 mins 8 mins

Bonus stuff

In this section, we apply kNN-VC to unseen languages, whispered speech, and even non-speech sounds. We hope to explore these areas more in future work.

Cross-lingual voice conversion

For the cross-lingual examples, we use data from CSS10.

source target output

Whispered speech conversion

For the whispered examples, we use data from CHAINS.

irm12 irf06 irm06 irf04

We also convert a song to a whisper (including the drum beat):

source output

Dog-person conversion

To see how kNN-VC handles non-human sounds, we apply it to an audio clip of a barking dog.

person dog