Voice Conversion With Just Nearest Neighbors

Authors: Matthew Baas, Benjamin van Niekerk, Herman Kamper

arXiv: https://arxiv.org/abs/2305.18975

Abstract: Any-to-any voice conversion aims to transform source speech into a target voice with just a few examples of the target speaker as a reference. Recent methods produce convincing conversions, but at the cost of increased complexity, making results difficult to reproduce and build on. Instead, we keep it simple. We propose k-nearest neighbors voice conversion (kNN-VC): a straightforward yet effective method for any-to-any conversion. First, we extract self-supervised representations of the source and reference speech. To convert to the target speaker, we replace each frame of the source representation with its nearest neighbor in the reference. Finally, a pretrained vocoder synthesizes audio from the converted representation. Objective and subjective evaluations show that kNN-VC improves speaker similarity with similar intelligibility scores to existing methods.
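
As a rough illustration of the matching step described in the abstract, the sketch below replaces each source frame with the mean of its k nearest reference frames. It is a minimal sketch under stated assumptions, not the released implementation: the function name `knn_convert`, the use of cosine similarity, the uniform averaging over k neighbors, and the random arrays standing in for self-supervised features are all assumptions made for illustration. Feature extraction and the vocoder are not shown.

```python
import numpy as np


def knn_convert(source_feats, reference_feats, k=1):
    """Replace each source frame with the mean of its k nearest reference frames.

    source_feats:    (n_source_frames, dim) array
    reference_feats: (n_reference_frames, dim) array
    Cosine similarity is an assumed choice of distance for this sketch.
    """
    eps = 1e-8
    # L2-normalize rows so that a dot product gives cosine similarity.
    src = source_feats / (np.linalg.norm(source_feats, axis=1, keepdims=True) + eps)
    ref = reference_feats / (np.linalg.norm(reference_feats, axis=1, keepdims=True) + eps)

    similarity = src @ ref.T  # (n_source_frames, n_reference_frames)

    # Indices of the k most similar reference frames for every source frame.
    top_k = np.argsort(-similarity, axis=1)[:, :k]  # (n_source_frames, k)

    # Average the selected (unnormalized) reference frames.
    return reference_feats[top_k].mean(axis=1)  # (n_source_frames, dim)


# Placeholder features: random arrays stand in for real speech representations.
rng = np.random.default_rng(0)
source = rng.normal(size=(5, 16))
reference = rng.normal(size=(40, 16))
converted = knn_convert(source, reference, k=1)
print(converted.shape)  # (5, 16)
```

With k=1 every output frame is an exact copy of a reference frame, which matches the "nearest neighbor" wording of the abstract. Larger k averages several reference frames per source frame.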

Abstract read by Sir David Attenborough:

Code and pretrained models are available here.


Samples from kNN-VC

In this section, we apply kNN-VC to unseen source and target speakers from LibriSpeech.

	targets
	2300	237	260	1284	1089	4507
source

Comparison to other methods

In this section, we compare kNN-VC against VQMIVC, YourTTS, and FreeVC.

source	target	kNN-VC	VQMIVC	YourTTS	FreeVC

Ablations

In this section, we investigate the effect of target data size and prematched training. We convert to a single target speaker for all examples, varying the amount of data for the matching set.

target:

prematched	source	5 secs	10 secs	30 secs	1 min	5 mins	8 mins
✗
✗
✓
✓
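
To make the matching-set sizes in the table above concrete, the sketch below shows one way to limit a reference feature array to a given duration. It is only an illustration: the helper `limit_matching_set` is hypothetical, the frame rate of 50 frames per second (one frame per 20 ms) is an assumption about the feature extractor, and the zero array is a placeholder for real reference features.

```python
import numpy as np

FRAMES_PER_SECOND = 50  # assumed frame rate: one feature frame per 20 ms


def limit_matching_set(reference_feats, seconds, frames_per_second=FRAMES_PER_SECOND):
    """Keep only the frames covering the first `seconds` of reference speech."""
    n_frames = int(round(seconds * frames_per_second))
    return reference_feats[:n_frames]


# Placeholder for 8 minutes of reference features with 16 dimensions.
reference = np.zeros((8 * 60 * FRAMES_PER_SECOND, 16))

# Durations from the ablation table, in seconds.
for seconds in (5, 10, 30, 60, 300, 480):
    subset = limit_matching_set(reference, seconds)
    print(seconds, subset.shape[0])
```

Under this assumed frame rate, 5 seconds of target speech gives only 250 candidate frames to match against, while 8 minutes gives 24000.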

Bonus stuff

In this section, we apply kNN-VC to unseen languages, whispered speech, and even non-speech sounds. We hope to explore these areas more in future work.