Rhythm Modeling for Voice Conversion
Authors: Benjamin van Niekerk, Marc-André Carbonneau, Herman Kamper
Abstract: Voice conversion aims to transform source speech into a different target voice. However, typical voice conversion systems do not account for rhythm, which is an important factor in the perception of speaker identity. To bridge this gap, we introduce Urhythmic, an unsupervised method for rhythm conversion that does not require parallel data or text transcriptions. Using self-supervised representations, we first divide source audio into segments approximating sonorants, obstruents, and silences. Then we model rhythm by estimating speaking rate or the duration distribution of each segment type. Finally, we match the target speaking rate or rhythm by time-stretching the speech segments. Experiments show that Urhythmic outperforms existing unsupervised methods in terms of quality and prosody.
Code and pretrained models are available here.
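As a rough illustration of the global (speaking-rate) variant described in the abstract, the sketch below estimates a speaking rate from labeled segments and derives a single time-stretch factor. It is not the released implementation: the `Segment` type, the function names, and the definition of speaking rate as sonorant segments per second of segmented audio are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str    # "sonorant", "obstruent", or "silence"
    start: float  # seconds
    end: float    # seconds


def speaking_rate(segments):
    """Sonorant segments per second (an assumed definition, for illustration)."""
    total = sum(s.end - s.start for s in segments)
    if total <= 0:
        raise ValueError("segments must cover a positive duration")
    n_sonorants = sum(1 for s in segments if s.label == "sonorant")
    return n_sonorants / total


def global_stretch_factor(source_rate, target_rate):
    """Factor > 1 lengthens the source; factor < 1 shortens it."""
    if target_rate <= 0:
        raise ValueError("target_rate must be positive")
    return source_rate / target_rate


def stretch(segments, factor):
    """Scale every segment boundary by the same global factor."""
    return [Segment(s.label, s.start * factor, s.end * factor) for s in segments]


# Hypothetical segmentation of a one-second utterance.
source = [
    Segment("sonorant", 0.0, 0.2),
    Segment("obstruent", 0.2, 0.3),
    Segment("sonorant", 0.3, 0.5),
    Segment("silence", 0.5, 1.0),
]
factor = global_stretch_factor(speaking_rate(source), target_rate=4.0)
converted = stretch(source, factor)
```

In this toy case the target speaks twice as fast as the source, so every segment is compressed to half its original duration. The fine-grained variant would instead pick a separate factor per segment from the duration distributions of each segment type.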
Samples from Urhythmic
This section contains speech samples from our fine-grained approach.
We pick the three fastest and three slowest speakers from VCTK.
To avoid conflating accent and speaker identity, we limit the selection to a single region (Southern
England).
The target speakers in the following table are ordered by speaking rate from slowest on the left to
fastest on the right.
targets | p228 | p268 | p225 | p232 | p257 | p231 |
---|---|---|---|---|---|---|
source | | | | | | |
Comparison to baselines
Here we present samples from the subjective evaluations.
We compare against two baselines: AutoPST and DISSC.
source | target | AutoPST | DISSC | Urhythmic global | Urhythmic fine |
---|---|---|---|---|---|
Duration Control
Next, we use Urhythmic to edit the rhythm of an utterance without supervision.
For this example, we cut the dendrogram (Fig. 4 in the paper) into eight clusters to illustrate control
over the different sound types (vowel, approximant, nasal, fricative, stop, silence).
We visualize the effect of stretching or contracting some clusters in Fig. 2.
Below, we stretch the segments in different clusters to slow each sound type down by a factor of two.
sound type | cluster# | |
---|---|---|
no-modification | ||
fricatives | 6,7 | |
vowels | 2,4 | |
silences | 0 | |
stops | 3 | |
approximants | 1 | |
nasals | 5 |
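The duration edits in the table above can be sketched as a mapping from sound type to cluster IDs, followed by a per-segment rescaling of durations. This is a minimal sketch rather than the authors' code: the cluster IDs are taken from the table, while the `(cluster_id, start, end)` segment format and the function name `stretch_sound_type` are assumptions. Resynthesizing audio from the new durations is not shown.

```python
# Cluster IDs per sound type, copied from the table above.
SOUND_TYPE_CLUSTERS = {
    "fricatives": (6, 7),
    "vowels": (2, 4),
    "silences": (0,),
    "stops": (3,),
    "approximants": (1,),
    "nasals": (5,),
}


def stretch_sound_type(segments, sound_type, factor=2.0):
    """Lengthen segments of one sound type and shift later boundaries.

    segments: list of (cluster_id, start, end) tuples, contiguous in time
    (a hypothetical format chosen for this example).
    """
    clusters = SOUND_TYPE_CLUSTERS[sound_type]
    out = []
    t = segments[0][1] if segments else 0.0  # running start time
    for cluster, start, end in segments:
        duration = end - start
        if cluster in clusters:
            duration *= factor  # only the selected sound type is stretched
        out.append((cluster, t, t + duration))
        t += duration
    return out


# Hypothetical segmentation: silence, vowel, fricative.
utterance = [(0, 0.0, 0.1), (2, 0.1, 0.3), (6, 0.3, 0.4)]
slow_vowels = stretch_sound_type(utterance, "vowels")
```

Here only the vowel segment doubles in length (0.2 s to 0.4 s). The fricative keeps its 0.1 s duration but starts later, so the utterance grows from 0.4 s to 0.6 s.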
Supplementary Results
This section presents an additional visualization of the correlation between the estimated speaking rate
and the syllable rate.