KNN Voice Conversion

Voice conversion with just k-nearest neighbors. The source utterance and the reference utterance(s) are encoded into self-supervised features using WavLM. Each source feature is then replaced by the mean of its k nearest features from the reference. The resulting feature sequence is vocoded with HiFi-GAN to produce the converted waveform.
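The matching step described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the demo's actual code: the function name `knn_convert`, the use of cosine similarity as the distance measure, and the default `k=4` are assumptions made for the example, and the feature arrays stand in for WavLM outputs of shape (frames, dimensions).

```python
import numpy as np

def knn_convert(source_feats, reference_feats, k=4):
    """Replace each source frame with the mean of its k nearest reference frames.

    Hypothetical helper for illustration; cosine similarity is an assumption.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    ref = reference_feats / np.linalg.norm(reference_feats, axis=1, keepdims=True)
    sim = src @ ref.T  # shape: (n_source, n_reference)

    # Indices of the k most similar reference frames for each source frame
    # (requires k <= number of reference frames).
    top_idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]

    # Average the original (unnormalised) reference features.
    return reference_feats[top_idx].mean(axis=1)

# Toy usage with random arrays standing in for WavLM features.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 16))      # 50 source frames
reference = rng.normal(size=(200, 16))  # 200 reference frames
converted = knn_convert(source, reference, k=4)
print(converted.shape)  # (50, 16)
```

The output has one frame per source frame, but every frame is built only from reference features, which is why the vocoded result takes on the reference speaker's voice while keeping the source content.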

src_wav_path

ref_wav_paths

Top-k

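The default settings give good results in most cases, but feel free to adjust the kNN top-k value, which sets how many of the nearest reference features are averaged for each source feature.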

3 10

output

If the model contributes to your research, please cite the following work:

Baas, M., van Niekerk, B., & Kamper, H. (2023). Voice conversion with just nearest neighbors. arXiv preprint arXiv:2305.18975.

demo contributed by @wetdog