Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder

Authors

PORTEŠ David, HORÁK Aleš

Year of publication 2024
Type Article in Proceedings
Conference Text, Speech, and Dialogue
MU Faculty or unit

Faculty of Informatics

Citation
Doi http://dx.doi.org/10.1007/978-3-031-70566-3_13
Keywords Fundamental Frequency; Prosody; VQ-VAE; Vector Embeddings
Description Language models operating on discrete audio representations are increasingly becoming the go-to framework for many speech-processing tasks. Recently, discrete embeddings of the fundamental frequency (F0) have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0 curve, that is, regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters in order to achieve the highest possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the HuggingFace website.
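The F0 Frame Error mentioned in the description can be illustrated with a minimal sketch. It assumes the commonly used definition of FFE: the share of frames that have either a voicing decision error or a gross pitch error, where a gross pitch error is a deviation of more than 20% from the reference F0. The record does not state the exact convention used in the paper, so the 20% tolerance, the use of 0 to mark unvoiced frames, and the function name `f0_frame_error` are assumptions made for illustration only.

```python
import numpy as np


def f0_frame_error(f0_ref, f0_est, tolerance=0.2):
    """Return the fraction of frames with a voicing or gross pitch error.

    Assumptions (not taken from the paper): unvoiced frames are encoded
    as 0, both tracks are frame-aligned, and a gross pitch error is a
    relative deviation larger than `tolerance` (20% by default).
    """
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    if f0_ref.shape != f0_est.shape:
        raise ValueError("reference and estimate must have the same shape")
    if f0_ref.size == 0:
        raise ValueError("F0 tracks must not be empty")

    voiced_ref = f0_ref > 0
    voiced_est = f0_est > 0

    # Voicing decision error: the two tracks disagree on voiced/unvoiced.
    voicing_error = voiced_ref != voiced_est

    # Gross pitch error: both tracks are voiced, but the estimate is too
    # far from the reference in relative terms.
    both_voiced = voiced_ref & voiced_est
    rel_dev = np.zeros_like(f0_ref)
    rel_dev[both_voiced] = (
        np.abs(f0_est[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
    )
    gross_pitch_error = both_voiced & (rel_dev > tolerance)

    return float(np.mean(voicing_error | gross_pitch_error))


# Toy example with five frames (values in Hz, 0 = unvoiced).
reference = [100.0, 200.0, 0.0, 0.0, 150.0]
estimate = [100.0, 250.0, 0.0, 120.0, 0.0]
ffe = f0_frame_error(reference, estimate)
print(f"FFE = {ffe:.2%}")
```

In the toy example, one frame has a gross pitch error (250 Hz against 200 Hz) and two frames have voicing errors, so three of the five frames count as errors.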