Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder
Authors | |
---|---|
Year of publication | 2024 |
Type | Article in Proceedings |
Conference | Text, Speech, and Dialogue |
MU Faculty or unit | |
Citation | |
DOI | http://dx.doi.org/10.1007/978-3-031-70566-3_13 |
Keywords | Fundamental Frequency; Prosody; VQ-VAE; Vector Embeddings |
Description | Language models operating on discrete audio representations are increasingly becoming the go-to framework for many speech-processing tasks. Recently, discrete embeddings of the fundamental frequency (F0) have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0, which are regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters in order to achieve the highest possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the HuggingFace website. |
Related projects:
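The abstract above describes a grid search over the VQ-VAE's embedding size and codebook size. The paper's own implementation is not reproduced here; as an illustrative sketch only, the standard VQ-VAE vector-quantization bottleneck below shows where those two hyperparameters enter. The class name `VectorQuantizer` and the default values are hypothetical.

```python
# Illustrative sketch of a standard VQ-VAE bottleneck (not the paper's code).
# The two hyperparameters searched in the paper -- codebook size and
# embedding size -- correspond to `num_codes` and `embedding_dim` below.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 256, embedding_dim: int = 64,
                 commitment_cost: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, embedding_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.commitment_cost = commitment_cost

    def forward(self, z_e: torch.Tensor):
        # z_e: encoder output of shape (batch, time, embedding_dim)
        flat = z_e.reshape(-1, z_e.shape[-1])
        # Squared Euclidean distance from each frame to every codebook vector.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        codes = dist.argmin(dim=1)                    # discrete F0 tokens
        z_q = self.codebook(codes).view_as(z_e)       # quantized embeddings
        # Codebook loss plus commitment loss (standard VQ-VAE objective terms).
        loss = (F.mse_loss(z_q, z_e.detach())
                + self.commitment_cost * F.mse_loss(z_e, z_q.detach()))
        # Straight-through estimator: copy gradients from z_q back to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes.view(z_e.shape[:-1]), loss
```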
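The results in the abstract are reported as F0 Frame Error (FFE), a standard metric that counts frames with either a voicing decision error or a gross pitch error (both frames voiced but the estimate deviating from the reference by more than 20%). A minimal NumPy sketch of this definition follows; the function and argument names are illustrative, not the paper's evaluation code.

```python
# Minimal sketch of the standard F0 Frame Error (FFE) metric: the fraction of
# frames with either a voicing decision error or a gross pitch error
# (>20% deviation when both reference and estimate are voiced).
import numpy as np


def f0_frame_error(f0_ref: np.ndarray, f0_est: np.ndarray,
                   tolerance: float = 0.20) -> float:
    # Frames with F0 == 0 are treated as unvoiced.
    voiced_ref = f0_ref > 0
    voiced_est = f0_est > 0
    # Voicing decision errors: voiced/unvoiced labels disagree.
    vde = voiced_ref != voiced_est
    # Gross pitch errors: both voiced, but the estimate deviates by > 20%.
    both_voiced = voiced_ref & voiced_est
    gpe = np.zeros_like(vde)
    gpe[both_voiced] = (
        np.abs(f0_est[both_voiced] - f0_ref[both_voiced])
        > tolerance * f0_ref[both_voiced]
    )
    return float(np.mean(vde | gpe))


# The 0.53%-4.29% FFE range reported in the abstract corresponds to return
# values of roughly 0.0053-0.0429 from a function like this.
```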