Continuous Diffusion with VQ-VAE for Symbolic Music Generation

We introduce a framework that bridges the modality gap between continuous diffusion models and inherently discrete symbolic music. The approach has two core components: a Mel Encoder, which encodes log-mel spectrogram inputs into a latent representation, and a Bar Tokenizer, which maps piano rolls into a continuous latent space. Because the resulting embeddings are continuous, the discrete musical data can be sampled and generated with standard diffusion models. A key design choice is bar-level segmentation: instead of splitting music into fixed-duration time windows, as is conventional, we split it at bar boundaries, which yields a more musically coherent representation.
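To make the idea of bar-level segmentation concrete, the following is a minimal sketch rather than the framework's actual preprocessing. It assumes a piano roll stored as a (time steps, pitches) array, a fixed time signature, and a hypothetical grid of 16 time steps per bar; the function name `split_into_bars` and the zero-padding of the final bar are illustrative assumptions.

```python
import numpy as np


def split_into_bars(piano_roll, steps_per_bar=16):
    """Split a (time, pitch) piano roll into a (bars, steps_per_bar, pitch) array.

    Assumes a constant number of time steps per bar (hypothetical default: 16,
    i.e. sixteenth-note resolution in 4/4). A trailing partial bar is
    zero-padded so every segment covers exactly one bar.
    """
    num_steps, num_pitches = piano_roll.shape
    num_bars = -(-num_steps // steps_per_bar)  # ceiling division
    padded = np.zeros((num_bars * steps_per_bar, num_pitches), dtype=piano_roll.dtype)
    padded[:num_steps] = piano_roll
    return padded.reshape(num_bars, steps_per_bar, num_pitches)


# Toy piano roll: 40 time steps (2.5 bars) over 128 MIDI pitches.
roll = np.zeros((40, 128), dtype=np.uint8)
roll[0:4, 60] = 1    # a note at the start of bar 0
roll[16:20, 64] = 1  # a note at the start of bar 1

bars = split_into_bars(roll)
print(bars.shape)
```

Each slice `bars[i]` then covers exactly one bar, so a note that starts on a downbeat always lands at step 0 of its segment. With fixed-duration windows that ignore the meter, the same note could fall at an arbitrary offset within a window.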

An illustration of our proposed framework.

Framework illustration credited to Yuanhe Guo.