
A Diffusion Pipeline for
Multilingual Singing Voice Synthesis with Expressive Style Control

This is my internship project at NVIDIA Research from May 2022 to February 2023. The paper has not been published yet (hopefully it will come out soon).

The system takes a score, lyrics, a style label, and a singer ID as input and generates expressive, realistic singing. It is built as a cascade of diffusion models: (1) performance control models that predict timing, F0 curves, and loudness curves; (2) an acoustic model that generates mel-spectrograms conditioned on the performance control signals; (3) a DiffWave vocoder that generates the waveform from the mel-spectrograms and F0 curves. The following figure shows the high-level architecture, and a code sketch of the cascade appears below.

[Figure: high-level architecture of the diffusion pipeline]

Input:   1. Score; 2. Lyrics; 3. Style; 4. Singer ID

Output:  Expressive and realistic singing
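
To make the three-stage cascade concrete, here is a minimal Python sketch of how the stages fit together at inference time. This is my own illustration, not the actual research code; all class and argument names (control_model, acoustic_model, vocoder, etc.) are hypothetical stand-ins for the models described above.

```python
import torch

class SingingVoicePipeline:
    """A hypothetical sketch of the three-stage cascade (not the real code)."""

    def __init__(self, control_model, acoustic_model, vocoder):
        # Stage 1: diffusion models for performance control
        # (phoneme timing, F0 curves, loudness curves).
        self.control_model = control_model
        # Stage 2: diffusion acoustic model that generates mel-spectrograms
        # conditioned on the performance control signals.
        self.acoustic_model = acoustic_model
        # Stage 3: DiffWave-style vocoder that turns mel-spectrograms
        # (plus the F0 curve) into a waveform.
        self.vocoder = vocoder

    @torch.no_grad()
    def synthesize(self, score, lyrics, style, singer_id):
        # (1) Predict expressive performance controls from the score,
        #     lyrics, style label, and singer identity.
        timing, f0, loudness = self.control_model(score, lyrics, style, singer_id)
        # (2) Generate the mel-spectrogram conditioned on those controls.
        mel = self.acoustic_model(timing, f0, loudness, lyrics, singer_id)
        # (3) Vocode the mel-spectrogram and F0 curve into audio.
        return self.vocoder(mel, f0)
```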

Generated Example
A Happy Birthday song in Chinese, sung by different singers and in different styles

This song is not in the training data; it is generated from scratch from the score:

By the way, many of the singers in this demo never sang Chinese in the training data.

Multilingual & Stylistic Demo

[Audio: three generated samples, each paired with its ground-truth recording]

Some generated opera singing:

[Audio clips]