

Shuqi Dai
ExpressiveSinger: Multilingual and Multi-Style Score-based Singing Voice Synthesis with Expressive Performance Control
This is my internship project at NVIDIA Research from May 2022 to Feb 2023.
The system takes a score, lyrics, a style label, and singer information as input and generates expressive, realistic singing using a cascade of diffusion models. The pipeline consists of (1) performance control models that predict timing, F0 curves, and loudness curves; (2) an acoustic model that generates mel-spectrograms conditioned on the performance control signals; and (3) a DiffWave vocoder that generates the waveform from the mel-spectrograms and F0 curves. The following figure shows the high-level architecture.
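The three-stage cascade above can be sketched in code. This is a minimal illustration only: the function names, shapes, and placeholder bodies are assumptions for clarity, not the paper's actual API, and the real stages are diffusion models rather than the toy computations shown here.

```python
import numpy as np

def performance_control(score, lyrics, style, singer, n_frames=100):
    """Stage 1 (hypothetical): predict per-frame timing, F0, and loudness curves."""
    rng = np.random.default_rng(0)
    timing = np.linspace(0.0, len(score) * 0.5, n_frames)    # frame times in seconds
    f0 = 220.0 + 20.0 * np.sin(np.linspace(0, 8, n_frames))  # vibrato-like F0 curve (Hz)
    loudness = rng.uniform(0.3, 1.0, n_frames)               # per-frame loudness
    return timing, f0, loudness

def acoustic_model(lyrics, f0, loudness, n_mels=80):
    """Stage 2 (hypothetical): mel-spectrogram conditioned on control signals."""
    mel = np.outer(np.ones(n_mels), loudness)  # placeholder, shape (n_mels, n_frames)
    return mel

def vocoder(mel, f0, hop=256):
    """Stage 3 (hypothetical): DiffWave-style vocoder, mel + F0 -> waveform."""
    n_frames = mel.shape[1]
    return np.zeros(n_frames * hop, dtype=np.float32)  # silent placeholder audio

def synthesize(score, lyrics, style, singer):
    """Run the cascade end to end: controls -> mel-spectrogram -> waveform."""
    timing, f0, loudness = performance_control(score, lyrics, style, singer)
    mel = acoustic_model(lyrics, f0, loudness)
    return vocoder(mel, f0)

wav = synthesize(score=["C4", "D4", "E4"], lyrics="la la la",
                 style="pop", singer="singer_01")
print(wav.shape)  # (25600,) — 100 frames x 256 samples per hop
```

The key design point is that each stage conditions only on the outputs of the previous one, so the performance controls (timing, F0, loudness) can be edited or swapped independently before acoustic and waveform generation.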

Input: 1. Score; 2. Lyrics; 3. Style; 4. Singer Info
Output: Expressive and realistic singing
Generated Example
A Happy Birthday Song in Chinese sung by different singers/styles
This song is not in the training data; it is generated from scratch from the score alone.
By the way, many of the singers in this demo never sang in Chinese in the training data.
Multilingual & Stylistic Demo
Generated: Ground-Truth:
Generated: Ground-Truth:
Generated: Ground-Truth:
Generated opera singing: