Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion

Anonymous submission to Interspeech 2026

Abstract

Diffusion and flow-matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models first determine durations and then diffuse on a fixed alignment, which often collapses to mean prosody and forces uniform stretching when the speaking rate is changed. Single-stage attention-based models avoid explicit durations but suffer from alignment instability. We propose a jump-diffusion framework in which discrete jumps model temporal structure and continuous diffusion refines spectral content within a single probabilistic process. On LJSpeech, our method achieves 3.37% WER versus 4.38% for Grad-TTS, with a better UTMOSv2 score. On out-of-distribution slow speech, our model autonomously inserts natural pauses rather than stretching uniformly, improving intelligibility over two-stage baselines.

Model Overview

Method

Noisy and incomplete → Clean and complete

Our model progressively refines the mel-spectrogram while jumping toward the target length. Below are mel-spectrogram samples and the corresponding vocoded audio at different iterations of the diffusion process.

Decoder state | Mel-spectrogram sample | Vocoded audio
t = 0.998 | mel 0 |
t = 0.797 | mel 1 |
t = 0.579 | mel 2 |
t = 0.397 | mel 3 |
t = 0.198 | mel 4 |
t = 0.000 | mel 5 |
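
The "noisy and incomplete → clean and complete" trajectory can be illustrated with a toy sampler. This is only a minimal sketch under strong assumptions, not the model described here: the clean target stands in for a trained network's prediction, the length follows a hypothetical linear schedule, and the names `resample_frames` and `toy_jump_diffusion` are invented for illustration. Each step first applies discrete jumps (inserting noise frames) and then a continuous Euler-style update toward the predicted clean frames.

```python
import numpy as np

def resample_frames(mel, n_frames):
    """Linearly interpolate a (frames x bins) array to n_frames frames."""
    src = np.arange(mel.shape[0])
    dst = np.linspace(0, mel.shape[0] - 1, n_frames)
    cols = [np.interp(dst, src, mel[:, k]) for k in range(mel.shape[1])]
    return np.stack(cols, axis=1)

def toy_jump_diffusion(target, n_steps=50, start_len=4, seed=0):
    """Toy sampler: frame-insertion jumps plus continuous refinement.

    ASSUMPTION: `target` replaces the output of a trained network.
    A real sampler never has access to the clean mel-spectrogram.
    """
    rng = np.random.default_rng(seed)
    n_target, n_mels = target.shape
    # Start from a short, pure-noise sequence (noisy and incomplete).
    x = rng.standard_normal((start_len, n_mels))
    lengths = [len(x)]
    times = np.linspace(1.0, 0.0, n_steps + 1)
    for t_now, t_next in zip(times[:-1], times[1:]):
        # Discrete jumps: insert noise frames at random positions so the
        # length follows a (hypothetical) linear schedule up to n_target.
        want = int(round(start_len + (1.0 - t_next) * (n_target - start_len)))
        for _ in range(want - len(x)):
            pos = int(rng.integers(0, len(x) + 1))
            x = np.insert(x, pos, rng.standard_normal(n_mels), axis=0)
        # Continuous step: move a fraction of the way toward the
        # "predicted" clean frames at the current length.
        x_clean = resample_frames(target, len(x))
        x = x + (t_now - t_next) / t_now * (x_clean - x)
        lengths.append(len(x))
    return x, lengths

# Small demo on a synthetic 20-frame, 8-bin "mel-spectrogram".
target = np.sin(np.linspace(0, 3, 20))[:, None] * np.ones((1, 8))
out, lengths = toy_jump_diffusion(target)
print(lengths[0], lengths[-1], out.shape)
```

In this sketch the sequence grows from 4 noise frames to the full target length while the frame values are denoised, so intermediate states are both shorter and noisier than the final output, which is the qualitative behavior the table above shows.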

Speech Samples

Text | GT | Grad-TTS | OneShot (Ours) | TDD | UDD (Ours)
together with a great increase in the payrolls there has come a substantial rise in the total of industrial profits
and recognized as one of the frequenters of the bogus law-stationers His arrest led to that of others
All the committee could do in this respect was to throw the responsibility on others
who came from his room ready dressed a suspicious circumstance as he was always late in the morning
yet he could not overcome the strange fascination it had for him and remained by the side of the corpse till the stretcher came
The inadequacy of the jail was noticed and reported upon again and again by the grand juries of the city of London
is closely reproduced in the life-history of existing deer Or in other words

Slow Speech (0.75x): Uniform Stretching vs Adaptive Pause

Grad-TTS | UDD (Ours)
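
The difference between the two slow-down strategies can be made concrete on a list of per-phone durations. The sketch below is a hand-written illustration, not how either system works internally: it assumes that a 0.75x rate means total duration is divided by 0.75, and the functions `uniform_stretch` and `insert_pauses`, as well as the explicit `pause_slots` argument, are hypothetical (the proposed model places pauses on its own rather than from a given set of positions).

```python
def uniform_stretch(durations, rate):
    """Scale every duration by 1/rate, as stretching a fixed alignment does."""
    return [d / rate for d in durations]

def insert_pauses(durations, rate, pause_slots):
    """Keep phone durations unchanged and add the extra time as pauses.

    ASSUMPTION: `pause_slots` is a set of indices after which a pause is
    placed; the extra time is split evenly between them.
    """
    total = sum(durations)
    extra = total / rate - total
    pause = extra / len(pause_slots)
    out = []
    for i, d in enumerate(durations):
        out.append(d)
        if i in pause_slots:
            out.append(pause)  # silence inserted after phone i
    return out

durations = [6, 9, 3, 12]  # frames per phone, made-up values
print(uniform_stretch(durations, 0.75))
print(insert_pauses(durations, 0.75, {1}))
```

Both outputs have the same total length, but the first lengthens every phone equally while the second leaves the phones at their original durations and places all the added time in a pause.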