Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion

Abstract

Diffusion and flow-matching TTS models face a tension between discrete temporal structure and continuous spectral modeling. Two-stage models first predict durations and then diffuse over a fixed alignment; they often collapse to mean prosody and stretch speech uniformly when the speaking rate is changed. Single-stage attention-based models avoid explicit durations but suffer from alignment instability. We propose a jump-diffusion framework in which discrete jumps model temporal structure and continuous diffusion refines spectral content within a single probabilistic process. On LJSpeech, our method achieves a WER of 3.37%, compared with 4.38% for Grad-TTS, along with a better UTMOSv2 score. On out-of-distribution slow speech, our model inserts natural pauses on its own instead of stretching uniformly, improving intelligibility over two-stage baselines.
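For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between a reference transcript and an ASR transcript of the synthesized audio, divided by the number of reference words. The sketch below shows that standard definition only. The function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions made for illustration; this page does not state which ASR model or text normalization produced the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization here (lowercasing, whitespace split) is an assumption
    for this sketch, not the evaluation protocol of the paper.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

A corpus-level WER is usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates.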

Model Overview

Noisy and incomplete → Clean and complete

Our model progressively refines the mel-spectrogram while jumping toward the target length. Below are mel-spectrogram samples and the corresponding vocoded audio at different steps of the diffusion process, from t ≈ 1 (noisy and incomplete) to t = 0 (clean and complete).

Decoder state	Mel-spectrogram sample	Vocoded audio
t = 0.998
t = 0.797
t = 0.579
t = 0.397
t = 0.198
t = 0.000
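The progression above, in which the sequence becomes both longer and cleaner as t decreases, can be illustrated with a toy loop. This is not the authors' model: everything in it is an assumption made for illustration. In particular, the "denoiser" is an oracle that already knows the clean target, the jump rule simply spreads insertions evenly over the remaining steps, frames are scalars rather than mel vectors, and only insertions (no deletions) are modeled. The names `resample` and `toy_jump_refine` are hypothetical.

```python
import random


def resample(target, n):
    """Nearest-neighbour resampling of `target` to n frames (oracle stand-in)."""
    m = len(target)
    return [target[min(m - 1, (i * m) // n)] for i in range(n)]


def toy_jump_refine(target, start_len=2, steps=10, seed=0):
    """Alternate discrete length jumps with continuous refinement steps.

    Returns the final sequence and the sequence length after each step.
    Assumes start_len <= len(target); deletions are not modeled.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(start_len)]  # short, pure noise
    lengths = [len(x)]

    for k in range(steps):
        remaining = steps - k

        # Jump part: insert noisy frames so the target length is reached
        # by the final step (ceil division spreads insertions evenly).
        missing = len(target) - len(x)
        n_insert = -(-missing // remaining) if missing > 0 else 0
        for _ in range(n_insert):
            pos = rng.randint(0, len(x))  # inclusive bounds; len(x) appends
            x.insert(pos, rng.gauss(0.0, 1.0))

        # Continuous part: move each frame a fraction of the way toward the
        # oracle prediction; the fraction reaches 1 on the last step.
        pred = resample(target, len(x))
        frac = 1.0 / remaining
        x = [xi + (pi - xi) * frac for xi, pi in zip(x, pred)]
        lengths.append(len(x))

    return x, lengths
```

Calling `toy_jump_refine` on a short list of floats returns a sequence that matches the target in both length and value, with the length growing during the early steps and the values sharpening throughout, loosely mirroring the table above.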

Speech Samples

Text	GT	Grad-TTS	OneShot (Ours)	TDD	UDD (Ours)
together with a great increase in the payrolls there has come a substantial rise in the total of industrial profits
and recognized as one of the frequenters of the bogus lawstationers His arrest led to that of others
All the committee could do in this respect was to throw the responsibility on others
who came from his room ready dressed a suspicious circumstance as he was always late in the morning
yet he could not overcome the strange fascination it had for him and remained by the side of the corpse till the stretcher came
The inadequacy of the jail was noticed and reported upon again and again by the grand juries of the city of London
is closely reproduced in the lifehistory of existing deer Or in other words

Slow Speech (0.75x): Uniform Stretching vs Adaptive Pause

Grad-TTS	UDD (Ours)
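The difference between the two behaviours can be made concrete with a little arithmetic on segment durations. At 0.75x speed the total duration grows by a factor of 1/0.75 ≈ 1.33. Uniform stretching applies that factor to every segment, while a pause-based strategy leaves the spoken segments unchanged and spends the extra time as silence at boundaries. The sketch below is illustrative only: the function names, the frame counts, and the choice to split the extra time equally across hand-picked boundaries are all assumptions, whereas the model on this page places pauses without any such rule.

```python
def uniform_stretch(durations, rate):
    """Scale every segment by 1/rate, as a fixed-alignment model effectively does."""
    return [d / rate for d in durations]


def pause_insertion(durations, boundaries, rate):
    """Keep segment durations and add the extra time as pauses.

    `boundaries` holds indices of segments that are followed by a pause
    (for example, phrase ends). It must be non-empty.
    """
    if not boundaries:
        raise ValueError("need at least one boundary to place a pause")
    total = sum(durations)
    extra = total / rate - total          # time added by slowing down
    pause = extra / len(boundaries)       # equal split: an assumption
    out = []
    for i, d in enumerate(durations):
        out.append(("speech", d))
        if i in boundaries:
            out.append(("pause", pause))
    return out
```

For example, with segment durations of 6, 9, 3, and 12 frames (30 in total) and a rate of 0.75, both strategies produce 40 frames. `uniform_stretch` lengthens every segment, whereas `pause_insertion` with boundaries `{1, 3}` keeps the four segments at their original lengths and adds two 5-frame pauses.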