TADA! Tuning Audio Diffusion Models through Activation Steering

Figure: TADA teaser. Steering vectors are applied to the attention layers of an audio diffusion model to control attributes such as tempo, mood, and timbre.

Abstract

Text-to-audio diffusion models have shown impressive capabilities in generating realistic audio from text descriptions, but they often lack fine-grained control over specific audio attributes. We present TADA (Tuning Audio Diffusion with Activation steering), a lightweight method for steering the generation process of pre-trained audio diffusion models by manipulating their internal activations. Our approach identifies concept-specific steering vectors from a small set of contrastive audio pairs and uses them to guide the diffusion process toward desired audio characteristics, such as the presence of specific instruments, vocal qualities, tempo, or mood, without retraining the model. We demonstrate that TADA enables continuous, fine-grained control over multiple audio attributes simultaneously, generalizes across diverse text prompts, and can be combined with existing text-to-audio models as a plug-and-play module. Extensive experiments show that our method achieves effective attribute control while preserving overall audio quality and text alignment.
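
To make the steering-vector idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the steering vector is the difference between mean activations on the two sides of the contrastive pairs, and that steering adds a scaled copy of that vector to an activation at inference time. The function names, the toy activations, and the pooling of each clip into a single vector are all illustrative assumptions.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation difference between the two sides of the contrastive pairs.

    pos_acts, neg_acts: arrays of shape (num_pairs, dim), one pooled
    activation per clip (pooling over time/tokens is an assumption here).
    """
    pos = np.asarray(pos_acts, dtype=np.float64)
    neg = np.asarray(neg_acts, dtype=np.float64)
    return pos.mean(axis=0) - neg.mean(axis=0)

def steer(activation, vector, strength):
    """Shift an activation along the steering vector.

    strength = 0 leaves the activation unchanged; negative values push
    away from the concept, positive values push toward it.
    """
    return np.asarray(activation, dtype=np.float64) + strength * vector

# Toy data standing in for real model activations (hypothetical).
rng = np.random.default_rng(0)
dim = 8
concept = np.zeros(dim)
concept[0] = 1.0                      # made-up "concept" direction
base = rng.normal(size=(16, dim))
pos_acts = base + 2.0 * concept       # e.g. clips that contain the attribute
neg_acts = base                       # matched clips that lack it

v = steering_vector(pos_acts, neg_acts)
h = rng.normal(size=dim)              # an activation from a new generation
steered = steer(h, v, strength=1.5)
```

In a real model the `steer` step would typically run inside a forward hook on the chosen layers at every denoising step; that wiring depends on the specific architecture and is omitted here.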

Audio Examples

Use the slider to select a steering strength (λ). For each prompt, compare the four approaches side by side.

Piano

"A melodic jazz piece with smooth rhythms"
−λ No steering
no piano ← → piano
CAA (All layers)
CAA (layers w/o {6,7})
CAA (layers {6,7})
SAE (layer {7})
"Slow tempo, peaceful meditation soundscape with a soft violin"
−λ No steering
no piano ← → piano
CAA (All layers)
CAA (layers w/o {6,7})
CAA (layers {6,7})
SAE (layer {7})

Tempo

"Cinematic soundtrack with dramatic tension"
−λ No steering
slow ← → fast
CAA (All layers)
CAA (layers w/o {6,7})
CAA (layers {6,7})
SAE (layer {7})
"Latin music with percussion and guitar"
−λ No steering
slow ← → fast
CAA (All layers)
CAA (layers w/o {6,7})
CAA (layers {6,7})
SAE (layer {7})

Vocal Gender

"Pop song with catchy vocal melody and synth"
−λ No steering
male ← → female
CAA (All layers)
CAA (layers w/o {6,7})
CAA (layers {6,7})
SAE (layer {7})
"Rock anthem with powerful vocals and electric guitar"
−λ No steering
male ← → female
CAA (All layers)
CAA (layers w/o {6,7})
CAA (layers {6,7})
SAE (layer {7})
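
The three CAA variants listed for each prompt appear to differ only in which layers receive the steering vector. The sketch below shows one way to express those layer subsets, under several assumptions: the layer count (`NUM_LAYERS = 12`) and zero-based indexing are hypothetical because the page does not state them, activations are plain lists, and the helper names are invented for illustration. The SAE variant works differently and is not covered.

```python
NUM_LAYERS = 12   # hypothetical depth; the page does not state the layer count
FOCUS = {6, 7}    # the layer pair singled out in the demo labels

def caa_layer_sets(num_layers, focus):
    """Map each CAA variant label to the set of layer indices it steers."""
    all_layers = set(range(num_layers))
    return {
        "CAA (All layers)": all_layers,
        "CAA (layers w/o {6,7})": all_layers - set(focus),
        "CAA (layers {6,7})": set(focus),
    }

def apply_steering(acts_by_layer, vector, strength, layers):
    """Return new per-layer activations, shifting only the selected layers."""
    out = {}
    for idx, act in acts_by_layer.items():
        if idx in layers:
            out[idx] = [a + strength * v for a, v in zip(act, vector)]
        else:
            out[idx] = list(act)  # untouched layers are copied as-is
    return out

# Toy activations: one 2-dimensional vector per layer (hypothetical).
acts = {i: [0.0, 0.0] for i in range(NUM_LAYERS)}
vector = [1.0, -1.0]
configs = caa_layer_sets(NUM_LAYERS, FOCUS)

only_6_7 = apply_steering(acts, vector, 2.0, configs["CAA (layers {6,7})"])
without_6_7 = apply_steering(acts, vector, 2.0, configs["CAA (layers w/o {6,7})"])
```

Sweeping `strength` from negative to positive values while holding the layer set fixed mirrors what the slider does for each row of the demo.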