TADA! Tuning Audio Diffusion Models through Activation Steering


TADA teaser: steering vectors applied to audio diffusion model attention layers to control attributes like tempo, mood, and timbre

Abstract

Audio diffusion models can synthesize high-fidelity music from text, yet fine-grained control over specific musical attributes remains challenging because their internal mechanisms for representing high-level concepts are poorly understood. In this work, we use activation patching to demonstrate that recent audio diffusion architectures exhibit a semantic bottleneck, in which a small, shared subset of consecutive attention layers controls distinct musical concepts, such as the presence of specific instruments, vocals, or genres. Building on this finding, we systematically evaluate a broad spectrum of steering paradigms, comparing activation steering against prompt-level, score-space, and weight-space interventions and analyzing the interaction between the steering mechanism and the intervention site. Our new benchmark, supported by an extensive user study, demonstrates that localized activation steering establishes a new state of the art in audio concept modulation.
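To make the idea of localized activation steering concrete, here is a minimal toy sketch. It is not the paper's implementation. It assumes a common difference-of-means construction for the steering vector (as used in contrastive activation addition) and a stand-in "model" made of plain Python functions. The names `steering_vector`, `run_with_steering`, and `steer_at` are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def steering_vector(pos_acts, neg_acts):
    """Difference of per-dimension means between two sets of activations.

    pos_acts / neg_acts: lists of equal-length float vectors, e.g. activations
    collected with and without the target concept (assumed setup).
    """
    dims = range(len(pos_acts[0]))
    pos_mean = [fmean(a[d] for a in pos_acts) for d in dims]
    neg_mean = [fmean(a[d] for a in neg_acts) for d in dims]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def run_with_steering(layers, x, vector, alpha, steer_at):
    """Run a toy stack of layers, adding alpha * vector after selected layers.

    layers:   list of callables mapping a vector to a vector (toy stand-ins
              for attention layers)
    steer_at: set of layer indices to steer; restricting it to a few
              consecutive layers mimics a localized intervention
    alpha:    steering strength; alpha = 0 leaves the forward pass unchanged
    """
    h = list(x)
    for i, layer in enumerate(layers):
        h = layer(h)
        if i in steer_at:
            h = [hj + alpha * vj for hj, vj in zip(h, vector)]
    return h


def double(h):
    """Toy layer: scales every component by 2."""
    return [2 * v for v in h]


toy_layers = [double, double, double]
v = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
unsteered = run_with_steering(toy_layers, [1.0, 0.0], v, alpha=0.0, steer_at={1})
steered = run_with_steering(toy_layers, [1.0, 0.0], v, alpha=0.5, steer_at={1})
```

In this sketch the strength `alpha` plays the role of the α slider in the examples: at `alpha = 0` the output matches the unsteered pass, and negative values push in the opposite direction. Steering every index instead of a small `steer_at` set would correspond to an all-layers variant.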

Steering Audio Examples

Pick an example, then drag any method's slider to steer the audio in real time. The first row is a baseline picker — swap in different methods to compare against the localized variants.

Steering Piano Example

Baseline · α = 0

Slider range for every method: ← less piano | more piano → (shown at α = 0)

Baselines (all layers)

Our localized: AUSteer (loc.), CAA (loc.), SAE (loc.)

Steering Vocal Gender Example

Baseline · α = 0

Slider range for every method: ← male | female → (shown at α = 0)

Baselines (all layers)

Our localized: AUSteer (loc.), CAA (loc.), SAE (loc.)

Steering Tempo Example

Baseline · α = 0

Slider range for every method: ← slower | faster → (shown at α = 0)

Baselines (all layers)

Our localized: AUSteer (loc.), CAA (loc.), SAE (loc.)


Steering Piano Example (selectable examples):
1. Slow tempo, peaceful meditation soundscape with a soft violin
2. Calm Coldplay-style song with slight piano
3. Latin Spanish music with subtle guitar
4. Melancholic jazz ballad with smooth saxophone
5. Slow tempo, peaceful meditation soundscape (v2)