Stability AI Introduces ‘Stable Audio’ Model for Controllable Audio Generation: A Breakthrough Innovation
Introduction:
Stability AI has introduced “Stable Audio,” a latent diffusion model designed to revolutionize audio generation. By conditioning on text metadata, audio duration, and start time, the model offers unprecedented control over the content and length of generated audio, including the creation of complete songs.
The Limitation of Traditional Audio Diffusion Models:
- Traditional audio diffusion models are typically trained to generate output of a single, fixed duration, so their results often end abruptly in the middle of a musical phrase.
- This limitation stems from training on audio chunks randomly cropped from longer files and padded or trimmed to a predetermined length, with crop boundaries that ignore musical structure (see the sketch below).
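To make the problem concrete, here is a minimal PyTorch sketch of how a fixed-size training window is typically carved out of a longer recording; the helper function name and details are invented for illustration, not taken from Stability AI's code. Because the crop offset is random, the window routinely starts or ends in the middle of a phrase.

```python
import torch

def random_fixed_window(waveform: torch.Tensor, window_samples: int) -> torch.Tensor:
    """Crop (or pad) a [channels, samples] waveform to a fixed-size window.

    Mirrors the common practice of training audio diffusion models on randomly
    cropped chunks: the crop boundary ignores musical structure, so phrases can
    be cut off mid-way. Illustrative only.
    """
    channels, total = waveform.shape
    if total <= window_samples:
        # Short clips are zero-padded up to the window size.
        pad = torch.zeros(channels, window_samples - total)
        return torch.cat([waveform, pad], dim=-1)
    # Long clips are cropped at a random offset, regardless of phrase boundaries.
    start = torch.randint(0, total - window_samples + 1, (1,)).item()
    return waveform[:, start:start + window_samples]

# Example: a 3-minute stereo file at 44.1 kHz cropped to a 95-second window.
sr = 44_100
full_song = torch.randn(2, 180 * sr)          # stand-in for a real recording
chunk = random_fixed_window(full_song, 95 * sr)
print(chunk.shape)                            # torch.Size([2, 4189500])
```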
The Advancements of Stable Audio:
Stable Audio addresses this limitation by enabling the generation of audio with a specified length, up to the size of its training window.
- Stable Audio uses a heavily downsampled latent representation of audio, which accelerates inference times compared to raw audio.
- The flagship Stable Audio model can generate 95 seconds of stereo audio at a 44.1 kHz sample rate in under a second using an NVIDIA A100 GPU.
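For a sense of why a heavily downsampled latent representation helps, the back-of-the-envelope comparison below counts the sequence elements involved in 95 seconds of 44.1 kHz stereo audio. The downsampling factor used here is a placeholder assumption, since the article does not state the model's actual compression ratio.

```python
# Rough comparison of sequence lengths a diffusion model would have to process
# for 95 seconds of 44.1 kHz stereo audio. The latent downsampling factor is a
# placeholder; the actual Stable Audio ratio is not given in the article.
SAMPLE_RATE = 44_100
DURATION_S = 95
CHANNELS = 2

raw_samples = SAMPLE_RATE * DURATION_S * CHANNELS
print(f"raw audio samples: {raw_samples:,}")       # 8,379,000

HYPOTHETICAL_DOWNSAMPLE = 1024                     # illustrative only
latent_steps = (SAMPLE_RATE * DURATION_S) // HYPOTHETICAL_DOWNSAMPLE
print(f"latent time steps: {latent_steps:,}")      # 4,091
```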
The Core Architecture of Stable Audio:
The core architecture of Stable Audio consists of a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model.
- The VAE compresses stereo audio into a noise-resistant, lossy latent encoding, speeding up generation and training processes.
- A text encoder derived from a CLAP model feeds text-prompt features into the cross-attention layers of the diffusion U-Net, giving the model the ability to relate words to sounds (a toy wiring of these components is sketched below).
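The toy module below sketches, in PyTorch, how a VAE encoder, a text encoder, and a conditioned denoiser could be wired together in a latent diffusion pipeline of this kind. Every submodule here is a stand-in invented for illustration; this is not Stability AI's implementation.

```python
import torch
import torch.nn as nn

class TinyLatentAudioDiffusion(nn.Module):
    """Toy wiring of the three components described above: a VAE-style encoder
    that maps audio to latents, a text encoder that maps prompts to conditioning
    embeddings, and a denoiser that predicts noise in latent space.
    All submodules are placeholders, not the real Stable Audio ones."""

    def __init__(self, latent_dim=64, text_dim=512):
        super().__init__()
        # Stand-in "VAE encoder": strided 1-D convs that downsample stereo audio.
        self.vae_encoder = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(32, latent_dim, kernel_size=8, stride=4, padding=2),
        )
        # Stand-in text encoder: in the real model this role is played by a
        # CLAP-style encoder; here it is just an embedding bag over token ids.
        self.text_encoder = nn.EmbeddingBag(10_000, text_dim)
        # Stand-in denoiser: the real model is a U-Net with residual,
        # self-attention, and cross-attention layers.
        self.denoiser = nn.Sequential(
            nn.Conv1d(latent_dim + 1, 128, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(128, latent_dim, kernel_size=3, padding=1),
        )
        self.text_to_bias = nn.Linear(text_dim, latent_dim)

    def forward(self, audio, token_ids, t):
        z = self.vae_encoder(audio)                     # [B, latent, T']
        noise = torch.randn_like(z)
        z_noisy = z + t.view(-1, 1, 1) * noise          # crude noising step
        cond = self.text_to_bias(self.text_encoder(token_ids))
        z_noisy = z_noisy + cond.unsqueeze(-1)          # inject text condition
        t_map = t.view(-1, 1, 1).expand(-1, 1, z.shape[-1])
        pred = self.denoiser(torch.cat([z_noisy, t_map], dim=1))
        return pred, noise                              # train pred to match noise

model = TinyLatentAudioDiffusion()
audio = torch.randn(4, 2, 44_100)                       # 1 second of stereo audio
tokens = torch.randint(0, 10_000, (4, 12))
t = torch.rand(4)
pred, noise = model(audio, tokens, t)
print(pred.shape, noise.shape)
```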
Unique Conditioning for Length Control:
During training, the Stable Audio model learns to incorporate two key properties from audio chunks: the starting second (“seconds_start”) and the total duration of the original audio file (“seconds_total”).
- These properties are transformed into discrete learned embeddings per second, which are then concatenated with the text prompt tokens.
- This unique conditioning allows users to specify the desired length of the generated audio during inference.
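A minimal sketch of this style of timing conditioning is shown below in PyTorch; the class name, embedding dimension, and maximum duration are assumptions made for illustration rather than the actual Stable Audio implementation.

```python
import torch
import torch.nn as nn

class TimingConditioning(nn.Module):
    """Turns (seconds_start, seconds_total) into learned per-second embeddings
    and concatenates them with the text prompt tokens, so the denoiser can
    attend to both. Sizes and names are illustrative assumptions."""

    def __init__(self, max_seconds=512, embed_dim=768):
        super().__init__()
        # One discrete learned embedding per possible second value.
        self.start_embed = nn.Embedding(max_seconds, embed_dim)
        self.total_embed = nn.Embedding(max_seconds, embed_dim)

    def forward(self, text_tokens, seconds_start, seconds_total):
        # text_tokens: [batch, n_tokens, embed_dim] from the text encoder.
        start = self.start_embed(seconds_start).unsqueeze(1)   # [B, 1, D]
        total = self.total_embed(seconds_total).unsqueeze(1)   # [B, 1, D]
        # Concatenate timing embeddings with the prompt tokens along the
        # sequence axis; cross-attention in the U-Net then sees all of them.
        return torch.cat([text_tokens, start, total], dim=1)

cond = TimingConditioning()
text_tokens = torch.randn(2, 77, 768)          # stand-in prompt features
seconds_start = torch.tensor([0, 30])          # crops start at 0 s and 30 s
seconds_total = torch.tensor([95, 180])        # full files are 95 s and 180 s
print(cond(text_tokens, seconds_start, seconds_total).shape)   # [2, 79, 768]
```

At inference time, fixing the start value and setting the total-duration value to the desired clip length is how a user dials in how long the generated audio should be.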
The Diffusion Model and Dataset:
- The diffusion model at the heart of Stable Audio has 907 million parameters and utilizes residual layers, self-attention layers, and cross-attention layers to denoise the input while considering text and timing embeddings.
- Stability AI curated an extensive dataset of over 800,000 audio files, including music, sound effects, and single-instrument stems, totaling 19,500 hours of audio.
- The dataset was provided through a partnership with AudioSparx, a leading stock music provider.
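For intuition about what residual, self-attention, and cross-attention layers do inside such a denoiser, here is a small illustrative PyTorch block of that shape; the dimensions and layer ordering are guesses for the sake of the example and do not reproduce the 907-million-parameter model.

```python
import torch
import torch.nn as nn

class DenoiserBlock(nn.Module):
    """One illustrative U-Net block: a residual convolution, self-attention
    over the latent sequence, and cross-attention to the text/timing
    embeddings. Purely a sketch of the layer types named in the article."""

    def __init__(self, dim=256, cond_dim=768, heads=8):
        super().__init__()
        self.res_conv = nn.Sequential(
            nn.GroupNorm(8, dim), nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True,
                                                kdim=cond_dim, vdim=cond_dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, cond):
        # x:    [batch, dim, time]        noisy latent sequence
        # cond: [batch, n_cond, cond_dim] text + timing embeddings
        x = x + self.res_conv(x)                    # residual conv layer
        h = x.transpose(1, 2)                       # -> [batch, time, dim]
        h = h + self.self_attn(self.norm1(h), self.norm1(h), self.norm1(h))[0]
        h = h + self.cross_attn(self.norm2(h), cond, cond)[0]
        return h.transpose(1, 2)

block = DenoiserBlock()
latents = torch.randn(2, 256, 512)
cond = torch.randn(2, 79, 768)
print(block(latents, cond).shape)                   # torch.Size([2, 256, 512])
```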
Continued Research and Future Releases:
Stable Audio represents the forefront of audio generation research and emerges from Stability AI’s generative audio research lab, Harmonai.
- The team at Stability AI is dedicated to advancing model architectures, refining datasets, and enhancing training procedures to improve output quality, control, speed, and achievable output lengths.
- Future releases from Harmonai may include open-source models based on Stable Audio, along with accessible training code.
Conclusion:
Stable Audio introduces a revolutionary approach to audio generation, providing unprecedented control over content and length. With the ability to generate complete songs and accelerated inference times, Stable Audio showcases the power of generative AI in the audio domain.