The Goal: Streamlining Soundtrack Creation with AI

The primary objective of this PoC was to validate whether generative AI technologies can automate key aspects of the soundtrack creation workflow. Traditionally, producing film music has involved expensive recording sessions with professional musicians. By exploring AI as an alternative, the PoC sought not only to reduce time and cost but also to push creative boundaries in film music composition. This marks a significant leap toward democratizing music production, enabling filmmakers to create high-quality soundtracks without the hefty budgets typically required.

Methodology for AI Music Generation

Auto-Regressive Approach

One approach to solving the problem is based on auto-regressive generation. Here, audio is represented as discrete acoustic tokens learned by a Residual Vector Quantized Variational Auto-Encoder (RVQ-VAE), and each token sequence can be converted back into short chunks of waveform. New music is produced by sampling new acoustic tokens with a transformer-based language model and feeding them through the RVQ-VAE decoder.
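
A toy sketch of this loop is shown below. The classes are deliberately tiny stand-ins (a GRU instead of a full transformer, a single linear layer as the "decoder") meant only to show the shape of the pipeline, not a real model.

```python
# Conceptual sketch of auto-regressive audio generation (toy stand-ins, not a real
# model): a language model proposes discrete acoustic tokens step by step, and an
# RVQ-VAE-style decoder turns the token sequence back into a waveform chunk.
import torch
import torch.nn as nn

VOCAB, DIM, FRAME = 1024, 64, 320  # token vocabulary, latent size, samples per token

class TinyTokenLM(nn.Module):
    """Stand-in for the transformer language model over acoustic tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)  # simple recurrent core for the demo
        self.head = nn.Linear(DIM, VOCAB)
    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h[:, -1])                     # logits for the next token only

class TinyRVQDecoder(nn.Module):
    """Stand-in for the RVQ-VAE decoder mapping each token to a short waveform frame."""
    def __init__(self):
        super().__init__()
        self.codebook = nn.Embedding(VOCAB, DIM)
        self.to_wave = nn.Linear(DIM, FRAME)
    def forward(self, tokens):
        frames = torch.tanh(self.to_wave(self.codebook(tokens)))
        return frames.reshape(tokens.shape[0], -1)     # concatenate frames into one waveform

@torch.no_grad()
def generate(lm, decoder, n_tokens=50):
    tokens = torch.zeros(1, 1, dtype=torch.long)       # a "start" token
    for _ in range(n_tokens):
        logits = lm(tokens)
        nxt = torch.distributions.Categorical(logits=logits).sample().unsqueeze(1)
        tokens = torch.cat([tokens, nxt], dim=1)       # append the sampled token
    return decoder(tokens[:, 1:])                      # decode all generated tokens to audio

wave = generate(TinyTokenLM(), TinyRVQDecoder())
print(wave.shape)  # (1, 50 * 320) samples of toy audio
```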

Case Study: MusicGen

In MusicGen (Copet et al., 2023) the language model generates one token at a time, and groups of 4 tokens (one per residual codebook) are combined into a single latent-space entry. The language model is conditioned on the text prompt via cross-attention, with the text encoded by a pre-trained T5 model. Supplementary conditioning can be provided by passing a melody as a .wav file: the model extracts its chromagram (capturing the main melody), encodes it, and uses it to guide the acoustic token generation.
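
Purely to illustrate what that melody signal looks like (this is not MusicGen's internal code, which performs its own chroma extraction), a chromagram can be computed from a .wav file with librosa; the file name below is a placeholder.

```python
# Illustrative only: extract a chromagram from an audio file to see the kind of
# melody representation used for conditioning.
import librosa

y, sr = librosa.load("melody.wav", sr=32000)        # load audio, resampled to 32 kHz
chroma = librosa.feature.chroma_stft(y=y, sr=sr)    # 12 pitch classes x time frames
print(chroma.shape)                                 # e.g. (12, n_frames)

# Each column shows which of the 12 pitch classes dominate at that moment,
# which is enough to capture the main melody while discarding timbre.
```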

Diffusion Approach

Recent successes of diffusion models in image generation (DALL-E, Midjourney) are promising and have encouraged researchers to apply the same ideas to audio. Compared to the auto-regressive architecture, diffusion models can be more computationally efficient, since the whole signal is denoised in parallel rather than generated token by token.

Two key approaches of diffusion models are:

  1. Spectrogram to Waveform Conversion:
  • A spectrogram is generated via diffusion and then converted into a waveform using a neural vocoder.
  2. Latent Diffusion (see the sketch after this list):
  • Audio is first encoded into a latent representation using a Variational Autoencoder (VAE).
  • Diffusion denoising is applied to the latent space.
  • The latent representation is then decoded back into a waveform, producing the final audio.
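
The latent-diffusion variant can be sketched as follows. The networks are tiny placeholders and the denoising update is deliberately crude; the point is only the overall structure: noise in latent space, iterative denoising, then VAE decoding back to audio.

```python
# Minimal sketch of latent-diffusion audio generation (hypothetical toy components,
# not any production model's actual code).
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the U-Net/transformer that predicts noise in latent space."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))
    def forward(self, z, t):
        # Condition on the (normalized) timestep by concatenating it to the latent.
        t_feat = t.expand(z.shape[0], 1)
        return self.net(torch.cat([z, t_feat], dim=-1))

class TinyVAEDecoder(nn.Module):
    """Stand-in for the VAE decoder that maps latents back to waveform samples."""
    def __init__(self, dim=64, samples=16000):
        super().__init__()
        self.net = nn.Linear(dim, samples)
    def forward(self, z):
        return torch.tanh(self.net(z))

@torch.no_grad()
def sample_latent_diffusion(denoiser, decoder, steps=50, dim=64):
    z = torch.randn(1, dim)                  # start from pure noise in latent space
    for i in reversed(range(steps)):
        t = torch.tensor([[i / steps]])
        eps = denoiser(z, t)                 # predict the noise component
        z = z - eps / steps                  # crude denoising update (illustrative only)
    return decoder(z)                        # decode the clean latent into a waveform chunk

audio = sample_latent_diffusion(TinyDenoiser(), TinyVAEDecoder())
print(audio.shape)  # torch.Size([1, 16000]) - one second of 16 kHz audio in this toy setup
```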

Case Study: Stable Audio

A noteworthy example is Stable Audio, which uses cross-attention mechanisms in its denoising U-Net architecture. This allows the system to condition the generated audio on both the user's text prompt and timing information, such as the desired start point and total length of the piece.
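
A toy illustration of that conditioning mechanism (not Stable Audio's actual code): the noisy latent supplies the attention queries, while the concatenated text and timing embeddings supply the keys and values.

```python
# Toy cross-attention conditioning: the denoiser's latent attends to prompt + timing tokens.
import torch
import torch.nn as nn

dim = 64
latent = torch.randn(1, 128, dim)       # noisy audio latent: 128 "time" positions
text_emb = torch.randn(1, 20, dim)      # placeholder for encoded prompt tokens
timing_emb = torch.randn(1, 2, dim)     # placeholder for encoded start-time / duration values

cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
conditioning = torch.cat([text_emb, timing_emb], dim=1)   # one shared conditioning sequence
out, _ = cross_attn(query=latent, key=conditioning, value=conditioning)
print(out.shape)  # torch.Size([1, 128, 64]) - latent updated with prompt + timing context
```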

However, for this PoC, MusicGen was selected due to its ability to handle both text and melody conditioning. MusicGen can generate music segments of up to 30 seconds, making it an ideal candidate for this experiment.

Conditioning Signals: Key to Music Alignment

To ensure the generated music aligns with user expectations, the models rely heavily on conditioning signals. One of the most promising aspects of creating AI-based soundtracks is the ability to tailor music to specific scenes or emotional tones. Directors and composers may soon be able to tune AI-generated music in real-time, specifying mood descriptors such as “melancholic piano with subtle string accompaniment” or “optimistic '80s synth-pop with a driving bass line.” This could drastically reduce iterative exchanges between composer and director, speeding up the production process.

Conditional signals include:

  • Text Prompts: These define the music style and genre, such as "an 80s pop rock song with a strong guitar riff."
  • Melody Conditioning: Models like MusicGen and ControlMusicNet generate music based on a melody, ensuring adherence to a given melodic structure.
  • Advanced Conditioning: Additional factors, such as rhythm, volume, and timestamps (as seen in Stable Audio), allow more control over the segmentation of the track (intro, main body, outro).

Potential for Personalization & Customization

One of the most promising aspects of AI-driven soundtrack creation is the ability to personalize the output according to specific project needs. For instance, soon filmmakers may be able to easily tweak the mood, instrumentation, or timing of a score to better suit the pacing or emotional tone of a scene. This opens up unprecedented possibilities for creating unique soundtracks that are tailored to individual creative visions.

The Costs of Training AI Models

Training AI models, particularly for music generation, comes at a high cost. Typically, the training process involves large amounts of raw audio data and conditioning signals, which require powerful computers and extended processing times.

For example, training ControlMusicNet costs around $10,000, and the resulting model generates only six-second audio clips. Similarly, MusicGen was trained on 20,000 hours of music using 64 A100 GPUs; given the length of such a training run, its cost likely exceeded $20,000.
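
As a rough sanity check on that figure (assuming an illustrative cloud price of about $2 per A100-hour, which varies widely by provider): 64 GPUs cost roughly 64 × $2 ≈ $128 per hour of training, so $20,000 corresponds to about 156 hours, or around a week, of continuous training on such a cluster.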

To put these costs in perspective, hiring a live orchestra for a full film score can easily cost between $25,000 and $50,000 for a single session, depending on the size and complexity of the project. Although the upfront costs of AI training are significant, the potential savings in time and repetitive tasks could outweigh these expenses, especially for long-term or large-scale productions.

Development of the PoC Application

The PoC was developed in two key stages:

  1. MIDI Extension:

To start, short MIDI snippets, representing the composer's idea, were extended using the Allegro Music Transformer. For those unfamiliar, MIDI (Musical Instrument Digital Interface) is a digital standard that allows instruments and computers to communicate, often used to create the structure or blueprint of a song, such as its melody or rhythm. In this PoC, the original MIDI files were essentially sketches, which were then extended to create more complete compositions. After extending these snippets, they were converted into .wav files (a high-quality audio format) for further processing.
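
As a minimal sketch of this stage: the extension call below is a hypothetical wrapper (the Allegro Music Transformer's actual interface may differ), while the MIDI-to-.wav rendering uses the real pretty_midi and soundfile libraries.

```python
# Sketch of the MIDI -> .wav step; extend_midi() is a hypothetical placeholder.
import pretty_midi
import soundfile as sf

def midi_to_wav(midi_path: str, wav_path: str, sample_rate: int = 44100) -> None:
    """Render a MIDI file to a .wav file with pretty_midi's built-in synthesizer."""
    midi = pretty_midi.PrettyMIDI(midi_path)
    audio = midi.synthesize(fs=sample_rate)       # simple waveform rendering of the notes
    sf.write(wav_path, audio, sample_rate)

# extended = extend_midi("sketch.mid")            # hypothetical call to the MIDI-extension model
# midi_to_wav(extended, "sketch_extended.wav")    # feed the result to the next stage
```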

  2. Music Generation:

In the next stage, MusicGen was used to generate 30-second music pieces conditioned on these .wav files and on user-provided text prompts. In simple terms, MusicGen takes the extended melody as a reference together with a text description of the desired style and produces a complete audio track, which makes it well suited for catchy, melody-driven pieces.
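
As a rough sketch of how this stage can be wired up with the publicly available audiocraft package (file names are placeholders and the prompt is taken from the examples above), melody-plus-text generation might look like this:

```python
# Sketch: generate a 30-second piece conditioned on a melody .wav and a text prompt.
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)                  # 30-second output

melody, sr = torchaudio.load('sketch_extended.wav')       # the extended composer sketch
wav = model.generate_with_chroma(
    descriptions=['an 80s pop rock song with a strong guitar riff'],
    melody_wavs=melody[None],                             # add a batch dimension
    melody_sample_rate=sr,
)
audio_write('generated_0', wav[0].cpu(), model.sample_rate, strategy='loudness')
```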

Dataset

In the first stage, the models were used as-is (i.e. their pretrained versions). Three input MIDI files were sourced from open-source websites on the Internet: Coco Jamboo by Mr. President, Rhiannon by Fleetwood Mac, and Bracka by Grzegorz Turnau, chosen because each opens with a distinct melody. The three songs represent very different musical styles, which lets us investigate how well the two models behave when presented with different musical ideas.

Experimentation & Findings

The experimentation phase involved two primary tests:

Experiment #1: Evaluating MusicGen

This experiment involved tweaking parameters such as the sampling temperature and top_k, as well as conditioning on text prompts alone. While the default model parameters generally yielded the best results, strict adherence to the original melody proved hard to achieve. With text conditioning, the generated music was quite good: it reflected the style requested in the prompt and the composition was pleasant to listen to.
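
A sketch of what such a parameter sweep can look like with the audiocraft API; the prompt and the specific temperature/top_k values shown here are illustrative, not the exact grid used in the PoC.

```python
# Sketch: sweep sampling parameters for text-only generation.
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody')
prompt = ['melancholic piano with subtle string accompaniment']

for temperature in (0.8, 1.0, 1.2):
    for top_k in (100, 250, 500):
        model.set_generation_params(duration=30, use_sampling=True,
                                    temperature=temperature, top_k=top_k)
        wav = model.generate(prompt)      # text-only conditioning, no melody
        # ...save and compare each variant here
```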

Findings: Melody conditioning requires further refinement to improve accuracy, while text conditioning already performs well in its current state.

Experiment #2: MIDI Extension with Allegro Music Transformer

This test used multi-instrument tracks to assess the model’s performance in extending MIDI files. The model continued each composition based on its first 30 seconds and generated four alternative continuations.

Findings: While the model performed adequately for single-instrument tracks like piano, the quality diminished for more complex, multi-instrument compositions.

Conclusions & Future Directions

This PoC has demonstrated that AI-driven music generation holds significant promise for film soundtracks, but challenges remain. 

Key insights from our experiments include:

  1. MIDI Extension: AI models can successfully extend MIDI-generated music pieces.
  2. High-Quality Music Generation: Generative models are capable of producing music whose style aligns well with user-specified text prompts.
  3. Melody Conditioning: This area still requires improvement to meet the high standards of professional music production.

Looking ahead, the next steps involve refining melody conditioning techniques and further fine-tuning models with specific datasets. The long-term vision is to advance AI technologies to a point where they can rival, or even surpass, human-created compositions, ultimately transforming the way music is produced for films. This PoC is just the beginning of an exciting journey toward a future where AI plays a central role in shaping the soundtracks of tomorrow’s cinematic experiences.

Beyond the Movie: Expanding the Scope of AI-Generated Music

While this PoC focuses on movie soundtracks, the broader applications of AI-generated music are vast. From gaming soundscapes to advertising jingles, artificial intelligence has the potential to disrupt many industries by providing cost-effective, personalized, and scalable music production. As technology matures, we can expect artificial intelligence to play a key role not only in film but also in the future of audio production in many creative fields.

The journey toward AI-assisted music creation is just beginning. The possibilities for personalization, accessibility, and creativity are vast – and we are only scratching the surface of what is to come.

Curious to explore the topic a bit more? Don't hesitate to contact us!

---

Check out the results of our experiments:

1. Baseline audio file.

2. Prompt-conditioned (“A soundtrack to a Netflix show featuring only a single violin”) file.