Why Try AI? (www.whytryai.com)
Google Has Impressive New Music AI. And No, You Can’t Have It.

MusicLM is Google's new AI model capable of generating rich music samples based on text prompts and melody input. But it won't be releasing to the public yet.

Daniel Nest
Feb 3, 2023

Good [appropriate time of day], wherever you are!

I tend to mostly write about text-to-image AI, because pictures are static and I have the attention span of a…uh…what were we talking about?

But generative AI is making huge strides in other areas, including music.

I briefly mentioned OpenAI’s MuseNet and Harmonai (Dance Diffusion) in this post from October 2022. In late December, I wrote about Riffusion.

Well, they can all step aside, because there’s a new ‘Riff in town.

That’s him! (I guess.)

The sheriff’s name is MusicLM. It’s a music model made by Google that can compose entire tracks in seconds.

Let’s take a closer look at what MusicLM is capable of…and why we won’t get to play with it in the foreseeable future.


What is MusicLM?

It’s an AI algorithm that can compose music based on different types of input, including text prompts and melody samples.

You can check out the many examples right here. If you’re interested in the background research (nerd), you can read this paper in PDF format:

View Paper [PDF]

But I’ll go ahead and highlight some of the key points.

What can MusicLM do?

It can create music. Duh.

Thanks for reading!

Jokes aside, MusicLM has an impressive range of abilities, like:

1. Generate audio from text prompts

Basically, MusicLM does for music what Midjourney and Stable Diffusion do for images. It can make instrumental arrangements from descriptive prompts.

So it can take a complex set of directions like this:

“A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. Induces the experience of being lost in space, and the music would be designed to evoke a sense of wonder and awe, while being danceable.”

…and turn it into a soundtrack like this:

[Audio sample: 0:30]

Check out more examples

2. Stitch prompt sequences into a coherent whole

MusicLM can also create music medleys where parts of the prompt fuse seamlessly with each other. Here’s a sample time-stamped prompt:

time to meditate (0:00-0:15)
time to wake up (0:15-0:30)
time to run (0:30-0:45)
time to give 100% (0:45-0:60)

…and here’s the result:

[Audio sample: 1:00]

Check out more examples
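There's no public MusicLM API to call, but the time-stamped "story mode" prompt above maps neatly onto a simple data structure. Here's a minimal Python sketch of that idea (the segment representation and formatting are my own illustration, not anything Google ships):

```python
# Represent a MusicLM-style "story mode" prompt as (text, start_sec, end_sec) segments.
segments = [
    ("time to meditate", 0, 15),
    ("time to wake up", 15, 30),
    ("time to run", 30, 45),
    ("time to give 100%", 45, 60),
]

def format_prompt(segments):
    """Render segments in the time-stamped style used in Google's examples."""
    def mmss(seconds):
        return f"{seconds // 60}:{seconds % 60:02d}"
    return "\n".join(f"{text} ({mmss(a)}-{mmss(b)})" for text, a, b in segments)

print(format_prompt(segments))
```

The appeal of the format is that each segment carries its own description, so the model (in principle) handles the transitions between moods for you.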

3. Mimic instruments, genres, and more

Google’s model has a solid understanding of instruments…

“harp”:

[Audio sample: 0:10]

…genres…

“west coast hip hop”:

[Audio sample: 0:10]

…and can even reproduce the differences in a performer’s skill level:

“beginner piano player”:

[Audio sample: 0:10]

“crazy fast professional piano player”:

[Audio sample: 0:10]

Check out more examples

4. Combine text prompts with melody input

MusicLM also lets you hum, whistle, or otherwise input a melody, then combine it with a text prompt to assign instruments or genres.

So you could hum “Ode To Joy”:

[Audio sample: 0:10]

Then use a text prompt “a capella chorus” to get this:

[Audio sample: 0:09]

Or “electronic synth lead” to get this:

[Audio sample: 0:09]

Or my personal favorite, “jazz with saxophone”:

[Audio sample: 0:09]

Check out more examples
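To make the melody-conditioning idea concrete: each example above is the same hummed melody paired with a different style prompt. In code terms it's just a two-part input (all names here are hypothetical; MusicLM exposes no such interface):

```python
from dataclasses import dataclass

@dataclass
class MelodyRequest:
    melody_audio: str   # path to your hummed/whistled recording (hypothetical)
    style_prompt: str   # text that assigns instruments or genre

# One melody, three renderings, as in Google's "Ode to Joy" examples.
hummed_ode_to_joy = "ode_to_joy_hummed.wav"  # placeholder filename
requests = [
    MelodyRequest(hummed_ode_to_joy, style)
    for style in ("a capella chorus", "electronic synth lead", "jazz with saxophone")
]
```

The melody stays fixed while only the style prompt varies, which is why the tune sounds recognizable across wildly different outputs.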

How is MusicLM different from other models?

While music-making AI isn’t new, Google’s MusicLM appears to excel in a few areas where the other models are lacking:

  1. Output quality: Tracks from Google’s model have a 24 kHz sampling rate. That puts them somewhere between cassette tapes and CD-quality streaming music.

  2. Length: MusicLM can generate tracks of up to several minutes in duration without losing coherence.

  3. Input comprehension: MusicLM responds quite accurately even to lengthy prompts with multiple details (see the very first example above).
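For context on that 24 kHz figure, here's a quick back-of-the-envelope check (my own comparison, not from the paper): by the Nyquist theorem, audio sampled at rate R can only represent frequencies up to R/2.

```python
# Sampling rates and the highest frequency each can represent (Nyquist limit).
RATES_HZ = {
    "MusicLM output": 24_000,
    "CD / typical streaming": 44_100,
}

for name, rate in RATES_HZ.items():
    print(f"{name}: {rate} Hz -> frequencies up to {rate // 2} Hz")
```

So MusicLM tops out around 12 kHz, well short of CD audio's 22.05 kHz ceiling but good enough to sound respectable.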

Sadly, the most impactful difference is that—while you can test Riffusion and MuseNet for yourself—Google’s MusicLM isn’t available to the public.

Why can’t we have it?

“Because it’s Google, man!” is the gist of most responses.

“Google Man” by Midjourney.

It’s true that Google is generally cautious with its AI research and hesitant to release algorithms into the wild prematurely.

In the case of MusicLM, the model has a bunch of limitations:

  • It “misunderstands negations,” so asking it to avoid cowbell will probably just give you more cowbell. #lifehack

  • It’s not great at “temporal ordering,” so it might arrange music segments haphazardly rather than in the desired sequence.

  • Like other models, it produces gibberish when it tries to mimic speech.

But the two main reasons Google researchers don’t think MusicLM is ready for prime time are:

  1. Bias: MusicLM’s output will reflect any cultural biases present in the training data, overlooking underrepresented cultures.

  2. Creative content misappropriation: MusicLM may end up “stealing” specific sequences from copyright-protected material in its training data. The researchers stress that “approximate matches” show up in only about 1% of generated examples, and exact copying in just a tiny fraction of those. Still, you don’t want a model out there that accidentally inserts catchy “Gangnam Style” hooks into its music tracks.

As it stands, all we have to go on are the neat examples Google has shared. Let’s see how long it takes for us to have a comparable music model that’s publicly available.


Over to you…

Have you checked out any of the AI models mentioned above? Other ones? What’s been your experience?

If you have any cool music AI tools to share, I’m all ears. Send me an email or comment below this post.

See you next week!


Comments
stillhooman (Feb 3, liked by Daniel Nest):

Great post! The text prompts seemed like they certainly have a strong degree of "interpretation" (just like image AI) but I found the melody conditioning particularly fascinating. The examples had a wide array of outputs but the melody itself sounded pretty spot on with each one.

It's easy to imagine models like this being incorporated into gaming. Variables like health status or combat status etc. could easily be fed into the soundtrack on the fly to make things more tense for example.


© 2023 Daniel Gniazdo