Music and AI - Part 1

Last Updated: 19 Mar 2022

DallE-2 and ChatGPT have come and shown us how powerful AI can be when it comes to abstract and creative tasks. If AI can create images and sentences, why not music?

There are two ways in which artifical intelligence has been applied to music so far. The first is called music information retrieval, or MIR for short. These programs produce data from audio (e.g. song identification, transcribing audio into sheet music). The second is known as generative AI. These programs try to produce new and original music, either fully autonomously or as a predictive aid to composers (i.e. like text predits on your phone!).

A third application which is not unique to music is song recommendation and classification. This includes apps like Spotify or Pandora which create custom playlists for you, and is very similar to how Netflix recommends you new movies and shows.

Projects and Companies

Just a few practical resources to look at if you are interested in seeing who is currently working on this subject.

AI Music: Companies and Applications to Know
Magenta - an open source, Google-led project to explore creatives applications for ML.
BebopNet - an architecture that produces bebop style jazz solos.
deepJazz - a hackathon project that writes jazz for an ensemble.

Music Information Retrieval (MIR)

Chord recognition, song identification, etc. More information may follow, but this post primarily focuses on generative AI.

Generative AI

Stepping beyond recognition, generative AI tries to create new music. Algorithms in this category range from fully autonomous composers, to assistant bots which make recommendations.

For a comprehensive review of this topic, I recommend reading “Deep Learning Techniques for Music Generation – A Survey”, by Jean-Pierre Briot, Gaetan Hadjeres, and Francois-David Pache.

Symbolic vs Non-symbolic

Symbolic - harmony, chords progressions
Non-symbolic - sound, timbre

Algorithms working with symbolic music often needs rules, grammers and structured inputs, which reduces flexibility and increases overhead.

Markov Models

Markov models predict one step at a time. This makes them easy to implement and understand, but it also means they are unable to think about long-term structure. Think of this like writing a book, but each word is written by finding the most statistically probable word to follow from the one before. You might still get complete sentences, but highly unlikely to get a coherent theme.

Attributes

These are a few attributes of different generative AI’s. Keep in mind that some of these might be related, (i.e. they may not be orthogonal, or independent of one another).

Objective: What aspect of music are you creating? A melody? A chord progression? An accompaniment?
Representation: How are you structuring your musical data? E.g. as a signal, MIDI, text, piano roll.
Architecture: Recurrent NN, autoencoders, generative adversarial modeling, etc.
Challenge: What qualities are you looking for in what you create? Variability, creativity, interactivity.
Strategy: Strategy is how you’re going to convert the input into the output. Does it happen all at once? Note by note?

Objectives

Let’s think about melody for a moment. What’s the context for the melody we’re writing? Is it a Bach style counterpoint to an existing melody? Is it the baritone part in a SATB choir? Is it an improvised melody over an existing chord progression (like on a lead sheet?).

These are all valid ways to think about single-voice melodies, but are different enough that they may require unique approaches.

Representation

Representation is how we choose to format our music. Is it the raw waveform? Maybe a spectrum if we first run a fourier analysis. Or what about symbolically? If so, what about enharmony? Is an Eb and the D# the same thing?

Formats that deal with audio signals are often continuous, and fall into the realm of signal processing. In contrast, symbolic representation is discrete, and more often found in the field of knowledge representation. We’ll start by discussing different elements of music and how we might represent them, and then talk about existing representations.

Note Representations

Every note has a pitch, duration, and dynamic. Pitch can be expressed as a frequency, position on a score, or a letter (A through G). Duration can be measured in absolute terms (seconds), or in the context of written music (quarter note, half note, etc.). The same absolute vs relevant distinction can be made for the note’s dynamic, (i.e. decibels vs pianissimo/forte).

Chord Representations

Chords can be represented either as a collection of notes, or by a root and type (e.g. C7, Gmin, etc.). The latter is more useful for harmonic analysis, and for some applications, like a jazz musician’s lead sheet, could be more useful.

Rhythm Representations

In some sense, the rhythm is already captured by the duration and dynamic of each note. But we can also think of rhythm as the “feel” of a piece. Repeating patterns in the duration and dynamic of each note create a combined rhythm effect. This collective sense of rhythm is distinct from the detailed implementation of time.

Time Representations

There are two approaches to how we create music with respect to time (and if you think about it, there can only be two). You can either model and optimize the entire score all together, or you can step through one unit at a time. This unit can be a beat, or a note. In the latter case, there is no going back to change work you’ve already done. The nuance here seems small, but it has a very big impact on what the input and output to our AI is.

This type of modeling works if we choose to represent a relative duration (e.g. an 8th note, 16th note, etc.) or in absolute terms (in milliseconds). The latter is needed to capture the “artistic” interpretation that a human musician may add, but is not as useful if our goal is to create written sheet music.

Method 1 - MIDI

MIDI (Musical Instrument Digital Interface) is a standard to represent musical information. MIDI works especially well for capturing single note melodies. An example line of MIDI input:

[96, on, 0, 60, 90] [184, off, 0, 60, 0]

We can understand the above lines as saying:

at time 96 and on channel 0, start playing the pitch 60 at a volume of 90.
at time 184 and on channel 0, stop playing the pitch 60 at a velocity of 0.

When specifiying an off event, the very last number tells us how quickly to release a note.

To create a chord, or any two notes played simultaneously, we just introduce a new channel. This is sufficient for capturing audio information, but in the realm of data science, this lack of unique language for chords is problematic. Algorithms which learn using a MIDI format often fail to “learn” harmonies.

Method 2 - Piano Roll

Unlike MIDI, a piano roll does capture chords in a more intuitive way. There are only two dimensions: pitch, and time. A third dimension can be added to collect volume and intensity information. As with the MIDI format, the piano roll doesn’t distinguish between melody and harmony.

What About Held Notes?

One major downside of the piano roll is that (depending on how we encode time), it may be impossible distinguish a held note from a repeated note. For this, we’ll need to add a new word to our musical language. One way to do this is to introduce a new notation (something like a –), which represents a held note. Then when we need to encode this, we can treat the the hold symbol as just another “pitch”.

Metadata

What kind of metadata do we want to include? Key signature, meter (i.e. waltz, bossa nova, swing). Music annotating programs like musescore will often allow users to track these. Whether or not they are useful towards generation remains to be proven.

Encoding

Encoding is the step that connects the representation of music we’ve chosen, to the set of input notes or input variables for the nueral network. For example, consider the two definitions we have for chords:

Implicit (C, E, G, B)
Explicit (C major 7th)

An implicit representation is typically encoded with many-hot, while an explicit representation can be encoded with a multi-one-hot (i.e. the first one-hot is the root, the second one-hot is the quality).

And finally, a musical rest could be encoded like any other note - except in the case of MIDI which doesn’t need it.

Conclusion

And that wraps up the introduction! Hopefully this article leaves you with a top level overview of the subject. We started with the different aspects of music (melody, harmony, rhythm, etc.) and discussed how they connect to machine learning. Before continuing on to part 2, I recommend checking out the introduction to machine learning and machine learning posts.

Notes