Music and AI - Part 2

This post goes over a few different approaches to music generation with deep learning.

Part 1 - Single Step Feedforward Learning

These architectures do not ingest music one step at a time, but rather ingest a chunk all at once. While straightforward, a major limitation of this method is that it’s difficult to produce musical phrases of variable length. It’d be like deciding ahead of time how many words are allowed in each sentence.

1.1 - Introduction with Bach

For the first example, we’ll walk through a very simple and intuitive deep learning setup. Here are the parameters (a rough code sketch of this setup follows the list):

  1. Music Structure - 4 measures of SATB in 4/4 time
  2. ML Architecture - Feedforward neural network with 1 hidden layer
  3. Input - (One-hot encoding for pitch at each time step) x number of input voices
  4. Output - (One-hot encoding for pitch at each time step) x number of output voices
  5. Training Data - Bach chorales, with each voice extracted and transposed into all keys.
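Here’s that sketch. Everything concrete in it is an assumption layered on top of the list above: PyTorch as the framework, 16th-note quantization (64 steps over the 4 measures), a 46-pitch vocabulary per voice, and a hidden size of 256.

```python
# Sketch of the feedforward chorale setup described above.
# Assumed (not from the post): PyTorch, 16th-note quantization (64 steps over
# 4 measures of 4/4), a 46-pitch vocabulary per voice, and a hidden size of 256.
import torch
import torch.nn as nn

N_STEPS, N_PITCHES, N_VOICES, HIDDEN = 64, 46, 4, 256
IN_DIM = OUT_DIM = N_STEPS * N_PITCHES * N_VOICES   # one-hot pitch per step, per voice

class ChoraleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(IN_DIM, HIDDEN)       # the single hidden layer
        self.out = nn.Linear(HIDDEN, OUT_DIM)

    def forward(self, x):                             # x: (batch, IN_DIM) flattened one-hots
        h = torch.relu(self.hidden(x))
        logits = self.out(h)
        # Reshape so a softmax can pick one pitch per (voice, time step) slot.
        return logits.view(-1, N_VOICES, N_STEPS, N_PITCHES)

model = ChoraleNet()
x = torch.zeros(1, IN_DIM)                            # dummy input chunk
pitch_probs = model(x).softmax(dim=-1)                # distribution over pitches at each slot
print(pitch_probs.shape)                              # torch.Size([1, 4, 64, 46])
```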

Even before we consider the musical merit of any generated output, we can see that this method has intrinsic limitations. First, it is deterministic: for each unique input there will only ever be one output. Maybe that's fine for a task like this, but what if our input were a chord progression and our desired output a melody? Many different melodies are possible over the same progression. Here we run into our first challenge - how can we generate a wide variety of content from limited inputs? (creatio ex nihilo and content variability)

The next two examples are ways around this issue.

1.2 - Encoder/Decoder

The basic idea here is to first train an autoencoder, which, if you remember, learns the important “features” in its hidden layer. To generate unique outputs, we simply turn this around: the hidden layer becomes the input (i.e. a seed). The seed is fed forward through the decoder, and we get a unique output. Of course this is still deterministic, but it overcomes the creatio ex nihilo issue. (Technically we’re not starting from true nothing, but then what does? Most randomly generated content, like a Minecraft world, starts from a seed anyway.)
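As a rough illustration, here’s what seeding a decoder could look like. The decoder below is just a stand-in for the decoder half of a trained autoencoder; the layer sizes and the 64 x 88 piano-roll shape are my assumptions.

```python
# Sketch of seeding a trained decoder to generate new material.
# The decoder below is a stand-in for the decoder half of a trained autoencoder;
# the layer sizes and the 64 x 88 piano-roll shape are assumptions.
import torch
import torch.nn as nn

LATENT, STEPS, PITCHES = 16, 64, 88

decoder = nn.Sequential(                      # would be loaded from the trained model
    nn.Linear(LATENT, 256), nn.ReLU(),
    nn.Linear(256, STEPS * PITCHES), nn.Sigmoid(),
)

seed = torch.randn(1, LATENT)                 # the "seed": a random point in latent space
piano_roll = decoder(seed).view(STEPS, PITCHES) > 0.5   # threshold into note on/off
```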

The DeepHear system by Sun uses this approach to generate ragtime music. It uses Scott Joplin songs as the training data and a stacked autoencoder with a hidden layer of size 16. Due to the small size of this hidden layer, there is a high amount of plagiarism (around 60%).

Part 2 - Sequential Models

These architectures treat the data sequentially. One advantage of structuring a model this way is that you can produce music of variable length. Most are implemented with some kind of RNN.

2.1 - A Basic RNN with LSTM

An example of this sequential approach was implemented by Eck and Schmidhuber in their 12-bar blues program. The input and output layers both have 25 nodes (13 for melody notes and 12 for chords), with a hidden layer of eight LSTM blocks. The hidden layer connections are as follows:

  1. The chord blocks are fully connected to everything in the input and to the chord output.
  2. The melody blocks are fully connected to everything in the input and to the melody output.
  3. The chord blocks have recurrent connections to themselves and to the melody blocks.
  4. The melody blocks have recurrent connections only to themselves.

This asymmetric architecture controls the dependency between chords and melody: the melody is conditioned on the chords, but the chords are not influenced by the melody.
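A sketch of that wiring, under my own assumptions (PyTorch LSTM cells standing in for the original LSTM blocks, one cell per block group, and a dummy 16-step input):

```python
# Sketch of the asymmetric chord/melody wiring, with PyTorch LSTM cells standing
# in for the original LSTM blocks (one cell per block group; node counts from the
# text, everything else is an assumption).
import torch
import torch.nn as nn

N_NOTES, N_CHORDS, H = 13, 12, 8              # 13 melody notes + 12 chord classes

class BluesNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.chord_cell = nn.LSTMCell(N_NOTES + N_CHORDS, H)       # sees the full input
        self.melody_cell = nn.LSTMCell(N_NOTES + N_CHORDS + H, H)  # full input + chord state
        self.chord_out = nn.Linear(H, N_CHORDS)
        self.melody_out = nn.Linear(H, N_NOTES)

    def forward(self, x_seq):                 # x_seq: (steps, batch, N_NOTES + N_CHORDS)
        B = x_seq.size(1)
        hc, cc = torch.zeros(B, H), torch.zeros(B, H)
        hm, cm = torch.zeros(B, H), torch.zeros(B, H)
        chords, melody = [], []
        for x in x_seq:                                            # one time step at a time
            hc, cc = self.chord_cell(x, (hc, cc))                  # chords ignore melody state
            hm, cm = self.melody_cell(torch.cat([x, hc], dim=1), (hm, cm))
            chords.append(self.chord_out(hc))
            melody.append(self.melody_out(hm))
        return torch.stack(chords), torch.stack(melody)

chord_logits, melody_logits = BluesNet()(torch.zeros(16, 1, N_NOTES + N_CHORDS))
```

Note how the melody cell receives the chord cell's hidden state, while the chord cell never sees the melody's - that's the asymmetry described above.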

2.2 - Bi-Directional LSTM for Accompaniment

A Bi-directional RNN uses data from both before and after the target timestep to make predictions. This makes sense when generating music, since motifs or themes are often repeated.

One such way to use a BLSTM is to generate a progression of chords as an accompaniment to a melody.
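A minimal sketch of that idea, with made-up vocabulary sizes (48 melody pitches, 24 chord classes) and hidden size:

```python
# Minimal BLSTM accompaniment sketch: predict a chord label at every time step
# from melody context on both sides. Vocabulary and hidden sizes are made up.
import torch
import torch.nn as nn

N_PITCH, N_CHORD, H = 48, 24, 64

class ChordAccompanist(nn.Module):
    def __init__(self):
        super().__init__()
        self.blstm = nn.LSTM(N_PITCH, H, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * H, N_CHORD)   # concatenated forward + backward states

    def forward(self, melody):                 # melody: (batch, steps, N_PITCH) one-hots
        h, _ = self.blstm(melody)
        return self.out(h)                     # chord logits at every step

chord_logits = ChordAccompanist()(torch.zeros(1, 32, N_PITCH))   # a 32-step melody
```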

2.3 - RNN/RBM Polyphony Generation

OK, we’ve introduced a few architectures, but we haven’t really factored in anything about music yet. Music theory offers many ideas through which we can understand the structure and shape of songs, so why not incorporate them into our architecture?

To begin, let’s start with the basic differentiation of melody and harmony. From a music theory perspective, there are two dimensions to a song: the vertical and the horizontal. Vertical is the pairing of melody with harmony, while horizontal is the change from one time step to the next.

The idea of a coupled RNN-RBM is to cover both of these aspects: the RBM models the vertical dimension (which notes sound together at each time step), while the RNN models the horizontal dimension (how the music evolves from one step to the next).
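Here’s a conceptual sketch of the coupling. It only shows generation (training an RBM requires contrastive divergence, which is omitted), and all of the sizes are assumptions:

```python
# Conceptual sketch of the RNN-RBM coupling: an RBM picks which notes sound
# together (vertical), while an RNN updates the RBM's biases over time
# (horizontal). Generation only; RBM training is omitted and sizes are assumed.
import torch
import torch.nn as nn

N_NOTES, H_RBM, H_RNN = 88, 150, 100

class RNNRBM(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(N_NOTES, H_RNN)                 # horizontal: time dependencies
        self.W = nn.Parameter(torch.randn(N_NOTES, H_RBM) * 0.01)  # vertical: RBM weights
        self.to_bv = nn.Linear(H_RNN, N_NOTES)                # RNN state -> visible biases
        self.to_bh = nn.Linear(H_RNN, H_RBM)                  # RNN state -> hidden biases

    def sample_step(self, v, h_rnn, gibbs=15):
        bv, bh = self.to_bv(h_rnn), self.to_bh(h_rnn)         # time-dependent RBM biases
        for _ in range(gibbs):                                # Gibbs sampling in the RBM
            h = torch.bernoulli(torch.sigmoid(v @ self.W + bh))
            v = torch.bernoulli(torch.sigmoid(h @ self.W.t() + bv))
        return v, self.rnn(v, h_rnn)                          # advance the horizontal model

model = RNNRBM()
v, h = torch.zeros(1, N_NOTES), torch.zeros(1, H_RNN)
with torch.no_grad():
    for _ in range(8):                                        # generate 8 time steps
        v, h = model.sample_step(v, h)
```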

This strategy designs an architecture based on an understanding of music theory (i.e. that melody and harmony support each other). In this way, it differs from some of the other strategies that have been outlined, which begin from other perspectives.

Question for you: if music has two dimensions, that makes it just like an image, right? Why can't we just apply existing image-based AI algorithms to a piano roll? The surprisingly intuitive answer is that in music, unlike in images, the vertical and horizontal dimensions are not equivalent. In a picture, both dimensions are lengths, but in a piano roll, one dimension is pitch and the other is time. That being said, Convolutional Neural Networks - which have been very effective for image processing - are not useless for music generation... we just have to find creative ways to apply them.
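As a tiny illustration of what "creative" could mean, a convolutional kernel doesn't have to be square: it can span an octave in pitch while covering only a single step in time (the shapes here are purely illustrative):

```python
# An asymmetric convolution: the kernel spans an octave in pitch but only a
# single step in time (shapes are purely illustrative).
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 16, kernel_size=(12, 1))   # (pitch, time) kernel: 12 semitones x 1 step
piano_roll = torch.zeros(1, 1, 88, 64)         # (batch, channel, pitches, time steps)
features = conv(piano_roll)                    # -> (1, 16, 77, 64)
```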

2.4 - RNN/RNN Polyphony Generation

This is nearly identical to the RNN/RBM architecture in the previous example, but the vertical dimension is modeled by another RNN instead of an RBM.

For an example of this implementation, see the Bi-Axial LSTM architecture proposed by Daniel Johnson in “Generating Polyphonic Music Using Tied Parallel Networks” (2017).
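A rough sketch of the two-axis idea - one recurrence along time, another along pitch - heavily simplified from Johnson's actual architecture, with made-up sizes:

```python
# Two-axis sketch: one LSTM runs along time (independently per pitch), a second
# runs along the pitch axis at each step. Heavily simplified; sizes are assumed.
import torch
import torch.nn as nn

N_PITCHES, N_STEPS, H_TIME, H_PITCH = 88, 16, 64, 32

time_lstm = nn.LSTM(1, H_TIME, batch_first=True)        # horizontal: per-pitch time model
pitch_lstm = nn.LSTM(H_TIME, H_PITCH, batch_first=True) # vertical: scan across pitches
note_out = nn.Linear(H_PITCH, 1)

roll = torch.zeros(1, N_STEPS, N_PITCHES)               # (batch, steps, pitches) piano roll
per_pitch = roll.permute(2, 1, 0).reshape(N_PITCHES, N_STEPS, 1)  # pitches become the batch
t_feats, _ = time_lstm(per_pitch)                       # (pitches, steps, H_TIME)
p_feats, _ = pitch_lstm(t_feats.permute(1, 0, 2))       # (steps, pitches, H_PITCH)
note_logits = note_out(p_feats)                         # note-on logit per (step, pitch)
```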

Part 3 - Hybrid Approaches

Although RNNs are great for sequential data, music actually tends to have a bit more structure than language. Songs are not one monolithic piece from start to finish, but rather are organized into sections, often with recurring motifs (e.g. AABA, sonata form, etc.).

This kind of repeating structure is actually really well captured by things like convolutional neural networks or encoders. So can we get the best of both worlds by using both a sequential and single-step architecture?

3.1 - Variational Recurrent Autoencoder

Often, structure is intentionally withheld from music-writing AIs because it’s very difficult to learn high-level structure simultaneously with low-level harmony and melodic style.

MusicVAE is a variational recurrent autoencoder with a 2-level hierarchical RNN within the decoder. The introduction of this additional level improves the algorithm’s ability to learn structure, allowing it to generate longer sequences of music with better overall cohesion.

One level of the hierarchy deals with low-level content, like the notes within a musical phrase, while the higher level deals with musical structure across phrases. (Technically, there is nothing that forces this association, but the hope is that by designing the architecture this way, the network will learn this division of labor when it is trained.)
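A minimal sketch of such a 2-level decoder, loosely in the spirit of MusicVAE. The GRUs, the sizes, and the way the latent code and phrase embeddings are fed in are all simplifying assumptions (the real model is an LSTM-based VAE with an autoregressive low-level decoder):

```python
# Sketch of a 2-level hierarchical decoder: a "conductor" RNN emits one embedding
# per phrase, and a lower-level RNN decodes the notes of each phrase from it.
# GRUs, sizes, and the way z is fed in are all simplifying assumptions.
import torch
import torch.nn as nn

LATENT, COND_H, DEC_H, N_TOKENS = 32, 64, 64, 90
N_PHRASES, STEPS_PER_PHRASE = 4, 16

conductor = nn.GRU(LATENT, COND_H, batch_first=True)    # higher level: one state per phrase
decoder = nn.GRU(COND_H, DEC_H, batch_first=True)       # lower level: notes within a phrase
note_out = nn.Linear(DEC_H, N_TOKENS)

z = torch.randn(1, LATENT)                              # latent code (encoder omitted)
phrase_embs, _ = conductor(z.unsqueeze(1).repeat(1, N_PHRASES, 1))  # (1, 4, COND_H)

phrases = []
for p in range(N_PHRASES):
    # Decode each phrase from its own conductor embedding, repeated per step.
    phrase_in = phrase_embs[:, p:p + 1, :].repeat(1, STEPS_PER_PHRASE, 1)
    h, _ = decoder(phrase_in)
    phrases.append(note_out(h).argmax(dim=-1))          # (1, 16) note tokens per phrase
sequence = torch.cat(phrases, dim=1)                    # (1, 64) full generated sequence
```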

Part 4 - Miscellaneous

4.1 - Style Transfer

Style transfer is an active field of image processing, but it hasn’t been explored very much in music. Musicians do this all the time in real life (e.g. rearranging a pop song into a waltz, copying a jazz pianist’s style, or remixing a song into lo-fi). Conceptually, we can think about two types of style transfer:

  1. Composition structure (e.g. time signature, song structure)
  2. Performance style (e.g. different styles of harmony and embellishment - think of how different jazz pianists can approach the same standard in dramatically different ways)

Conclusion

And that’s it! A quick overview of different architectures that researchers have used to tackle music generation. In the next post, we’ll take a look at specific challenges and more practical considerations for deploying a production model.
