Music and AI - Part 3
Last Updated: 19 Nov 2022

In the previous post, we looked at how different architectures have been applied to music generation. But that is a bit like putting the cart before the horse, since we haven’t discussed in detail the kind of attributes we’re looking for. This post goes over some high-level considerations - things that could serve as requirements for a project. Depending on your specific needs and desires, you may end up selecting a different architecture.
Part 1 - Content Variety
In music, there is no single “best” or most “optimal” solution. This means that we actually do want some way to add randomness into our process.
1.1 - Sampling
Sampling methods work a little like the feedforward decoder, but instead of generating examples that are deterministic given a seed, examples are randomly generated to match a probability distribution. Some sampling strategies include the Metropolis-Hastings algorithm, Gibbs sampling, and block Gibbs sampling.
Another way to think about sampling methods is that they are a form of generative AI based on stochastic models which learn probability distributions (variational autoencoders, RBMs, etc.). For this reason, we talk about them as a different type of solution, but conceptually, they can manifest just like the decoders we’ve already discussed (i.e. a seed is used to randomly generate examples).
With regard to music, there are two ways to apply this type of sampling approach. The first is in the vertical dimension (i.e. a chord - does the voicing of this chord make sense?). Restricted Boltzmann machines (RBMs) are effective for this kind of work. The second is in the horizontal dimension (i.e. a melody or sequence of notes - does this sequence of notes make a coherent melody?). This kind of sampling is better achieved by an RNN.
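To make the vertical case concrete, here is a toy sketch (my own, not from any of the systems above) of Gibbs sampling a four-voice chord: each voice is resampled in turn from a conditional distribution given the other voices. The hand-written consonance scoring is a stand-in for what a trained model like an RBM would learn.

```python
import numpy as np

rng = np.random.default_rng(42)
PITCHES = np.arange(48, 72)  # candidate MIDI pitches (C3..B4)

def conditional_probs(voice_idx, chord):
    """Hypothetical conditional p(pitch | other voices). We simply favor
    consonant intervals (unisons/octaves, 3rds, 5ths, 6ths) against the
    other voices; a real model would learn this distribution."""
    consonant = {0, 3, 4, 7, 8, 9}
    scores = np.ones(len(PITCHES))
    for j, other in enumerate(chord):
        if j == voice_idx:
            continue
        for k, p in enumerate(PITCHES):
            if abs(p - other) % 12 in consonant:
                scores[k] += 1.0
    return scores / scores.sum()

def gibbs_sample_chord(n_voices=4, sweeps=100):
    chord = rng.choice(PITCHES, size=n_voices)  # random starting voicing
    for _ in range(sweeps):
        for v in range(n_voices):  # resample one voice given the rest
            chord[v] = rng.choice(PITCHES, p=conditional_probs(v, chord))
    return sorted(chord)

print(gibbs_sample_chord())
```

After enough sweeps, the chord settles into a consonant voicing regardless of the random start - the essence of sampling to match a distribution rather than computing one deterministic answer.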
1.2 - Sampling Control through Interpolation
So we mentioned earlier that decoders only allow information to flow one way, but there’s a way around this: interpolation in the latent space.
The latent space is the space formed by the latent variables (e.g. if our seed was (8,4,5,2), there are 4 latent variables). We could perform clustering in the latent space to identify patterns in the seeds that produce our training data.
In this sense, we aren’t really feeding information back through the decoder, but rather applying even more machine learning to the seeds to understand the significance of each digit. In other words, given a set of coordinates, we’re learning whether they belong to a Cartesian, polar, or spherical system.
Hadjeres and Nielsen discuss something called geodesic latent space regularization, which is supposed to be better than linear interpolation.
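As a minimal illustration (my own sketch, not the geodesic regularization itself), here is what interpolating between two latent seeds might look like. Each intermediate point would be passed through the trained decoder to produce a piece “between” the two originals; spherical interpolation is sometimes preferred over linear in high-dimensional latent spaces.

```python
import numpy as np

def lerp(z1, z2, steps=8):
    """Linear interpolation between two latent seeds."""
    return [(1 - t) * z1 + t * z2 for t in np.linspace(0.0, 1.0, steps)]

def slerp(z1, z2, steps=8):
    """Spherical interpolation: keeps intermediate points at a plausible
    distance from the origin, which linear interpolation does not."""
    omega = np.arccos(np.clip(
        np.dot(z1 / np.linalg.norm(z1), z2 / np.linalg.norm(z2)), -1.0, 1.0))
    return [(np.sin((1 - t) * omega) * z1 + np.sin(t * omega) * z2)
            / np.sin(omega) for t in np.linspace(0.0, 1.0, steps)]

z_a = np.array([8.0, 4.0, 5.0, 2.0])  # the example seed from above
z_b = np.array([1.0, 7.0, 2.0, 6.0])  # a second, arbitrary seed
for z in lerp(z_a, z_b, steps=4):
    print(z)  # in practice: decoder(z) -> an intermediate musical fragment
```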
1.3 - Training for Originality
For many skills, copying is easier than creating, and this is certainly true for generative AI as well. How can we bias our architecture to create original content, and not just poor copies? Importantly, we would like this to happen a priori (built into the model itself), and not a posteriori (filtering out copies after the fact).
The first approach involves a sort of ad hoc tuning of parameters within the architecture until we get results that mostly seem original. This can certainly be effective, but maybe lacks some of the intentionality we’re going for.
Another approach is to modify the traditional generative adversarial network. In a traditional GAN, the classifier returns a measure of how easily/confidently it was able to classify the generated content as real or fake. The new approach, outlined by Elgammal and called a creative adversarial network (CAN), balances two contradictory forces:
- The same measure as in a traditional GAN: how easily the network can distinguish real input from fake.
- The ease with which it can classify the input into a learned style.
Thus, the generative part of the network is incentivized to create outputs that feel real, but are also difficult to classify (hence, creative!). Elgammal’s work is primarily applicable to art, and has not yet been applied to music.
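To show the spirit of this (a simplified sketch - Elgammal et al.’s actual objective differs in its details), the generator’s loss might combine a standard “look real” term with a style-ambiguity term that pushes the style classifier toward a uniform distribution:

```python
import numpy as np

def can_generator_loss(p_real, style_probs):
    """Two opposing terms for the generator.
    p_real      : discriminator's probability the sample is real
    style_probs : discriminator's distribution over K learned styles"""
    # Term 1: look real (the standard GAN generator objective).
    realness_loss = -np.log(p_real + 1e-9)
    # Term 2: be hard to pin to any single style - cross-entropy
    # against the uniform distribution rewards maximum ambiguity.
    k = len(style_probs)
    style_ambiguity_loss = -np.sum(np.full(k, 1.0 / k)
                                   * np.log(style_probs + 1e-9))
    return realness_loss + style_ambiguity_loss

# A convincing sample that is confidently "one style" scores worse
# than an equally convincing sample that confuses the style classifier.
print(can_generator_loss(0.9, np.array([0.95, 0.03, 0.02])))  # ~2.60
print(can_generator_loss(0.9, np.array([0.34, 0.33, 0.33])))  # ~1.20
```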
Part 2 - More Complex Representations
Thus far we’ve talked pretty simply about music as basically a collection of notes through time, but there are plenty of other details we can think about.
2.1 - Richer Chord Encodings
We introduced sampling as a means to create something from nothing, but sampling is also useful because it is non-deterministic (i.e. instead of always choosing the most probable solution, we can randomly choose a solution based on the probability distribution).
An early application of sampling is the CONCERT Bach melody generation system by Mozer (1994). The program generates melodies with a chord accompaniment, where the chords are just another aspect of each note, alongside pitch and duration. We’ll discuss each of these in order, starting with pitch.
- The pitch of each note has three components: the pitch height, the (modulo) chroma circle Cartesian coordinates, and the (harmonic) circle of fifths Cartesian coordinates. The latter two capture the similarity of the octave and the harmonic importance of the fifth, respectively (see the sketch after this list).
- The duration is subdivided into twelfths of a beat, to accommodate both regular divisions and triplets. As with pitch, the duration has three coordinates: the absolute duration, the 1/3-beat circle, and the 1/4-beat circle.
- The chords are always structured the same way: a root, a third that can be major or minor, a fifth that can be perfect, augmented, or diminished, and an optional seventh that can be major or minor.
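Here is a rough reconstruction of the pitch part of that representation (my own reading of it - the exact scaling is an assumption, not Mozer’s numbers). Octave-equivalent pitches share the chroma-circle coordinates, and harmonically related pitches sit near each other on the circle of fifths:

```python
import numpy as np

def concert_style_pitch(midi_pitch):
    """Encode a pitch as a height plus two points on circles."""
    chroma = midi_pitch % 12  # position within the octave
    # Chroma circle: adjacent semitones are neighbors, so pitches an
    # octave apart map to the exact same point.
    chroma_angle = 2 * np.pi * chroma / 12
    # Circle of fifths: stepping by 7 semitones moves one position, so
    # C and G (a fifth apart) land next to each other.
    fifths_angle = 2 * np.pi * ((chroma * 7) % 12) / 12
    return np.array([
        midi_pitch / 127.0,  # normalized pitch height
        np.cos(chroma_angle), np.sin(chroma_angle),
        np.cos(fifths_angle), np.sin(fifths_angle),
    ])

print(concert_style_pitch(60))  # middle C
print(concert_style_pitch(72))  # C an octave up: same circle coordinates
```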
One of the reasons why the CONCERT program uses such a complex representation is that it pre-dates modern deep learning algorithms. Instead of allowing an architecture to learn these relationships, the authors manually designed them into the representation.
2.2 - Dynamics
Music played by a human rarely ends up sounding exactly like it is written on the page. Musicians adjust dynamics and take liberties with tempo in an actual performance, in ways that are not captured by sheet music. These decisions are often made live, based on the musician’s understanding of the music and personal taste.
One way to capture this information is to work directly with the audio waveforms rather than symbolic representation. Of course, the obvious downside to this strategy is that we lose the symbolic representation. A second alternative is to add this information to our training data as another aspect of the input and output.
In their program, called Performance RNN, Simon and Oore capture MIDI performance data and introduce an additional time shift and dynamic value at each time step. Time shifts range from 10 ms to 1 sec, and dynamics are captured by quantizing the 128 possible MIDI velocities into 32 bins.
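A rough sketch of what that quantization might look like in code (the uniform bin widths and the 10 ms grid spacing are my assumptions):

```python
import numpy as np

N_VELOCITY_BINS = 32                        # 128 MIDI velocities -> 32 bins
TIME_SHIFTS = np.linspace(0.01, 1.0, 100)   # 10 ms .. 1 s, in 10 ms steps

def quantize_velocity(velocity):
    """Map a MIDI velocity (0-127) to one of 32 dynamic bins."""
    return min(velocity // (128 // N_VELOCITY_BINS), N_VELOCITY_BINS - 1)

def quantize_time_shift(seconds):
    """Snap an expressive timing offset to the nearest grid value."""
    return TIME_SHIFTS[np.argmin(np.abs(TIME_SHIFTS - seconds))]

print(quantize_velocity(100))       # -> bin 25
print(quantize_time_shift(0.137))   # -> ~0.14
```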
Part 3 - Other Desirable Qualities
3.1 - Control
Stepping back for a moment, let’s think about how we could apply generative AI. The first thing to realize is that most tools are designed to help a real person, and lack direction by themselves. Even the most advanced AIs - like those that drive your car - still need a destination. So the primary question is: how can we exert control over the result?
One way we’ve already discussed is through sampling. Like the predictive speech capability on your phone, non-deterministic algorithms are important because they provide options. But a big problem with sampling methods is that decoders only work in one direction. There’s no easy way to take the output and feed it back to learn a “better” seed. Luckily, we are not without options.
For RNNs, one option we have is to threshold out low-probability options (remember that notes are often predicted with a softmax function). This prevents a bad note from propagating into future time steps.
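A minimal sketch of that idea (the threshold value and note count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_with_threshold(probs, threshold=0.05):
    """Zero out low-probability notes, renormalize, then sample. This
    keeps an unlikely 'bad' note from being chosen once and then
    conditioning every later time step."""
    probs = np.where(probs < threshold, 0.0, probs)
    return rng.choice(len(probs), p=probs / probs.sum())

softmax_out = np.array([0.50, 0.30, 0.12, 0.05, 0.02, 0.01])
print(sample_with_threshold(softmax_out))  # notes 4 and 5 can never win
```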
For single-step feedforward networks, we can reduce the sampling scope. Instead of generating large chunks of music, we generate shorter phrases, or fewer voices within a larger, constrained piece.
3.1.1 - Conditioning
Conditioning an architecture means to bias its output based on some extra information (typically applied as an additional input layer). For example, this could be a genre label, a chord progression, or maybe even the previously generated note.
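In code, the simplest form of conditioning is just concatenation - a hypothetical sketch:

```python
import numpy as np

def conditioned_input(note_features, genre_id, n_genres=4):
    """Append a one-hot genre label to each input vector so the
    network can learn genre-specific behavior from the same data."""
    genre_onehot = np.zeros(n_genres)
    genre_onehot[genre_id] = 1.0
    return np.concatenate([note_features, genre_onehot])

x = np.array([0.47, 0.0, 1.0, 0.5, -0.87])  # some note encoding
print(conditioned_input(x, genre_id=2))     # the same note, tagged "genre 2"
```

At generation time, flipping the label steers the output toward a different genre without retraining.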
3.1.2 - Unit Selection
At its core, unit selection is an iterative feedforward network, but instead of writing music note by note, it selects measures from a database which, when concatenated together, form a coherent musical idea. This approach needs metadata to represent the musical quality of each measure, but unfortunately, music theory doesn’t really think of music in terms of measures. This means we as humans lack the language to describe what these pieces of metadata might be!
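Still, a toy version makes the mechanics clear. Here, hand-picked feature vectors stand in for whatever per-measure metadata a real system would use, and simple distance stands in for a learned continuation score:

```python
import numpy as np

# Hypothetical database: one feature vector per measure, summarizing
# its musical content (contour, rhythmic density, harmony, ...).
DATABASE = {
    "measure_a": np.array([0.9, 0.1, 0.4]),
    "measure_b": np.array([0.8, 0.2, 0.5]),
    "measure_c": np.array([0.1, 0.9, 0.7]),
}

def select_next_unit(current_features, database):
    """Pick the stored measure whose features best continue the current
    one. A real system would also score how the join itself sounds."""
    best_name, best_dist = None, np.inf
    for name, feats in database.items():
        dist = np.linalg.norm(feats - current_features)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name

print(select_next_unit(np.array([0.82, 0.18, 0.48]), DATABASE))  # measure_b
```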
3.2 - Adaptable Networks
Another common problem with networks is that they cannot learn after they’ve been trained. How do you reinforce good patterns, and discourage bad ones? One option is to feed good outputs back into the training pool, but this risks over-fitting, and there is no practical way to punish bad behavior.
In practice, this kind of objective mostly has to be achieved by looking more closely at the generation step, and taking advantage of control strategies (like reinforcement, conditioning, constraints, etc.). Over time, we can “learn” values for these parameters that improve the output.
3.3 - Explainable Networks
The challenge of explaining how AI “thinks” is not unique to music. In general, deep learning architectures are a bit of a black box, but a few basic approaches have been tried to “understand” what patterns each node, or collection of nodes is responding to.
This task gets a lot easier with simple inputs. For instance, in a low-rank matrix, we can look at the vectors which span the matrix space and understand that each vector captures one essential piece of information. But then again, if patterns were easy to spot, we wouldn’t need such complex deep learning architectures to begin with. The basic approaches include:
- Gradient sensitivity - seeing how changes to the inputs affect the result (a sketch follows this list)
- Signal isolation - attempting to isolate particular input patterns to trigger activation in high level neurons
- Attribution - looking at a specific location in the output, and seeing the contributions from each input dimension.
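As a small illustration of the first idea (a finite-difference stand-in for real backpropagated gradients, with a made-up model):

```python
import numpy as np

def model(x):
    """Stand-in for a trained network: any function of the inputs."""
    w = np.array([0.2, -1.5, 0.7, 0.05])
    return np.tanh(x @ w)

def gradient_sensitivity(f, x, eps=1e-5):
    """Approximate df/dx_i by central finite differences. Large values
    mean the output is sensitive to that input dimension."""
    grads = np.zeros_like(x)
    for i in range(len(x)):
        bump = np.zeros_like(x)
        bump[i] = eps
        grads[i] = (f(x + bump) - f(x - bump)) / (2 * eps)
    return grads

x = np.array([0.3, 0.1, -0.4, 0.9])
print(gradient_sensitivity(model, x))  # input 1 dominates (weight -1.5)
```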
3.4 - Refinement over Time
In the real world, your first draft is rarely your final draft. You’ll go back through and make edits, taking into account your improved understanding of the big picture.
When applied to music generation and machine learning, this changes the input from just a seed or bare frame to an existing draft of the piece that the network revises - a more constrained problem.
DeepBach is one such example, combining two LSTMs with two feedforward networks. The two LSTMs approach the music from different sides - one reading from the past and one from the future. A third network handles the harmonic information (the other notes sounding at the same time), and the outputs of all three are merged by a final network to produce the result.
Incremental networks like this are great candidates for providing a user interface, and hence, control over the result!
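To close, here is a toy version of that kind of iterative refinement loop (my own sketch - DeepBach’s real local model is its trained networks, not the hand-rolled smoothness score used here). A random first draft is repeatedly revised one position at a time, each revision conditioned on the notes around it:

```python
import numpy as np

rng = np.random.default_rng(7)
SCALE = np.array([60, 62, 64, 65, 67, 69, 71, 72])  # C major pitches

def resample_note(melody, i):
    """Propose a note for position i given its neighbors. Here we just
    favor small melodic steps; DeepBach would consult its past/future
    LSTMs and harmonic network instead."""
    left = melody[i - 1] if i > 0 else melody[i]
    right = melody[i + 1] if i < len(melody) - 1 else melody[i]
    scores = np.array([1.0 / (1 + abs(p - left) + abs(p - right))
                       for p in SCALE])
    return rng.choice(SCALE, p=scores / scores.sum())

melody = list(rng.choice(SCALE, size=8))  # random first draft
for _ in range(200):                      # refinement passes
    i = rng.integers(len(melody))         # pick a spot to revise
    melody[i] = resample_note(melody, i)  # rewrite it in context
print(melody)
```

Each pass smooths the line a little more, just as a human editor improves a draft with each read-through.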