Mel Spectrograms Explained Easily

Valerio Velardo - The Sound of AI

Overview

This video explains Mel spectrograms, a type of audio representation crucial for AI in audio and music. It contrasts Mel spectrograms with standard spectrograms by highlighting how human pitch perception is non-linear. Standard spectrograms use a linear frequency scale (Hertz), which doesn't align with how humans perceive pitch differences, especially at higher frequencies. Mel spectrograms address this by using the Mel scale, a perceptually motivated logarithmic scale, making them more suitable for machine learning models that aim to mimic human auditory processing. The video details the process of converting a standard spectrogram to a Mel spectrogram, including understanding Mel filter banks and their mathematical representation.

Chapters

Standard spectrograms visualize audio frequency content over time using Hertz on the y-axis.
Human perception of pitch is non-linear; we are more sensitive to frequency changes at lower frequencies.
A fixed frequency difference in Hertz (roughly 200 Hz) sounds like a much larger pitch difference at low frequencies (C2 to C4) than at high frequencies (G6 to A6).
Standard spectrograms, using a linear Hertz scale, do not accurately reflect this perceptual difference in pitch.

Understanding the limitations of standard spectrograms is essential because it reveals why a different representation is needed for tasks involving human perception of sound, such as music analysis.

Comparing the perceived pitch difference between C2 and C4 (roughly 200 Hz apart) with that between G6 and A6 (also roughly 200 Hz apart), where the latter pair sounds much closer in pitch.
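
The comparison above can be checked numerically. The sketch below assumes twelve-tone equal temperament with A4 = 440 Hz and uses MIDI note numbers (C2 = 36, C4 = 60, G6 = 91, A6 = 93) to derive the frequencies; the helper name `midi_to_hz` is chosen here for illustration and does not come from the video.

```python
# Note frequencies under equal temperament, assuming A4 (MIDI note 69) = 440 Hz.
def midi_to_hz(midi_note):
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

# (name, MIDI number) pairs for the two intervals discussed above.
PAIRS = [(("C2", 36), ("C4", 60)), (("G6", 91), ("A6", 93))]

for (low_name, low_midi), (high_name, high_midi) in PAIRS:
    diff_hz = midi_to_hz(high_midi) - midi_to_hz(low_midi)
    semitones = high_midi - low_midi
    print(f"{low_name} -> {high_name}: {diff_hz:.1f} Hz apart, {semitones} semitones")
```

Both pairs are a little under 200 Hz apart, yet the first spans 24 semitones (two octaves) and the second only 2 semitones (a whole tone).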

An ideal audio feature should represent time-frequency information and have perceptually relevant amplitude and frequency representations.
The Mel scale is a perceptually motivated scale for pitch, designed to align with human hearing.
The Mel scale is roughly logarithmic in Hertz, and equal distances on the Mel scale correspond to equal perceived pitch differences.
The Mel scale is standardized such that 1000 Hz corresponds to 1000 Mels, and it's derived empirically from psychoacoustic experiments.
The term 'Mel' is an abbreviation for 'melody', reflecting its connection to pitch perception.

The Mel scale provides a frequency representation that better matches human auditory perception, making it a more effective basis for audio features used in machine learning.

The example of 500 Mels to 510 Mels having the same perceived pitch difference as 1000 Mels to 1010 Mels, illustrating the equal perceptual spacing on the Mel scale.
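
The mapping between Hertz and Mels can be sketched with one commonly used formula, m = 2595 * log10(1 + f / 700). This is an assumption about which variant is meant; other formulations of the Mel scale exist, and libraries may default to a different one. The function names below are illustrative.

```python
import math

def hz_to_mel(f_hz):
    """Convert Hertz to Mels using the 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 1000 Hz lands very close to 1000 Mels under this formula.
print(f"1000 Hz = {hz_to_mel(1000.0):.1f} Mels")

# The same 10-Mel step covers a wider range in Hertz higher up the scale.
for mel_low in (500.0, 1000.0):
    width_hz = mel_to_hz(mel_low + 10.0) - mel_to_hz(mel_low)
    print(f"{mel_low:.0f} -> {mel_low + 10:.0f} Mels spans {width_hz:.2f} Hz")
```

The two 10-Mel steps are meant to sound equally large, but the one starting at 1000 Mels covers more Hertz than the one starting at 500 Mels.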

Mel spectrograms are created by applying the Mel scale to the frequency axis of a standard spectrogram.
The process involves three main steps: Short-Time Fourier Transform (STFT), converting amplitude to decibels (logarithmic), and converting frequencies to the Mel scale.
Converting frequencies to the Mel scale involves defining a number of Mel bands (a hyperparameter, often 40-120), computing Mel filter banks, and applying them to the spectrogram.
Mel filter banks are typically represented as triangular filters, where each filter corresponds to a Mel band and has weights between 0 and 1.

Understanding the construction process reveals how a standard spectrogram is transformed into a perceptually relevant Mel spectrogram, bridging the gap between raw audio data and human auditory experience.
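
The first two steps (STFT, then conversion to decibels) can be sketched in NumPy as follows. This is a simplified illustration rather than a library-grade implementation: it assumes a Hann window, no padding at the signal edges, and a frame size of 2048 with a hop of 512, none of which is prescribed by the video. The test signal is a synthetic 440 Hz sine wave.

```python
import numpy as np

def power_spectrogram(signal, frame_size=2048, hop_length=512):
    """Squared STFT magnitude, shape (frame_size // 2 + 1, n_frames)."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop_length
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_size] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)  # one row of frequency bins per frame
    return (np.abs(spectrum) ** 2).T        # transpose to (bins, frames)

def power_to_db(power, floor=1e-10):
    """Convert power to decibels; the floor avoids taking log of zero."""
    return 10.0 * np.log10(np.maximum(power, floor))

sr = 22050                                   # sample rate in Hz (assumed)
t = np.arange(sr) / sr                       # one second of samples
signal = np.sin(2.0 * np.pi * 440.0 * t)     # synthetic 440 Hz tone

spec_db = power_to_db(power_spectrogram(signal))
print(spec_db.shape)  # (frequency bins, time frames)
```

The rows of `spec_db` are still linearly spaced in Hertz; the third step, applying Mel filter banks, is what changes the frequency axis.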

The visualization of six triangular filters, where each filter's center frequency is equally spaced in the Mel scale but increasingly spread out in Hertz, demonstrating the Mel filter bank's structure.
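
A filter bank like the one described can be sketched as below. The sketch assumes the 2595 * log10(1 + f / 700) Mel formula and evaluates each triangle directly at the STFT bin frequencies rather than rounding the band edges to the nearest bin; function and parameter names are illustrative.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(n_mels, frame_size, sr):
    """Triangular filters with weights in [0, 1], shape (n_mels, frame_size // 2 + 1)."""
    n_bins = frame_size // 2 + 1
    bin_hz = np.linspace(0.0, sr / 2.0, n_bins)  # frequency of each STFT bin
    # n_mels + 2 points equally spaced in Mels: lower edge, band centers, upper edge.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    edges_hz = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, center, right = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - left) / (center - left)       # 0 at left edge, 1 at center
        falling = (right - bin_hz) / (right - center)    # 1 at center, 0 at right edge
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, 1.0)
    return bank, edges_hz[1:-1]

bank, centers_hz = mel_filter_bank(n_mels=6, frame_size=2048, sr=22050)
print(bank.shape)                     # (6, 1025)
print(np.round(centers_hz))           # band centers in Hz
print(np.round(np.diff(centers_hz)))  # gaps between centers grow with frequency
```

The centers are evenly spaced in Mels by construction, so the printed gaps in Hertz widen from one band to the next.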

Mel filter banks can be represented as a matrix, with the number of rows equal to the number of Mel bands and the number of columns equal to the number of STFT frequency bins (frame size / 2 + 1).
Applying Mel filter banks to a spectrogram is mathematically achieved through matrix multiplication between the Mel filter bank matrix and the spectrogram matrix.
The resulting Mel spectrogram is a matrix where the number of rows equals the number of Mel bands, and the number of columns equals the number of time frames from the original spectrogram.
The Mel spectrogram visualizes time on the x-axis and perceptually relevant Mel bands on the y-axis, showing the energy in each band over time.

This step clarifies the mathematical operation that transforms a spectrogram into a Mel spectrogram, enabling its use in computational models.

The matrix multiplication between the Mel filter bank matrix (e.g., 6xN) and the spectrogram matrix (NxM) to produce the Mel spectrogram matrix (6xM).
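
The shapes in this multiplication can be illustrated with a minimal NumPy sketch. The matrices here are filled with random placeholder values standing in for a real filter bank and spectrogram, and the sizes (N = 1025 bins, M = 40 frames) are assumptions picked for the example.

```python
import numpy as np

n_mels, n_bins, n_frames = 6, 1025, 40

rng = np.random.default_rng(0)
# Placeholder data: random values stand in for real filter weights and spectrogram energies.
filter_bank = rng.random((n_mels, n_bins))    # one row per Mel band
spectrogram = rng.random((n_bins, n_frames))  # one column per time frame

# (6 x 1025) @ (1025 x 40) -> (6 x 40)
mel_spectrogram = filter_bank @ spectrogram
print(mel_spectrogram.shape)  # (6, 40)

# Each entry is a weighted sum over all frequency bins for one band and one frame.
band, frame = 2, 7
manual = sum(filter_bank[band, k] * spectrogram[k, frame] for k in range(n_bins))
print(np.isclose(mel_spectrogram[band, frame], manual))
```

The inner dimension (the number of frequency bins) disappears in the product, which is why the result has one row per Mel band and keeps one column per time frame.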

Mel spectrograms are widely used in AI audio and music research due to their perceptual relevance.
They are crucial for applications like audio classification, automatic music recognition, music genre classification, and instrument classification.
By aligning frequency representation with human pitch perception, Mel spectrograms often lead to better performance in machine learning models for these tasks.
The next step involves implementing the extraction and visualization of Mel spectrograms using tools like Python and Librosa.

This chapter emphasizes the practical importance and widespread adoption of Mel spectrograms in modern AI applications, motivating learners to understand and utilize them.

Mention of specific applications like music genre classification and automatic music recognition where Mel spectrograms are overwhelmingly used.

Key takeaways

1. Human perception of pitch is non-linear and logarithmic, meaning equal frequency differences in Hertz do not correspond to equal perceived pitch differences.
2. Standard spectrograms use a linear Hertz scale, which is perceptually inaccurate for representing pitch.
3. The Mel scale is a logarithmic scale that better approximates human pitch perception, making it suitable for audio analysis.
4. Mel spectrograms transform the frequency axis of a standard spectrogram from Hertz to the Mel scale.
5. The creation of Mel spectrograms involves using Mel filter banks, which are sets of triangular filters designed to map frequencies to the Mel scale.
6. Mathematically, converting a spectrogram to a Mel spectrogram is achieved through matrix multiplication with the Mel filter bank matrix.
7. Mel spectrograms are a fundamental feature representation in many AI applications for audio and music due to their perceptual relevance.

Key terms

Spectrogram
Short-Time Fourier Transform (STFT)
Hertz (Hz)
Mel Scale
Mel
Perceptual Relevance
Logarithmic Scale
Mel Filter Banks
Mel Bands
Mel Spectrogram

Test your understanding

1. Why is a linear frequency scale in Hertz problematic for representing human pitch perception?
2. How does the Mel scale differ from the Hertz scale in terms of perceptual relevance?
3. What are the three main steps involved in creating a Mel spectrogram from an audio signal?
4. How are Mel filter banks constructed, and what is their role in creating a Mel spectrogram?
5. What is the mathematical operation used to apply Mel filter banks to a spectrogram to obtain a Mel spectrogram?