
Mel Spectrograms Explained Easily
Valerio Velardo - The Sound of AI
Overview
This video explains Mel spectrograms, a type of audio representation crucial for AI in audio and music. It contrasts Mel spectrograms with standard spectrograms by highlighting how human pitch perception is non-linear. Standard spectrograms use a linear frequency scale (Hertz), which doesn't align with how humans perceive pitch differences, especially at higher frequencies. Mel spectrograms address this by using the Mel scale, a perceptually motivated logarithmic scale, making them more suitable for machine learning models that aim to mimic human auditory processing. The video details the process of converting a standard spectrogram to a Mel spectrogram, including understanding Mel filter banks and their mathematical representation.
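As a rough illustration of the non-linearity described above, the sketch below converts Hertz to mels using one widely used formula, m = 2595 * log10(1 + f / 700). The summary does not say which formula the video uses, so this choice is an assumption, as are the helper names `hz_to_mel` and `mel_to_hz` and the rounded note frequencies.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hertz to mels (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Approximate equal-tempered note frequencies in Hz (A4 = 440 Hz).
C2, C4 = 65.41, 261.63
G6, A6 = 1567.98, 1760.00

# Both gaps are close to 200 Hz, but they differ a lot once expressed in mels.
low_gap_mel = hz_to_mel(C4) - hz_to_mel(C2)
high_gap_mel = hz_to_mel(A6) - hz_to_mel(G6)

print(f"C2 to C4: {C4 - C2:.1f} Hz -> {low_gap_mel:.1f} mel")
print(f"G6 to A6: {A6 - G6:.1f} Hz -> {high_gap_mel:.1f} mel")
print(f"1000 Hz -> {hz_to_mel(1000.0):.1f} mel")
```

With this formula the low-frequency gap spans noticeably more mels than the high-frequency gap, even though both are nearly the same size in Hertz.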
Chapters
- Standard spectrograms visualize audio frequency content over time using Hertz on the y-axis.
- Human perception of pitch is non-linear; we are more sensitive to frequency changes at lower Hertz.
- A constant frequency difference in Hertz (about 200 Hz) sounds like a much larger pitch difference at low frequencies (C2 to C4, two octaves) than at high frequencies (G6 to A6, a whole tone).
- Standard spectrograms, using a linear Hertz scale, do not accurately reflect this perceptual difference in pitch.
- An ideal audio feature should represent time-frequency information and have perceptually relevant amplitude and frequency representations.
- The Mel scale is a perceptually motivated scale for pitch, designed to align with human hearing.
- The Mel scale is approximately logarithmic in Hertz, and equal distances on it correspond to equal perceived pitch differences.
- The Mel scale is standardized such that 1000 Hz corresponds to 1000 Mels, and it's derived empirically from psychoacoustic experiments.
- The term 'Mel' is an abbreviation for 'melody', reflecting its connection to pitch perception.
- Mel spectrograms are created by applying the Mel scale to the frequency axis of a standard spectrogram.
- The process involves three main steps: Short-Time Fourier Transform (STFT), converting amplitude to decibels (logarithmic), and converting frequencies to the Mel scale.
- Converting frequencies to the Mel scale involves defining a number of Mel bands (a hyperparameter, often 40-120), computing Mel filter banks, and applying them to the spectrogram.
- Mel filter banks are typically represented as triangular filters, where each filter corresponds to a Mel band and has weights between 0 and 1.
- Mel filter banks can be represented as a matrix, with one row per Mel band and one column per STFT frequency bin (frame size / 2 + 1).
- Applying Mel filter banks to a spectrogram is mathematically achieved through matrix multiplication between the Mel filter bank matrix and the spectrogram matrix.
- The resulting Mel spectrogram is a matrix where the number of rows equals the number of Mel bands, and the number of columns equals the number of time frames from the original spectrogram.
- The Mel spectrogram shows time on the x-axis and perceptually relevant Mel bands on the y-axis, with each value indicating the energy in that band at that time frame.
- Mel spectrograms are widely used in AI audio and music research due to their perceptual relevance.
- They are crucial for applications like audio classification, automatic music recognition, music genre classification, and instrument classification.
- By aligning frequency representation with human pitch perception, Mel spectrograms often lead to better performance in machine learning models for these tasks.
- The next step involves implementing the extraction and visualization of Mel spectrograms using tools like Python and Librosa.
Key takeaways
- Human perception of pitch is non-linear and roughly logarithmic, meaning equal frequency differences in Hertz do not correspond to equal perceived pitch differences.
- Standard spectrograms use a linear Hertz scale, which is perceptually inaccurate for representing pitch.
- The Mel scale is a logarithmic scale that better approximates human pitch perception, making it suitable for audio analysis.
- Mel spectrograms transform the frequency axis of a standard spectrogram from Hertz to the Mel scale.
- The creation of Mel spectrograms involves using Mel filter banks, which are sets of triangular filters designed to map frequencies to the Mel scale.
- Mathematically, converting a spectrogram to a Mel spectrogram is achieved through matrix multiplication with the Mel filter bank matrix.
- Mel spectrograms are a fundamental feature representation in many AI applications for audio and music due to their perceptual relevance.
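The filter bank and matrix multiplication mentioned in the takeaways can be sketched with NumPy as follows. This is a simplified illustration, not the video's or any library's exact implementation: the function name `mel_filter_bank`, the 2595 * log10(1 + f / 700) conversion, the unnormalized triangles, and the random stand-in spectrogram are all assumptions.

```python
import numpy as np


def hz_to_mel(freq_hz):
    """Hertz to mels (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)


def mel_filter_bank(n_mels, n_fft, sample_rate):
    """Build an (n_mels, n_fft // 2 + 1) matrix of triangular filters."""
    n_bins = n_fft // 2 + 1
    nyquist = sample_rate / 2
    # n_mels + 2 points equally spaced in mels: each triangle uses three
    # consecutive points as its left edge, centre, and right edge.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(nyquist), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.linspace(0.0, nyquist, n_bins)  # centre frequency of each bin

    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        left, centre, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (bin_freqs - left) / (centre - left)
        falling = (right - bin_freqs) / (right - centre)
        # Triangle: weights rise towards the centre, then fall, clipped to [0, 1].
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, 1.0)
    return bank


n_mels, n_fft, sample_rate, n_frames = 10, 2048, 22050, 44
bank = mel_filter_bank(n_mels, n_fft, sample_rate)

# Random stand-in for a spectrogram: (frequency bins, time frames).
rng = np.random.default_rng(0)
spectrogram = rng.random((n_fft // 2 + 1, n_frames))

# (n_mels, n_bins) @ (n_bins, n_frames) -> (n_mels, n_frames)
mel_spectrogram = bank @ spectrogram
print(bank.shape, spectrogram.shape, mel_spectrogram.shape)
# (10, 1025) (1025, 44) (10, 44)
```

Each row of `bank` is one triangular filter, so the product collapses the 1025 linear frequency bins into 10 Mel bands while leaving the number of time frames unchanged.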
Test your understanding
- Why is a linear frequency scale in Hertz problematic for representing human pitch perception?
- How does the Mel scale differ from the Hertz scale in terms of perceptual relevance?
- What are the three main steps involved in creating a Mel spectrogram from an audio signal?
- How are Mel filter banks constructed, and what is their role in creating a Mel spectrogram?
- What is the mathematical operation used to apply Mel filter banks to a spectrogram to obtain a Mel spectrogram?