Data Science

Building an Audio Classifier (Part 2)

4 min read · Aug 15, 2023

I have been sharing my journey of building an audio classifier through Medium posts. So far, I have covered how to read the audio data; the link to that article is below. Reading it before this article will give you useful context on the data we will be working with in this case study.

Building an audio classifier

Part 1: Loading the data and generating labels

gghantiwala.medium.com

With that introduction, let’s get going with today’s article.

An audio file can be represented as a waveform, with amplitude on the y-axis and time on the x-axis. The sampling rate is the number of samples recorded per second. The default sampling rate librosa uses is 22050 Hz, which can represent frequencies up to about 11 kHz (half the sampling rate). That is below the roughly 20 kHz upper limit of human hearing, but it still covers most of the energy in speech. Let’s see how different sampling rates affect the audio.
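To make the relationship between sampling rate, number of samples, and the highest representable frequency concrete, here is a minimal sketch. The helper names and the 2-second clip length are assumptions chosen for illustration; they do not come from the dataset.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency (in Hz) that a given sampling rate can represent."""
    return sample_rate_hz / 2


def num_samples(duration_s, sample_rate_hz):
    """Number of samples needed to store a clip of the given duration."""
    return int(round(duration_s * sample_rate_hz))


clip_seconds = 2.0  # hypothetical clip length, for illustration only

# The sampling rates used in this article
for sr in (22050, 4000, 1000, 10):
    print(sr, num_samples(clip_seconds, sr), nyquist_frequency(sr))
```

The lower the sampling rate, the fewer numbers we keep per second and the lower the highest frequency we can still capture.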

Firstly, we will use a sampling rate of 4000 Hz

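# A happy sample at SR = 4000
import librosa
import librosa.display
import matplotlib.pyplot as plt
import IPython.display as ipd
SAMPLE_RATE = 4000
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
# Load the file and resample it to SAMPLE_RATE
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
plt.figure(figsize=(15, 5))
librosa.display.waveshow(audio, sr=sampling_rate)
# Play the resampled array; when given a file path, ipd.Audio ignores `rate` and plays the original file
ipd.Audio(audio, rate=sampling_rate)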

Output: (waveform plot of the sample at 4000 Hz)

Next, let’s try a sampling rate of 1000 Hz

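# A happy sample at SR = 1000
SAMPLE_RATE = 1000
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
# Load the file and resample it to SAMPLE_RATE
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
plt.figure(figsize=(15, 5))
librosa.display.waveshow(audio, sr=sampling_rate)
# Play the resampled array; when given a file path, ipd.Audio ignores `rate` and plays the original file
ipd.Audio(audio, rate=sampling_rate)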

Lastly, let’s take an extreme example of a sampling rate of 10 Hz

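# A happy sample at SR = 10
SAMPLE_RATE = 10
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
# Load the file and resample it to SAMPLE_RATE
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
plt.figure(figsize=(15, 5))
librosa.display.waveshow(audio, sr=sampling_rate)
# Play the resampled array; a rate this low may not play in every browser
ipd.Audio(audio, rate=sampling_rate)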

Observations from this exercise:

As we reduce the sampling rate, the audio loses detail and sounds increasingly muffled. We can think of the sampling rate in audio as being like the resolution of an image or the video quality on YouTube: 1080p naturally streams a sharper video than 720p.
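One way to see why detail is lost is to look at what a low sampling rate can and cannot tell apart. The sketch below is an illustration with synthetic sine tones, not part of the original notebook: it samples a 3000 Hz tone and a 1000 Hz tone at 4000 Hz and shows that the two sets of samples are identical up to sign. The helper name `sample_sine` and the chosen frequencies are assumptions. In practice, resamplers typically low-pass filter the signal first, so frequencies above half the sampling rate are removed rather than folded back like this.

```python
import math


def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Sample a unit-amplitude sine wave at the given sampling rate."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


SR = 4000  # sampling rate in Hz; frequencies above SR / 2 = 2000 Hz cannot be represented

high_tone = sample_sine(3000, SR, 16)  # above the 2000 Hz limit
low_tone = sample_sine(1000, SR, 16)   # below the 2000 Hz limit

# Sampled at 4000 Hz, the 3000 Hz tone equals the negated 1000 Hz tone,
# so the sum of the two sample sequences is (numerically) zero everywhere.
max_diff = max(abs(h + l) for h, l in zip(high_tone, low_tone))
print(max_diff)
```

Once sampled, the high tone is indistinguishable from a lower one, which is why anything above half the sampling rate has to be discarded.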

The feature we would be extracting is MFCC.

MFCC stands for Mel-frequency cepstral coefficients. It is a feature derived from the frequency domain, and you can think of it as an x-ray of your mouth, in the sense that it captures the shape of the vocal tract that produces the sound. (This analogy is courtesy of: https://www.kaggle.com/code/ejlok1/part-2-extracting-audio-features/notebook)

In case you want a detailed explanation of the mathematics behind MFCC, you can refer to this excellent blog: https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd
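As a small taste of the "Mel" part of the name, the sketch below converts between Hz and the mel scale, which spaces frequencies closer to how we perceive pitch. It uses the common HTK-style formula as an assumption for illustration; librosa's default conversion uses a different (Slaney-style) variant, so the exact numbers will not match librosa's output. The function names are my own.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz become smaller and smaller steps in mels as frequency rises
for f in (500, 1000, 2000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1))
```

The mel values grow more slowly at higher frequencies, which reflects the fact that we are less sensitive to pitch differences there.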

We will use the librosa library, which provides a method to generate the MFCC feature in a single line. The code below generates the MFCC for the happy audio sample.

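# MFCC for the happy sample
SAMPLE_RATE = 22050
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
# Compute 5 MFCCs per frame; the result has shape (5, number_of_frames)
mfcc = librosa.feature.mfcc(y=audio, sr=SAMPLE_RATE, n_mfcc=5)
plt.figure(figsize=(12, 6))
plt.subplot(3, 1, 1)
# Show the coefficients as a heatmap (rows: coefficients, columns: frames)
librosa.display.specshow(mfcc)
plt.ylabel('MFCC')
plt.colorbar()
plt.show()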

Similarly, in the next snippet, we generate the MFCC for a different audio sample, this time one expressing disgust.

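# MFCC for the disgust sample
SAMPLE_RATE = 22050
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_DIS_LO.wav'
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
# Compute 5 MFCCs per frame; the result has shape (5, number_of_frames)
mfcc = librosa.feature.mfcc(y=audio, sr=SAMPLE_RATE, n_mfcc=5)
plt.figure(figsize=(12, 6))
plt.subplot(3, 1, 1)
# Show the coefficients as a heatmap (rows: coefficients, columns: frames)
librosa.display.specshow(mfcc)
plt.ylabel('MFCC')
plt.colorbar()
plt.show()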

We can see a difference between the two MFCC plots, the first being from a happy audio sample and the second from one emulating disgust. Although I cannot pinpoint exactly what the difference is, as long as there is a difference, our classifier can learn from it.
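If we want a number rather than a visual impression, one possible approach (an assumption on my part, not something the notebook above does) is to collapse each MFCC matrix into a fixed-length vector and measure the distance between the two vectors. The sketch below uses randomly generated stand-in matrices instead of real audio, so the printed distance is purely illustrative; `summarize_mfcc` is a hypothetical helper.

```python
import numpy as np


def summarize_mfcc(mfcc):
    """Collapse an (n_mfcc, n_frames) matrix into per-coefficient mean and std."""
    mfcc = np.asarray(mfcc, dtype=float)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


# Stand-in matrices (random, NOT real audio); in practice these would be
# the outputs of librosa.feature.mfcc for the two clips.
rng = np.random.default_rng(0)
happy_mfcc = rng.normal(loc=0.0, scale=1.0, size=(5, 80))
disgust_mfcc = rng.normal(loc=0.5, scale=1.0, size=(5, 120))

# Clips of different lengths still give vectors of the same size
happy_vec = summarize_mfcc(happy_mfcc)
disgust_vec = summarize_mfcc(disgust_mfcc)

# Euclidean distance between the two summaries
distance = float(np.linalg.norm(happy_vec - disgust_vec))
print(happy_vec.shape, disgust_vec.shape, round(distance, 3))
```

Averaging over time also solves a practical problem: clips of different lengths produce MFCC matrices with different numbers of frames, while most classifiers expect inputs of a fixed size.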

Like MFCC, there are many other features we can work with, but I want to first build a baseline model with MFCC before I dive deep into the many features available for audio data. In the next article, I will work on building the classifier. Stay tuned!

Written by Alifia Ghantiwala