Building an Audio Classifier (Part 2)

Alifia Ghantiwala
4 min read · Aug 15, 2023
Photo by C D-X on Unsplash

I have been sharing my journey of building an audio classifier through a series of Medium posts. So far, I have covered how to read audio data (Part 1). Going through that article before you read this one will give you useful context on the data we will be working with in this case study.

With that introduction, let’s get going with today’s article.

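An audio file can be represented as a waveform, with amplitude on the y-axis and time on the x-axis. The sampling rate is the number of samples captured per second. The default sampling rate librosa uses is 22050 Hz. A signal sampled at that rate can represent frequencies up to half of it, about 11 kHz (the Nyquist limit). The highest frequency humans can hear is roughly 20 kHz, which is why CD-quality audio is sampled at 44100 Hz. Let’s see how different sampling rates affect the audio.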
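To make the idea concrete, here is a minimal sketch of how the sampling rate determines both the number of samples stored and the highest frequency that can be represented. The helper names and the 2-second clip length are my own assumptions for illustration; they are not taken from the dataset.

```python
# Hypothetical helpers; the clip length below is an assumed value,
# not something read from the CREMA-D files.

def num_samples(duration_s, sample_rate):
    """Number of amplitude values stored for a clip of the given length."""
    return int(round(duration_s * sample_rate))


def nyquist_hz(sample_rate):
    """Highest frequency a given sampling rate can represent (half the rate)."""
    return sample_rate / 2


clip_seconds = 2.0  # assumed clip length, for illustration only
for sr in (22050, 4000, 1000, 10):
    print(
        f"sr={sr:>5} Hz -> {num_samples(clip_seconds, sr):>5} samples, "
        f"frequencies up to {nyquist_hz(sr):.1f} Hz"
    )
```

The lower the sampling rate, the fewer values describe each second of audio and the lower the ceiling on the frequencies that survive.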

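First, we will use a sampling rate of 4000 Hz.

# A happy sample at SR = 4000
import librosa
import librosa.display
import matplotlib.pyplot as plt
import IPython.display as ipd
SAMPLE_RATE = 4000
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
# librosa resamples the file to SAMPLE_RATE while loading
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
plt.figure(figsize=(15, 5))
librosa.display.waveshow(audio, sr=sampling_rate)
# Play the resampled array; passing the file path would play the original file
ipd.Audio(audio, rate=sampling_rate)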

Output:

Image by Author

Next, let’s try a sampling rate of 1000 Hz.

# A happy sample at SR = 1000
SAMPLE_RATE = 1000
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
plt.figure(figsize=(15, 5))
librosa.display.waveshow(audio, sr=sampling_rate)
# Play the resampled array rather than the original file
ipd.Audio(audio, rate=sampling_rate)
Image by Author

Lastly, let’s take an extreme example: a sampling rate of 10 Hz.

# A happy sample at SR = 10
SAMPLE_RATE = 10
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
plt.figure(figsize=(15, 5))
librosa.display.waveshow(audio, sr=sampling_rate)
# Play the resampled array; a rate this low may not be playable in every browser
ipd.Audio(audio, rate=sampling_rate)
Image by Author

Observations from this exercise:

As we reduce the sampling rate, the audio loses detail and sounds muffled, because fewer samples per second can only capture the lower frequencies. You can think of the sampling rate in audio as being like the resolution of an image or the video quality on YouTube: naturally, 1080p streams higher-quality video than 720p.
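Here is a small sketch of why frequencies above half the sampling rate cannot be captured. It samples a 3000 Hz tone at 4000 Hz and shows that the resulting values are indistinguishable (up to sign) from a 1000 Hz tone. The tone frequencies and the helper name are assumptions I picked for illustration. In practice, resamplers such as the one librosa relies on typically filter out the high frequencies before downsampling, so they are removed rather than folded onto lower ones.

```python
import math

SAMPLE_RATE = 4000                 # Hz, same rate as the first experiment
TONE_HZ = 3000                     # above the 2000 Hz limit for this rate
ALIAS_HZ = SAMPLE_RATE - TONE_HZ   # 1000 Hz, where the tone folds to


def sample_sine(freq_hz, sample_rate, n_samples):
    """Sample a unit-amplitude sine wave at the given rate (hypothetical helper)."""
    return [
        math.sin(2 * math.pi * freq_hz * n / sample_rate)
        for n in range(n_samples)
    ]


high = sample_sine(TONE_HZ, SAMPLE_RATE, 16)
low = sample_sine(ALIAS_HZ, SAMPLE_RATE, 16)

# The sampled 3000 Hz tone is the sampled 1000 Hz tone with its sign flipped,
# so adding the two sequences should give (numerically) zero everywhere.
max_diff = max(abs(h + l) for h, l in zip(high, low))
print(f"largest mismatch between the two sampled tones: {max_diff:.2e}")
```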

The feature we will extract is MFCC.

MFCC stands for Mel-frequency cepstral coefficients. It is a feature derived from the frequency content of the signal, and you can think of it as an x-ray of your mouth, that is, of the vocal tract that shapes the sound coming out of it. (This analogy is courtesy of: https://www.kaggle.com/code/ejlok1/part-2-extracting-audio-features/notebook)

In case you want a detailed explanation of the mathematics behind MFCC, you can refer to this excellent blog: https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd
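To give a feel for the "Mel" part of the name, here is a minimal sketch of one commonly used Hz-to-mel conversion, m = 2595 · log10(1 + f / 700). The function names are my own, and this is only one variant of the scale; as far as I know, librosa uses a slightly different (Slaney-style) definition by default, so treat this as an illustration rather than what librosa computes internally.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (HTK-style formula, assumed variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


for f in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

The scale is compressed at high frequencies: a 1000 Hz step near 7000 Hz covers fewer mels than the same step near 1000 Hz, which roughly mirrors how we perceive pitch.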

We will use the librosa library, which provides a method to generate the MFCC feature in a single line. The code below generates the MFCCs for the happy audio sample.

# MFCCs of a happy sample at librosa's default sampling rate
SAMPLE_RATE = 22050
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
# Keep the first 5 coefficients; the result has one column per frame
mfcc = librosa.feature.mfcc(y=audio, sr=sampling_rate, n_mfcc=5)
plt.figure(figsize=(12, 6))
plt.subplot(3, 1, 1)
librosa.display.specshow(mfcc)
plt.ylabel('MFCC')
plt.colorbar()
plt.show()
Image by Author
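The matrix returned by librosa.feature.mfcc has one row per coefficient and one column per frame. Here is a rough sketch of how to estimate its shape, assuming librosa's default hop length of 512 samples and centered frames. The helper name and the 2-second clip length are hypothetical and only serve as an illustration.

```python
def mfcc_shape(n_samples, n_mfcc=5, hop_length=512):
    """Estimate the (rows, columns) of an MFCC matrix.

    Assumes centered frames, in which case the number of frames is
    1 + n_samples // hop_length.
    """
    n_frames = 1 + n_samples // hop_length
    return (n_mfcc, n_frames)


# An assumed 2-second clip at 22050 Hz has 44100 samples.
print(mfcc_shape(2 * 22050))
```

With real data, you can compare this estimate against mfcc.shape for the array produced by the snippet above.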

Similarly, in the next snippet, we generate the MFCCs for a different audio sample, this time one expressing disgust.

# MFCCs of a disgust sample at librosa's default sampling rate
SAMPLE_RATE = 22050
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_DIS_LO.wav'
audio, sampling_rate = librosa.load(v_path, sr=SAMPLE_RATE)
mfcc = librosa.feature.mfcc(y=audio, sr=sampling_rate, n_mfcc=5)
plt.figure(figsize=(12, 6))
plt.subplot(3, 1, 1)
librosa.display.specshow(mfcc)
plt.ylabel('MFCC')
plt.colorbar()
plt.show()
Image by Author

We can see a difference between the two MFCC plots, the first being from a happy sample and the second from a sample expressing disgust. Though I cannot pinpoint exactly what the difference is, as long as the features differ between emotions, our classifier has something to work with.
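One simple way to put a number on the difference is to average each coefficient over time and compare the averages. The sketch below uses made-up values in plain Python lists so that it runs without any audio files; the function name and the numbers are hypothetical and are not results from the dataset.

```python
from statistics import fmean


def mean_per_coefficient(mfcc):
    """Average each MFCC row over time, giving one number per coefficient."""
    return [fmean(row) for row in mfcc]


# Made-up values standing in for two (n_mfcc x n_frames) matrices.
happy_mfcc = [[-300.0, -280.0, -310.0], [90.0, 110.0, 100.0]]
disgust_mfcc = [[-350.0, -340.0, -360.0], [70.0, 80.0, 75.0]]

happy_mean = mean_per_coefficient(happy_mfcc)
disgust_mean = mean_per_coefficient(disgust_mfcc)

# Per-coefficient gap between the two (toy) samples.
diff = [h - d for h, d in zip(happy_mean, disgust_mean)]
for i, d in enumerate(diff):
    print(f"coefficient {i}: happy - disgust = {d:.2f}")
```

With the NumPy arrays that librosa returns, the same averages can be obtained with mfcc.mean(axis=1).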

Besides MFCC, there are many other features we can work with, but I want to first build a baseline model with MFCC before I dive deep into the many features available for audio data. In the next article, I will work on building the classifier. Stay tuned!
