Data Science

Building an Audio Classifier (Part 1)

Get started from scratch

Alifia Ghantiwala


Photo by Kelly Sikkema on Unsplash

In this article series, we will build an audio classifier. The main steps are:

1) Data exploration

2) Feature extraction

3) Model building

Each of these steps will be covered in a separate article. Since this is part 1, we will focus on data exploration.

This is my early attempt at working with audio data, and this work is heavily inspired by: https://www.kaggle.com/code/ejlok1/audio-emotion-part-1-explore-data

I will try to note down my learnings along with the code; if you like the article, please do upvote! :)

With the introduction aside, let’s get to work.

We’ll start by loading the Python libraries we’ll need in the notebook. I’ll provide more information about each library as we use it in the code.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os # file and directory handling
import seaborn as sns # statistical plotting
import librosa # audio loading and analysis
import librosa.display # waveform and spectrogram plotting
import matplotlib.pyplot as plt # general plotting
import IPython.display as ipd # playing audio inside the notebook

Information about the data (from their official GitHub page: https://github.com/CheyneyComputerScience/CREMA-D)

CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)

CREMA-D is a data set of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74 coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified).

Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four different emotion levels (Low, Medium, High, and Unspecified).

Participants rated the emotion and emotion levels based on the combined audiovisual presentation, the video alone, and the audio alone. Due to the large number of ratings needed, this effort was crowd-sourced and a total of 2443 participants each rated 90 unique clips, 30 audio, 30 visual, and 30 audio-visual. 95% of the clips have more than 7 ratings.

#Looking at the file paths.
PATH = "/kaggle/input/cremad/AudioWAV"
dir_list = os.listdir(PATH) #lists the files within a directory/folder
dir_list.sort()
print(dir_list[0:10])

Output:

Image by Author

The data is not explicitly labeled, but if you look at the file names you’ll see that they encode the emotion. For example, the file ‘1001_DFA_ANG_XX.wav’ contains ANG, which means it represents the emotion of anger. We’ll leverage this information to generate the labels below.

#Let's generate labels from the emotion code in each file name
emot_labels = []
path = []
for i in dir_list:
    part = i.split('_')
    path.append(os.path.join(PATH, i)) # join with a separator, since PATH has no trailing slash
    if part[2] == 'ANG':
        emot_labels.append('angry')
    elif part[2] == 'DIS':
        emot_labels.append('disgust')
    elif part[2] == 'FEA':
        emot_labels.append('fear')
    elif part[2] == 'HAP':
        emot_labels.append('happy')
    elif part[2] == 'NEU':
        emot_labels.append('neutral')
    elif part[2] == 'SAD':
        emot_labels.append('sad')
    else:
        emot_labels.append('unknown')
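The chain of elif branches works, but the same mapping can be written more compactly with a dictionary. Here is a minimal alternative sketch, assuming the same file-name convention as above:

# Map the emotion code embedded in the file name to a readable label.
emotion_map = {'ANG': 'angry', 'DIS': 'disgust', 'FEA': 'fear',
               'HAP': 'happy', 'NEU': 'neutral', 'SAD': 'sad'}

emot_labels = [emotion_map.get(f.split('_')[2], 'unknown') for f in dir_list]
path = [os.path.join(PATH, f) for f in dir_list]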

Next, we create a DataFrame with one column for the file path and another for the corresponding label.

data = pd.DataFrame(emot_labels,columns=['labels'])
data = pd.concat([data,pd.DataFrame(path,columns=['path'])],axis=1)
data.labels.value_counts()
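Since seaborn is already imported, we can also visualize the class distribution instead of only printing the counts. A quick sketch, assuming the data frame built above:

# Visualize how many clips we have per emotion.
plt.figure(figsize=(10, 4))
sns.countplot(x='labels', data=data)
plt.title('Number of clips per emotion')
plt.show()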

Let’s listen to some samples now using the librosa library, which is widely used for audio analysis in Python. We’ll also display the waveform of the audio using librosa.display and play it with IPython.display.

# An angry sample
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_ANG_HI.wav'
audio , sampling_rate = librosa.load(v_path)
plt.figure(figsize=(15,5))
librosa.display.waveshow(audio , sr = sampling_rate)
ipd.Audio(v_path)

I was not able to embed the audio output here, but you can clone my notebook and run it to see (and hear) the results.

The output would look something like this.

Image by Author
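As a quick sanity check, librosa also lets us inspect the sampling rate and duration of the loaded clip. A small sketch (by default, librosa.load resamples the audio to 22,050 Hz):

# Inspect the loaded audio: sampling rate and clip length in seconds.
duration = librosa.get_duration(y=audio, sr=sampling_rate)
print(f"Sampling rate: {sampling_rate} Hz, duration: {duration:.2f} seconds")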

Let’s load another sample, this time with a different emotion.

#A happy sample
v_path = '/kaggle/input/cremad/AudioWAV/1001_IEO_HAP_HI.wav'
audio , sampling_rate = librosa.load(v_path)
plt.figure(figsize=(15,5))
librosa.display.waveshow(audio , sr = sampling_rate)
ipd.Audio(v_path)

Image by Author

Observations from the data:

Listening to the above two samples of the same sentence spoken in two different emotions, we can clearly hear the difference, and the waveforms look different as well. It will be interesting to see what features we can extract from this data.

Each clip is only about 2 seconds long and has some silence at the start and end of the audio; we can trim that silence, as shown in the sketch below.
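If we want to drop that leading and trailing silence before feature extraction, librosa provides a trim utility. A minimal sketch (the top_db threshold of 30 is an assumption you may want to tune):

# Trim leading/trailing silence below the chosen decibel threshold.
trimmed_audio, trim_index = librosa.effects.trim(audio, top_db=30)
print(f"Original length: {len(audio)} samples, trimmed: {len(trimmed_audio)} samples")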

Here is the link to my notebook, which you can run to replicate the results:

https://www.kaggle.com/code/aliphya/building-an-audio-classifier-pt-1

I will continue working on the next parts of the article and link them here.

Happy coding! Toodaloo!
