Building an audio classifier (Part 3)

Alifia Ghantiwala
2 min read · Aug 30, 2023

If you have not had a chance to go through the previous articles in this series: I am building an end-to-end audio classifier, and in the preceding articles I covered an introduction to the data we are working with and feature extraction. In this article, we will build a baseline model on the features we have generated.

Ready to roll?

In the last article, we created MFCC features for a few audio samples; we will now do it for every sample in our data frame. Recall that librosa returns MFCCs with shape (n_mfcc, n_frames), where the number of frames is roughly the number of audio samples divided by the hop length. After averaging across the 13 coefficients, each clip yields one value per frame, so with 7442 clips of 2.5 seconds each our feature set has a shape of (7442, 216).
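As a quick sanity check on that 216, the frame count follows directly from librosa's defaults (a sketch assuming hop_length=512 and centered frames, which librosa uses unless told otherwise):

import numpy as np

sr = 44100          # sampling rate we load with
duration = 2.5      # seconds kept per clip
hop_length = 512    # librosa's default hop for mfcc

n_samples = int(sr * duration)            # 110250 samples
n_frames = 1 + n_samples // hop_length    # centered STFT framing
print(n_frames)  # 216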

import numpy as np
import pandas as pd
import librosa

# Create a new dataframe for MFCC features.
df = pd.DataFrame(columns=['fea'])
# Iterate over the entire dataset and average the MFCC coefficients.
counter = 0
for index, path in enumerate(data.path):
    audio, sampling_rate = librosa.load(path,
                                        duration=2.5,
                                        sr=44100,
                                        offset=0.5)
    mfcc = np.mean(librosa.feature.mfcc(y=audio, sr=sampling_rate, n_mfcc=13), axis=0)
    df.loc[counter] = [mfcc]
    counter = counter + 1
print(len(df))
df.head()
# Expand each per-clip feature vector into its own columns alongside the metadata.
df = pd.concat([data, pd.DataFrame(df['fea'].values.tolist())], axis=1)
# Fill empty values (clips shorter than 2.5 s produce fewer frames) with 0.
df = df.fillna(0)
print(df.shape)

Next, as per standard model training practice, we split the dataset into train and test sets and normalize the data to prevent some features from overpowering the others. Note that the mean and standard deviation are computed on the training split only, then applied to both splits, so no test-set statistics leak into training.

from sklearn.model_selection import train_test_split

# Split into train & test sets, then normalise using train statistics only.
X_train, X_test, y_train, y_test = train_test_split(df.drop(['path', 'labels'], axis=1),
                                                    df.labels,
                                                    test_size=0.25,
                                                    shuffle=True,
                                                    random_state=42)
X_train.shape
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std
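An equivalent, slightly more idiomatic way to do the same normalisation is sklearn's StandardScaler; a minimal sketch on made-up data (the random arrays here stand in for our real feature matrix):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(20, 5))

# Fit on the training split only, then transform both splits.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Matches the manual (x - mean) / std approach above.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_s, manual))  # True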

Next, we convert the labels into integers and initialize an LGBM classifier.

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier, early_stopping

le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)
clf = LGBMClassifier(n_estimators=50,
                     random_state=72,
                     extra_trees=True,
                     min_gain_to_split=0.2,
                     min_data_in_leaf=10)
%%time
clf.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        callbacks=[early_stopping(100)])
predictions = clf.predict(X_test)
acc = accuracy_score(y_test, predictions)
print(acc)
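LabelEncoder simply maps each distinct label string to an integer index, sorted alphabetically. A toy illustration (the emotion labels here are made up for the example, not necessarily the dataset's actual classes):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = ["happy", "sad", "angry", "happy"]
encoded = le.fit_transform(y)
print(list(le.classes_))  # ['angry', 'happy', 'sad'] (sorted order)
print(list(encoded))      # [1, 2, 0, 1]
print(list(le.inverse_transform(encoded)))  # back to the original strings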

The model reaches an accuracy of 36% on the test data, so there is certainly plenty of room for improvement, which I will work on. The intention behind this series was to get my hands dirty working with audio data and to build a baseline model, which I think I have fairly achieved.

You can refer to my notebook here:

If you have enjoyed reading this series and have suggestions for other case studies you would like me to cover, do let me know. Thanks for reading along, have a great day! :)
