ELEC-E5510 — Exercise 1: Feature extraction and Gaussian mixture models

The goal of this exercise is to get familiar with the feature extraction process used in automatic speech recognition and to learn about the Gaussian mixture models (GMMs) used to model the feature distributions. Follow the submission instructions when returning your answers. The deadline is Wednesday 2.11.2022 at 23:59.

This exercise uses Matlab and a toolbox called GMM Bayes Toolbox (modified slightly for the purposes of this course) to demonstrate the feature extraction process as well as phoneme modelling and classification with GMMs.

Start Matlab (type matlab) and load the exercise data by typing:

addpath /work/courses/T/S/89/5150/general/ex1
addpath /work/courses/T/S/89/5150/general/ex1/gmmbayestb
load ex1data

MFCC feature extraction step by step

More information about each step of MFCC extraction: Mel Frequency Cepstral Coefficient (MFCC) tutorial.

Let's start the exercise by extracting the so-called MFCC features of a sample word. The variable sampleword in the workspace contains the waveform of the Finnish word 'pyörremyrskyistä', sampled at a rate of 16000 Hz. You can plot it with:

plot(sampleword)

The first step of the feature extraction is the computation of the short-time Fourier spectrum. With Matlab:

s = spectrogram(sampleword, hamming(400), 240);
imagesc(sqrt(abs(s)))
axis xy
sample_word_segmentation

This extracts the short-time Fourier spectrum of the sample word, using a 25 ms Hamming window and a 10 ms frame rate, and displays it. The custom Matlab function sample_word_segmentation is provided to visualize the phoneme segmentation of the sample word.
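To see where the numbers 400 and 240 come from, note that they follow from the 16000 Hz sampling rate. The snippet below is only a sanity check of that arithmetic, not part of the feature extraction pipeline:

fs = 16000;                % sampling rate (Hz)
win = round(0.025 * fs)    % 25 ms window -> 400 samples
hop = round(0.010 * fs);   % 10 ms frame rate -> 160 samples
overlap = win - hop        % spectrogram overlap argument -> 240 samples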

It is advisable to apply a pre-emphasis (high-pass) filter to the waveform before computing the spectrum. You can compare the results with and without the filter:

s2 = spectrogram(filter([1 -0.97], 1, sampleword), hamming(400), 240);
figure
imagesc(sqrt(abs(s2)))
axis xy
sample_word_segmentation

Continue with the filtered version of the spectrum. The next phase is to apply a non-linear frequency transformation. In the workspace, the variable M contains a matrix that, when applied to the spectrum, computes the so-called mel-spectrum. You can visualize the triangular filters in M with:

plot(M', 'b')
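For reference, a common analytic approximation of the mel scale (one standard convention; the exact filter placement in M may differ slightly) is

$$ \text{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) $$

which is approximately linear below 1 kHz and logarithmic above it.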

After the mel-transformation, the logarithm of the frequency-bin energies is taken to compress the dynamic range of the energy values. Visualize the resulting logarithmic mel-spectrum:

figure
imagesc(log(M*sqrt(abs(s2))+1))
axis xy
sample_word_segmentation

The last phase is to decorrelate the features and reduce the dimension. This is done using the discrete cosine transform (DCT). The variable D contains the required matrix; you can visualize it as you did with matrix M, or plot the whole matrix with

imagesc(D)
colorbar

if you like. The final features are obtained simply by:

figure
imagesc(D*log(M*sqrt(abs(s2))+1))
axis xy
sample_word_segmentation

The final features are called Mel-frequency cepstral coefficients or MFCCs.
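Written out, each MFCC frame is a discrete cosine transform of the log mel energies. With one common DCT-II convention (the exact scaling used in D may differ), coefficient \(n\) of a frame with \(M\) log mel energies \(e_1, \dots, e_M\) is

$$ c_n = \sum_{m=1}^{M} e_m \cos\!\left(\frac{\pi n (m - \frac{1}{2})}{M}\right) $$

The dimension is reduced simply by keeping only the lowest coefficients, which describe the smooth spectral envelope.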

Question 1

  1. What are the properties of MFCC features that make them well suited for automatic speech recognition?
  2. Why would plain spectrogram or mel-spectrum features not work as well?

Gaussian mixture models

Gaussian mixture models (GMMs) are a very common model for feature distributions in speech recognition. The advantages of GMMs include flexibility, generality and the existence of an efficient estimation algorithm. Most often GMMs are estimated using the Expectation-Maximization (EM) algorithm. It is an iterative algorithm that, starting from an initial model, improves the model such that its likelihood is guaranteed not to decrease at any iteration. The drawbacks of the algorithm are that the number of mixture components must be known beforehand and that in general only a local maximum of the likelihood is found. In practice, however, with some heuristics and a good initialization, the EM algorithm works very well.
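Written out, a GMM models the density of a feature vector \(x\) as a weighted sum of \(K\) Gaussian components,

$$ p(x) = \sum_{k=1}^{K} w_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} w_k = 1, $$

where the EM algorithm estimates the weights \(w_k\), the means \(\mu_k\) and the covariance matrices \(\Sigma_k\).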

If you are interested, you can learn more about the EM algorithm from this Gentle Tutorial of the EM algorithm.

To begin, run a demo of a 2-dimensional GMM estimation using the EM algorithm:

gmm_demo

For the homework you will be training GMMs with the GMM Bayes Toolbox. The training data for the GMMs is in the variable train_data, which contains at most 3000 samples for each class; the classes are the 17 most common Finnish phonemes. The samples have been taken from a database of 50 male speakers. There is no time structure in the training samples; they have been picked from random positions within the phones. The class numbers of the training data are in the variable train_class, and the phoneme labels corresponding to the class numbers are in phonemes.
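You can check the data layout in your own workspace; the sizes suggested in the comments below are inferred from the description above and from the 26-dimensional features mentioned later, so verify them yourself:

size(train_data)    % expected: one 26-dimensional feature vector per row
size(train_class)   % one class number per sample
phonemes            % the labels of the 17 phoneme classes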

Using the wrapper function provided for the exercise, train a GMM by typing:

S = train_gmm(train_data, train_class, 10);

The number 10 is the number of mixture components for each class. The command returns a structure S that contains all the necessary information about the Gaussian mixture models. The differences from the 2-dimensional demo are that the mixture components are now 26-dimensional and that their covariance matrices are restricted to be diagonal, which greatly reduces the number of parameters to be estimated.

The newly trained GMM can be used for recognition by chaining the computation of density functions, normalization and decision processes:

result = gmmb_decide(gmmb_normalize(gmmb_pdf(train_data, S)));
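Step by step, the chain decomposes as follows (the intermediate variable names here are just illustrative; see the help texts of the toolbox functions for their exact semantics):

P = gmmb_pdf(train_data, S);   % density value of each sample under each class model
Pn = gmmb_normalize(P);        % normalize the values of each sample to sum to one
result = gmmb_decide(Pn);      % pick the class with the highest value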

result now contains the recognized class numbers of the training data. It is a large vector; to view a part of it decoded with the phoneme labels, type e.g.:

phonemes(result(2991:3010))

This shows the recognition results of 10 samples of /a/ and 10 samples of /e/ phones.

Lastly, you can compute the error percentage of the recognition by comparing the result to the reference class numbers:

length(find(result~=train_class))/length(train_class)*100

Question 2

Variables test_data and test_class contain the same kind of data as the training data, but independent of it: in this case the data was obtained from different speakers. Using the training data, train phoneme models with different numbers of mixture components and evaluate their performance on both the training set and the independent test set.

Plot the error rates of both the training and the test set with respect to the number of components in the GMMs.
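A minimal sketch of such an evaluation loop is given below; the component counts are only an example choice, and the error computation reuses the expression shown above:

ncomps = [2 5 10 20 40];   % example component counts, choose your own
train_err = zeros(size(ncomps));
test_err = zeros(size(ncomps));
for i = 1:length(ncomps)
    S = train_gmm(train_data, train_class, ncomps(i));
    r_train = gmmb_decide(gmmb_normalize(gmmb_pdf(train_data, S)));
    r_test = gmmb_decide(gmmb_normalize(gmmb_pdf(test_data, S)));
    train_err(i) = length(find(r_train ~= train_class)) / length(train_class) * 100;
    test_err(i) = length(find(r_test ~= test_class)) / length(test_class) * 100;
end
figure
plot(ncomps, train_err, '-o', ncomps, test_err, '-x')
xlabel('Number of mixture components')
ylabel('Error (%)')
legend('train', 'test')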

Answer the following questions:

  1. Why are the recognition results with the train and the test set different?
  2. What is a good number of components for recognizing an unknown set of samples?

Use the model of your choice in the rest of the exercise.

Question 3

Using your best model, classify the test data (as above, but with test_data) and generate a confusion matrix using the provided function confusion_matrix, e.g.:

C = confusion_matrix(result, test_class);

Type help confusion_matrix for information about the matrix. You can visualize the confusion matrix with:

plot_confusion(C, phonemes)

Answer the following questions:

  1. Based on the confusion matrix, what can you conclude about phoneme recognition as a task and about the recognition performance for different phoneme classes?
  2. Give examples of difficulties this classifier has.
  3. Include the (visualized) confusion matrix with the answer.

Question 4

Variables tw1, tw2 and tw3 contain the feature representations of three Finnish words. Classify their features using your best model. Try to identify the words based on the classification result.
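One possible way to do this, assuming (as with train_data) that each row of the variables is one feature vector; tw1 is used here as the example and S is your chosen model:

r1 = gmmb_decide(gmmb_normalize(gmmb_pdf(tw1, S)));   % frame-by-frame classification
phonemes(r1)                                          % decode class numbers to labels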

Answer the following questions:

  1. What problems do you see in frame-based classification if one wants to recognize whole words?
  2. Describe ideas to improve the results.

Discriminative models

When we classified feature vectors above, we wanted to pick the most probable phoneme \(y\) given a feature vector \(x\). In other words, we wanted \(\mathop{\arg\max}\limits_y p(y | x) \). However, GMMs are a generative model: they learn \( p(x | y) \) and \( p(y) \). By the Bayes rule \( p(y | x) = p(x | y)p(y)/p(x) \), and since the denominator \( p(x) \) does not depend on \( y \), these can be used to find the most probable class: $$ \mathop{\arg\max}\limits_y p(y | x) = \mathop{\arg\max}\limits_y p(x | y)\,p(y) $$

We can also construct a model for \( p(y | x) \) directly. This type of model is called a discriminative model. Here's an analogy adapted from Andrew Ng (from here): say we want to classify animals as cats or dogs based on their silhouettes.

The generative approach is to learn to draw dog silhouettes and cat silhouettes. Look at dogs, make notes about prominent features: four legs, a tail. Look at cats: four legs, a tail. To classify new silhouettes, estimate if you'd be more likely to draw something similar when drawing a cat or a dog.

The discriminative approach is to just look at a bunch of dog and cat pairs and figure out what the differences are. Big whiskers: it's a cat. Large animal: it's a dog. But ask the discriminative model how many legs a dog should have? No idea.

Whether the generative approach or the discriminative approach works better depends on the task. Empirically, in most speech recognition tasks the discriminative approach gives better results.

Deep neural networks

There is no shortage of discriminative classifiers, such as logistic regression or support vector machines. But one particular family of discriminative methods has been especially successful: deep neural networks (DNNs). At the beginning of the 2010s, many machine learning fields, such as image classification and speech recognition, saw large leaps in state-of-the-art results from applying DNN methods. Nowadays almost all speech recognition systems replace GMMs with DNNs.

Although this course does not have time to go into the details of DNNs, it would seem strange not to mention them at all, since they are now used everywhere. Aalto has a popular course where you can learn more: CS-E4890 Deep Learning. For a clear introduction, see this video series.

If DNNs replace GMMs, what makes them better, then? There is nothing magical about them. We saw that they are a discriminative method. A generative approach assumes a model of how the data is generated; DNNs make fewer assumptions and instead learn distinguishing factors directly from the data, something they are especially good at. Many aspects of MFCCs have been manually engineered to work with the assumptions of GMMs; with DNNs, some of the feature extraction steps can be skipped, and the classifier learns more powerful representations in a data-driven way.

Now we will try a DNN classifier on our phoneme recognition task. Download and run this script: dnn.m. The training will take around six minutes. Matlab opens a nice visualisation that allows you to follow the progress. At the end of the script, a confusion matrix is plotted.

Question 5

DNNs can leverage large datasets, but here we have quite a small training set.

a) Which model performs classification better, the DNN or your best GMM?

b) The DNN training script tells you the number of parameters. Look inside your best model (the struct S in Matlab). Note that, strictly speaking, there is a separate GMM for each class (phoneme), so count the total number of parameters over all the classes. How many trained (estimated) parameters does your best GMM have? (Remember: we used GMMs with diagonal covariance matrices.) Which model type uses more parameters?

Sampling

Since the GMM is a generative model, we can sample some MFCCs from it. Sample 100 MFCCs from each phoneme class by running:

sampled_class = repelem(1:17, 100)';
sampled_data = gmmb_sample(sampled_class, S);

Notice that sampled_class contains the reference class numbers.

Now classify them with the GMM model, just like you did with train_data and test_data. Then classify them with the DNN. The DNN needs the data in a different shape (you can have a look in dnn.m to see why). Run:

% Reshape the samples into the 4-D array format the DNN's input layer expects:
% one 26x1x1 "image" per feature vector.
sampled_data_shaped = permute(reshape(sampled_data, size(sampled_data,1), size(sampled_data,2), 1, 1), [2 3 4 1]);
dnn_predictions = double(classify(network, sampled_data_shaped));
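With both sets of predictions available, the classification errors can be compared using the same expression as before. Here gmm_predictions stands for your GMM classification result of sampled_data (the name is illustrative), and both prediction vectors are assumed to be class numbers aligned with sampled_class:

gmm_err = length(find(gmm_predictions ~= sampled_class)) / length(sampled_class) * 100
dnn_err = length(find(dnn_predictions ~= sampled_class)) / length(sampled_class) * 100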

Question 6

Which model has lower classification error on the sampled MFCCs? Why might that be?


elec-e5510@aalto.fi