Speech Augmentation

A popular saying in machine learning is "there is no better data than more data". However, collecting new data can be expensive, so we must use the available dataset cleverly. One popular technique is called speech augmentation. The idea is to artificially corrupt the original speech signals to give the network the "illusion" that it is processing a new signal. This acts as a powerful regularizer that normally helps neural networks improve generalization and thus achieve better performance on test data.
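To make the idea concrete, here is a minimal sketch of two common corruptions: additive noise at a target signal-to-noise ratio, and speed perturbation via resampling. The helper names (`add_noise`, `speed_perturb`) are hypothetical illustrations written with NumPy, not the API of any particular toolkit.

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    """Corrupt a waveform with white noise at a target SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(signal))
    # Scale the noise so that signal power / noise power matches the target SNR.
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

def speed_perturb(signal, factor):
    """Resample the waveform so it plays back `factor` times faster."""
    old_idx = np.arange(len(signal))
    new_idx = np.arange(0, len(signal), factor)
    return np.interp(new_idx, old_idx, signal)

# Toy example: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 1000 * t)
noisy = add_noise(clean, snr_db=10)   # same tone, buried in noise
fast = speed_perturb(clean, 1.1)      # 10% faster, hence slightly shorter
```

Each call produces a signal the network has never seen, while the underlying speech content stays the same.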

Open in Google Colab

Fourier Transform and Spectrograms

In speech and audio processing, the signal in the time domain is often transformed into another domain. Ok, but why do we need to transform an audio signal? Some speech characteristics/patterns of the signal (e.g., pitch, formants) might not be very evident when looking at the audio in the time domain. With properly designed transformations, it might be easier to extract the needed information from the signal itself. The most popular transformation is the Fourier Transform, which turns the time-domain signal into an equivalent representation in the frequency domain. In the following sections, we will describe the Fourier Transform along with other related transformations such as the Short-Term Fourier Transform (STFT) and spectrograms.
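As a small sketch of these transforms (assuming SciPy; the parameter values below are illustrative choices, not prescribed settings), the STFT applies windowed FFTs over overlapping frames, and the spectrogram is the squared magnitude of the result:

```python
import numpy as np
from scipy import signal as sps

sr = 16000
t = np.arange(sr) / sr
# A chirp whose frequency rises over time: hard to read in the time domain,
# but clearly visible as a rising ridge in the frequency domain.
x = sps.chirp(t, f0=100, f1=4000, t1=1.0)

# Short-Term Fourier Transform: a 25 ms window (400 samples at 16 kHz)
# hopped every 10 ms (160 samples, i.e. 240 samples of overlap).
freqs, frame_times, Zxx = sps.stft(x, fs=sr, nperseg=400, noverlap=240)

# The spectrogram is the squared magnitude of the STFT.
spec = np.abs(Zxx) ** 2
print(spec.shape)  # (frequency bins, time frames)
```

With `nperseg=400`, the STFT yields `400 // 2 + 1 = 201` frequency bins per frame, covering 0 Hz up to the Nyquist frequency (8 kHz here).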

Open in Google Colab

Environmental corruption

In realistic speech processing applications, the signal recorded by the microphone is corrupted by noise and reverberation. This is particularly harmful in distant-talking (far-field) scenarios, where the speaker is far from the reference microphone (think of popular devices such as Google Home, Amazon Echo, or Kinect).
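A common way to simulate this corruption is to convolve the clean waveform with a room impulse response (RIR) and then add noise at a chosen SNR. Below is a minimal NumPy sketch; the synthetic exponentially decaying RIR and the helper names (`reverberate`, `corrupt`) are illustrative assumptions, since real setups would use measured or simulated impulse responses and recorded noise.

```python
import numpy as np

def reverberate(signal, rir):
    """Simulate reverberation by convolving with a room impulse response."""
    return np.convolve(signal, rir)[: len(signal)]

def corrupt(signal, rir, noise, snr_db):
    """Reverberate the signal, then add noise at the target SNR (in dB)."""
    rev = reverberate(signal, rir)
    noise = noise[: len(rev)]
    scale = np.sqrt(np.mean(rev ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return rev + scale * noise

rng = np.random.default_rng(0)
sr = 16000
# Toy RIR: white noise shaped by an exponential decay (a crude room model).
rir = rng.standard_normal(sr // 4) * np.exp(-np.arange(sr // 4) / 800.0)
rir /= np.abs(rir).max()

clean = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
noise = rng.standard_normal(sr)
far_field = corrupt(clean, rir, noise, snr_db=5)
```

The result sounds like the same utterance captured by a distant microphone in a noisy room, which is exactly the condition far-field models must be robust to.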

Open in Google Colab