1. Sampling and Quantisation
Speech signals are typically sampled in time and quantised in amplitude: sampling converts the continuous-time signal into a discrete sequence of values, and quantisation maps each sample onto one of a finite set of amplitude levels. The combined process is known as pulse code modulation (PCM).
The dynamic range $R$ of the system is the range of numbers that represent a signal’s amplitude. With $n$-bit quantisation, $2^n$ values can be represented; the ratio of the largest representable amplitude to the smallest non-zero step is $2^n-1$, so the dynamic range is \(R = 20\log_{10}(2^n-1)\text{ dB}\) According to the Nyquist–Shannon sampling theorem, the Nyquist rate is twice the highest frequency present in the signal. If the sampling rate/frequency $f_s$ (samples/second) is lower than the Nyquist rate, aliasing occurs: energy at frequencies above half the sampling rate is folded back into the lower frequencies.
To avoid aliasing, it is usual to low-pass filter a signal before sampling, with a cut-off frequency $f_c$ satisfying $f_c < 0.5f_s$.
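As a small numerical sketch (plain Python; the 10 kHz sampling rate, 7 kHz tone, and 16-bit word length are illustrative values, not from the text), the samples of a tone above the Nyquist frequency are indistinguishable from those of its alias, and the dynamic-range formula can be evaluated directly:

```python
import math

fs = 10_000             # sampling rate (Hz); assumption for illustration
f_high = 7_000          # tone above the Nyquist frequency fs/2 = 5 kHz
f_alias = fs - f_high   # 3 kHz: the frequency it is folded back to

# At fs = 10 kHz, samples of a 7 kHz cosine equal those of a 3 kHz cosine,
# since cos(2*pi*7000*n/10000) = cos(2*pi*n - 2*pi*3000*n/10000).
for n in range(8):
    x_high = math.cos(2 * math.pi * f_high * n / fs)
    x_low = math.cos(2 * math.pi * f_alias * n / fs)
    assert abs(x_high - x_low) < 1e-9

# Dynamic range of n-bit quantisation: R = 20*log10(2**n - 1)
n_bits = 16
R = 20 * math.log10(2 ** n_bits - 1)
print(round(R, 1))  # ≈ 96.3 dB for 16-bit audio
```

The ~6 dB-per-bit rule of thumb for quantisation follows from the same formula.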
2. Block Processing
- Sample rate $f_s$ (Hz, samples per second)
- Sample period $T$ (seconds per sample): \(T = \cfrac{1}{f_s}\)

Digital signal processing is typically performed on fixed-length sequences of quantised samples called blocks or frames.
- frame size $N$: number of samples per frame.
- $NT$: seconds per frame (used to express the frame size in time).
- frame shift $R$: number of samples between the start of successive frames.
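A minimal framing sketch in plain Python (the function name and the 8 kHz / 30 ms / 10 ms values are illustrative assumptions, not from the text):

```python
def frame_signal(x, N, R):
    """Split samples x into frames of N samples, advancing R samples per frame.

    Frames that would run past the end of x are dropped (a common convention;
    zero-padding the final partial frame is an alternative).
    """
    return [x[i:i + N] for i in range(0, len(x) - N + 1, R)]

fs = 8_000            # sample rate (Hz); assumption for illustration
N = 240               # frame size: N*T = 240/8000 = 30 ms per frame
R = 80                # frame shift: 10 ms between successive frame starts
x = [0.0] * fs        # one second of (silent) signal
frames = frame_signal(x, N, R)
print(len(frames))    # 98 overlapping frames fit in one second
```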
3. Waveform Processing
Short-time energy is the sum of squares of samples in one frame.
Zero-Crossing rate (ZCR) is the number of times the zero axis is crossed in one frame.
The autocorrelation function (ACF) computes the correlation of a signal with itself as a function of time lag.
- Emphasises periodicity
- Basis for many spectral analysis methods
- The short-time ACF (STACF) is the basis for many pitch detectors, i.e. fundamental frequency estimators
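The three measures above can be sketched in plain Python; the 100 Hz test tone, 8 kHz sampling rate, and lag search range are illustrative assumptions, chosen so the STACF peaks at the pitch period:

```python
import math

def short_time_energy(frame):
    # Sum of squared samples over one frame.
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    # Number of sign changes between successive samples in the frame.
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def stacf(frame, lag):
    # Short-time autocorrelation at a given lag (in samples).
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

# A 100 Hz "voiced" tone sampled at 8 kHz has a period of 80 samples,
# so the STACF peaks again at lag 80, giving a pitch estimate.
fs, f0, N = 8_000, 100, 400
frame = [math.sin(2 * math.pi * f0 * n / fs) for n in range(N)]
peak_lag = max(range(20, 200), key=lambda k: stacf(frame, k))
print(peak_lag)  # 80 -> estimated F0 = fs / 80 = 100 Hz
```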
3.1 Speech/Non-Speech Detection
A simple speech/nonspeech detector is based on measurements of short-time energy and ZCR.
- energy is high in voiced speech
- ZCR is high in unvoiced speech
3.2 Voiced/Unvoiced Detection
STACF and ZCR are used to construct a voiced/unvoiced detector.
- autocorrelation is periodic in voiced speech
- ZCR is high in unvoiced speech
The correlation between two discrete-time signals $s$ and $t$ over an $N$-point interval is \(q = \sum_{i=0}^{N-1}s_i \cdot t_i\) The cosine correlation $q_c$ between $s(nT) = A\cos(\omega_s nT)$ and $\cos(\omega_t nT)$, for $A\in \mathbb{R}, n\in \mathbb{N}$, is
\[q_c = \begin{cases} \alpha A, & \omega_s=\omega_t \\ 0, & \text{otherwise} \end{cases}\]Based on Fourier analysis, cosine correlation can be used to extract the cosine components of an arbitrary signal if and only if
\[\begin{eqnarray} \omega_t &=& \cfrac{2\pi p}{NT} \qquad \text{for}\ p=0, 1, ..., N-1 \\ f_t &=& \cfrac{p}{NT} \end{eqnarray}\]So the spectrum computed by cosine correlation is
\[S_p = \sum_{n=0}^{N-1} s_n \cdot \cos(\cfrac{2\pi np}{N})\qquad p=0, 1, ..., N-1\]
4. Sine and cosine correlation
Correlating a sinusoidal component of phase $\phi$ with a cosine yields $\alpha \cos(\phi)$, while correlating it with a sine yields $\alpha \sin(\phi)$.
The amplitude of a sinusoidal component, independent of its phase, is given by
\[\alpha = \sqrt{(\text{cosine correlation})^2 + (\text{sine correlation})^2}\]The phase of this component is given by
\[\phi = \tan^{-1} \cfrac{\text{sine correlation}}{\text{cosine correlation}}\]
5. General Spectral Analysis Algorithm - Discrete Fourier transform (DFT)
Consider a single frame of size $N$ with sample period $T$, containing samples $s(nT)$. The angular frequency of the fundamental sinusoid is $\omega= \cfrac{2\pi}{NT}$, and its frequency is $f = \cfrac{\omega}{2\pi} = \cfrac{1}{NT}$. $\Omega_p$ denotes the frequency $p\omega$, where $p = 0, 1, …, N-1$. The cosine correlation $c(\Omega_p)$ and the sine correlation $s(\Omega_p)$ are given by
\[\begin{eqnarray} c(\Omega_p) &=& \sum_{n=0}^{N-1} s(nT) \cos(\Omega_p \cdot nT) \\ s(\Omega_p) &=& \sum_{n=0}^{N-1} s(nT) \sin(\Omega_p \cdot nT) \end{eqnarray}\]The amplitude/magnitude and phase of the individual components at $\Omega_p$ are
\[\begin{eqnarray} a_p &=& \sqrt{c(\Omega_p)^2 + s(\Omega_p)^2} \\ \phi_p &=& \tan^{-1} (\cfrac{s(\Omega_p)}{c(\Omega_p)}) \end{eqnarray}\]The DFT is often expressed using complex number notation.
\[\begin{eqnarray} S_p &=& c(\Omega_p) - js(\Omega_p) = \sum_{n=0}^{N-1} s(nT)e^{-j(\Omega_p \cdot nT)} \end{eqnarray}\]The inverse DFT is given by
\[\begin{eqnarray} s(nT) &=& \cfrac{1}{N}\sum_{p=0}^{N-1}S_pe^{j(\Omega_p \cdot nT)} \end{eqnarray}\]This algorithm assumes the signal is periodic outside the analysis frame, with a period equal to the frame length. If the frame does not contain a whole number of periods, the assumed periodic extension is discontinuous at the frame boundaries, and unwanted spectral components appear. These cause distortion.
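A sketch in Python/NumPy (variable names and the test frame are illustrative) that builds $S_p$ from the cosine and sine correlations, checks it against `numpy.fft.fft` (noting that $\Omega_p \cdot nT$ reduces to $2\pi np/N$), and applies the inverse DFT:

```python
import cmath
import math

import numpy as np

N = 16
# Arbitrary real test frame; values are illustrative.
s = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(N)]

# Cosine and sine correlations at each analysis frequency Omega_p.
c = [sum(s[n] * math.cos(2 * math.pi * n * p / N) for n in range(N))
     for p in range(N)]
q = [sum(s[n] * math.sin(2 * math.pi * n * p / N) for n in range(N))
     for p in range(N)]

# S_p = c(Omega_p) - j*s(Omega_p) is exactly the DFT of the frame.
S = [c[p] - 1j * q[p] for p in range(N)]
assert np.allclose(S, np.fft.fft(s))

# Magnitude and phase of each component (atan2 handles all quadrants).
a = [math.sqrt(c[p] ** 2 + q[p] ** 2) for p in range(N)]
phi = [math.atan2(q[p], c[p]) for p in range(N)]

# The inverse DFT recovers the original frame.
s_rec = [sum(S[p] * cmath.exp(2j * math.pi * n * p / N) for p in range(N)) / N
         for n in range(N)]
assert np.allclose(s_rec, s)
print("DFT/IDFT round trip OK")
```

The FFT of the next section computes the same $S_p$, only in $O(N \log N)$ rather than $O(N^2)$ operations.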
6. Distortion and Windowing
The distortion can be reduced by multiplying each signal frame by a window function that tapers towards zero at the frame boundaries, so that the periodic extension is continuous. However, windowing not only attenuates the components caused by the discontinuity, but also smears the spectral peaks.
The Hamming window is the most common window function, given by
\[w(nT) = 0.54 - 0.46\cos(\cfrac{2\pi n}{N-1})\]
7. General Spectral Analysis Algorithm - The Fast Fourier transform (FFT)
The (radix-2) FFT requires the window/frame size to be a power of 2. This can be achieved in one of two ways:
- Choosing the appropriate analysis frame size
- Zero-padding a frame to the next power of 2
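Both the windowing and padding steps can be sketched in plain Python (function names and the 240-sample frame are illustrative assumptions):

```python
import math

def hamming(N):
    # Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1.
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def zero_pad_pow2(frame):
    # Pad with zeros up to the next power of two, for a radix-2 FFT.
    M = 1
    while M < len(frame):
        M *= 2
    return frame + [0.0] * (M - len(frame))

frame = [1.0] * 240          # a 240-sample frame (30 ms at 8 kHz)
w = hamming(len(frame))
windowed = [x * wn for x, wn in zip(frame, w)]
padded = zero_pad_pow2(windowed)
print(len(padded))           # 256, the next power of two above 240
```

Note that zero-padding changes the number of spectral samples, not the underlying spectral resolution, which is set by the analysis frame length $NT$.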
8. The Z transform
// TO DO
9. Digital Coding
Digital signals are characterised in terms of data rate (bits/second, bps). \(\text{data rate} = \text{amplitude quantisation (bits/sample)} \times \text{sampling rate (samples/second)}\) Speech has a bandwidth of ~10 kHz and a dynamic range of ~50 dB. Hence, the minimum digitisation requires a 20 kHz sampling rate and 8-bit quantisation, giving a data rate of ~160 kbps.
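The data-rate arithmetic as a one-line check (values taken from the estimate above):

```python
# Data rate = amplitude quantisation (bits/sample) * sampling rate (samples/s).
bits_per_sample = 8       # ~50 dB dynamic range: 20*log10(255) ≈ 48 dB
sampling_rate = 20_000    # twice the ~10 kHz speech bandwidth (Nyquist rate)
data_rate = bits_per_sample * sampling_rate
print(data_rate)          # 160000 bps = 160 kbps
```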
The information rate in speech is estimated to be only ~100 bps, containing ~50 bps linguistic information and ~50 bps paralinguistic information.
Digital speech codecs make good use of lossy compression schemes by exploiting the source-filter model of speech. The way to code signals at lower rates is to exploit redundancies in the signal. In speech processing, this is achieved using predictive models, which ultimately underpin speech recognition and speech synthesis.
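As an illustrative sketch of the predictive idea (not a full codec; the 200 Hz test signal and first-order predictor are assumptions for the demo), predicting each sample from the previous one leaves a residual with far less energy than the raw signal, which is what allows it to be coded with fewer bits:

```python
import math

# A strongly correlated "speech-like" signal: a 200 Hz tone at 8 kHz.
fs, f0 = 8_000, 200
x = [math.sin(2 * math.pi * f0 * n / fs) for n in range(400)]

# Least-squares first-order predictor x[n] ≈ a * x[n-1].
a = (sum(x[n] * x[n - 1] for n in range(1, len(x)))
     / sum(v * v for v in x[:-1]))

# Code the prediction residual instead of the raw samples.
residual = [x[n] - a * x[n - 1] for n in range(1, len(x))]

energy_x = sum(v * v for v in x[1:])
energy_e = sum(v * v for v in residual)
print(energy_e < 0.1 * energy_x)  # True: the residual carries far less energy
```

Practical codecs extend this to higher-order linear prediction, with the predictor coefficients modelling the vocal-tract filter of the source-filter model.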