F0SIFT

From STX Wiki

F0SIFT - f0 extraction with autocorrelation

Usage:

F0SIFT X SR FMIN FMAX MODE TABLE COL

Inputs:
X signal vector (speech)
SR sampling rate in Hz
FMIN minimum f0 in Hz
FMAX maximum f0 in Hz
MODE tracking mode (0|NO = disabled, 1|YES = enabled)
TABLE name of the output shell table (must be an extended table; see NEW TABLE)
COL index of the table column (the column must be of type NUMBER)
Outputs:
F0 f0 value in Hz or 0 if no f0 was found
Function:

This atom uses an autocorrelation method to extract the fundamental frequency (f0) from the speech signal X. The method is a modified version of the "simplified inverse filter tracking" (SIFT) algorithm published by J. D. Markel and A. H. Gray ("Linear Prediction of Speech", Springer, 1976, p. 206).

1. The speech signal is low-pass filtered and downsampled. Downsampling reduces the number of speech formants in the signal to 2. Because the extraction algorithm requires a minimum signal bandwidth of 2·FMAX, the downsampling factor is selected as follows:

   signal bandwidth b = max(2·FMAX, 2000 Hz)
   downsampling factor d = int(SR / (2·b))

2. The inverse filter coefficients are computed using the LPC method. The inverse filter (order 4) is applied to the downsampled speech signal to remove the formant structure.

3. The pitch period (= 1/f0) is measured in the autocorrelation function of the filtered signal. The location of the highest autocorrelation peak in the range 1/FMAX..1/FMIN is taken as the pitch period. To improve the frequency resolution, the peak location is refined by parabolic interpolation.

4. A tracking and correction procedure corrects the pitch value (e.g. octave jumps) and removes incorrectly classified voiced/unvoiced frames. This procedure uses the last 3 frames.
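
The inverse filtering and pitch period measurement described above can be illustrated with a minimal Python sketch. It is not the STX implementation: it assumes the frame has already been low-pass filtered and downsampled, it assumes the autocorrelation variant of LPC (the page only says "the LPC method"), and it omits the tracking and correction procedure. All function names (`autocorr`, `lpc_inverse_filter`, `apply_inverse_filter`, `pick_f0`, `sift_frame`) are hypothetical, and returning 0 when the best peak is not positive is an assumed stand-in for the atom's undocumented voicing decision.

```python
import math

def autocorr(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def lpc_inverse_filter(x, order=4):
    """Coefficients [1, a1..a_order] of the inverse filter A(z).
    Assumption: autocorrelation method, solved by Levinson-Durbin."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable input
            break
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a

def apply_inverse_filter(a, x):
    """FIR filtering e[n] = sum_j a[j] * x[n - j] with zero initial state."""
    return [sum(a[j] * x[n - j] for j in range(min(len(a), n + 1)))
            for n in range(len(x))]

def pick_f0(r, sr, fmin, fmax):
    """Highest peak of r for lags sr/fmax..sr/fmin, refined by a parabola."""
    lo = max(1, math.ceil(sr / fmax))
    hi = min(len(r) - 2, int(sr / fmin))
    if hi < lo:
        return 0.0
    k = max(range(lo, hi + 1), key=lambda i: r[i])
    if r[k] <= 0.0:  # assumed "no f0 found" condition
        return 0.0
    a, b, c = r[k - 1], r[k], r[k + 1]
    denom = a - 2.0 * b + c
    delta = 0.0
    if b >= a and b >= c and denom < 0.0:
        # Vertex of the parabola through (-1, a), (0, b), (1, c).
        delta = 0.5 * (a - c) / denom
    return sr / (k + delta)

def sift_frame(x, sr, fmin, fmax):
    """f0 estimate in Hz for one downsampled frame x at sampling rate sr."""
    a = lpc_inverse_filter(x, 4)
    e = apply_inverse_filter(a, x)
    r = autocorr(e, int(sr / fmin) + 1)
    return pick_f0(r, sr, fmin, fmax)

# Synthetic frame: a 100 Hz pulse train through a 2-pole resonator at 2 kHz.
sr = 2000
x = []
for n in range(400):
    pulse = 1.0 if n % 20 == 0 else 0.0
    y1 = x[n - 1] if n >= 1 else 0.0
    y2 = x[n - 2] if n >= 2 else 0.0
    x.append(pulse + 1.3 * y1 - 0.8 * y2)
f0 = sift_frame(x, sr, 50, 250)
```

The lag range is expressed in samples at the rate passed to `sift_frame`, so the downsampled rate must be used there, not the original SR.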

The tracking and correction procedure (step 4) can be enabled or disabled via the input MODE (0|NO = disabled, 1|YES = enabled). If tracking is enabled, the output values are delayed by two frames and the last 2 stored values are always set to zero, because the tracking needs 3 frames. If the inputs TABLE and COL are connected, every value written to the output F0 is also stored in column COL of the table, starting at entry 0.

Notes:

To reproduce the original SIFT algorithm described in "Linear Prediction of Speech", use the following parameter settings:

sampling rate SR=10000 (10kHz)
frequency range FMIN=50, FMAX=250 (50..250Hz)
tracking enabled MODE=1
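
As a worked example, the downsampling rule from the Function section can be applied to these settings. The sketch below only evaluates the two formulas given on this page. The function name `downsampling_factor` is hypothetical, and treating SR / d as the sampling rate after downsampling is an assumption about how the factor is applied.

```python
def downsampling_factor(sr, fmax):
    """Return (b, d) following the two rules in the Function section."""
    b = max(2.0 * fmax, 2000.0)  # signal bandwidth in Hz
    d = int(sr / (2.0 * b))      # integer part, as in the formula
    return b, d

sr, fmin, fmax = 10000, 50, 250
b, d = downsampling_factor(sr, fmax)

sr_down = sr / d           # assumption: sampling rate after downsampling
lag_min = sr_down / fmax   # shortest pitch period in samples (1/FMAX)
lag_max = sr_down / fmin   # longest pitch period in samples (1/FMIN)

print(b, d, sr_down, lag_min, lag_max)
# prints: 2000.0 2 5000.0 20.0 100.0
```

For sampling rates below 2·b the formula yields d = 0. The page does not say how the atom treats that case, so the sketch leaves it unhandled.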
