F0SIFT
F0SIFT - f0 extraction with autocorrelation
Usage:
F0SIFT X SR FMIN FMAX MODE TABLE COL
Inputs:
X | signal vector (speech) |
SR | sampling rate in Hz |
FMIN | minimum f0 in Hz |
FMAX | maximum f0 in Hz |
MODE | tracking mode (0 or NO = disabled, 1 or YES = enabled) |
TABLE | name of the output shell table; this must be an extended table (see NEW TABLE) |
COL | index of the table column; this column must be of type NUMBER |
Outputs:
F0 | f0 value in Hz or 0 if no f0 was found |
Function:
This atom uses an autocorrelation method to extract the fundamental frequency (f0) from the speech signal X. The method is a modified version of the "simplified inverse filter tracking" (SIFT) algorithm published by J. D. Markel and A. H. Gray (1976), "Linear Prediction of Speech", Springer, p. 206.
1. The speech signal is low-pass filtered and downsampled. Downsampling reduces the number of speech formants in the signal to 2. Because the extraction algorithm requires a signal bandwidth of at least 2*FMAX, the downsampling factor is selected as follows:
signal bandwidth | b = max(2*FMAX, 2000 Hz) |
downsampling factor | d = int(SR / (2*b)) |
2. The inverse filter coefficients are computed using the LPC method. The inverse filter (order 4) is applied to the downsampled speech signal to remove the formant structure.
3. The pitch period (= 1/f0) is measured in the autocorrelation function of the inverse-filtered signal: it is taken as the location of the highest autocorrelation peak in the range 1/FMAX..1/FMIN. To improve the frequency resolution, the peak location is refined by parabolic interpolation.
4. A tracking and correction procedure corrects errors in the pitch value (e.g. octave jumps) and removes incorrect voiced/unvoiced decisions. This procedure uses the last 3 frames.
Step 4 (tracking and correction) can be enabled or disabled via the input MODE (0|NO = disabled, 1|YES = enabled). If tracking is enabled, the output values are delayed by two frames and the last 2 stored values are always set to zero, because the tracking needs 3 frames. If the inputs TABLE and COL are connected, every value written to the output F0 is also stored in column COL of the table, starting at entry 0.
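The per-frame analysis described above (low-pass filtering and downsampling, order-4 LPC inverse filtering, autocorrelation peak picking with parabolic interpolation) can be illustrated with the following Python/NumPy sketch. It is not the atom's implementation. The function names, the windowed-sinc low-pass with its cutoff, the Hamming window applied before the LPC analysis, the small regularization term and the `voicing_threshold` parameter are all assumptions made for illustration. The tracking and correction step is omitted because its details are not specified here.

```python
import numpy as np


def downsampling_factor(sr, fmax):
    """d = int(SR / (2*b)) with b = max(2*FMAX, 2000 Hz), as stated above."""
    b = max(2.0 * fmax, 2000.0)
    # Clamping to 1 for low sampling rates is an assumption of this sketch.
    return max(1, int(sr / (2.0 * b)))


def lowpass_decimate(x, sr, d, half_len=50):
    """Windowed-sinc low-pass filter, then keep every d-th sample."""
    x = np.asarray(x, dtype=float)
    if d == 1:
        return x
    # Assumed cutoff: 80 % of the new Nyquist frequency, in cycles per sample.
    fc = 0.8 * (sr / (2.0 * d)) / sr
    n = np.arange(-half_len, half_len + 1)
    h = 2.0 * fc * np.sinc(2.0 * fc * n) * np.hamming(n.size)
    h /= h.sum()  # unity gain at DC
    return np.convolve(x, h, mode="same")[::d]


def inverse_filter(y, order=4):
    """LPC by the autocorrelation method, then inverse filtering."""
    r = np.array([np.dot(y[: y.size - k], y[k:]) for k in range(order + 1)])
    if r[0] <= 0.0:
        return None  # silent frame, nothing to analyse
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    # Toeplitz normal equations; the tiny ridge term is an assumption.
    R = r[idx] + 1e-9 * r[0] * np.eye(order)
    a = np.linalg.solve(R, -r[1:])
    # Inverse filter A(z) = 1 + a1*z^-1 + ... + a4*z^-4
    return np.convolve(y, np.concatenate(([1.0], a)))[: y.size]


def sift_frame(x, sr, fmin, fmax, voicing_threshold=0.4):
    """Return an f0 estimate in Hz for one frame, or 0.0 if none is found."""
    d = downsampling_factor(sr, fmax)
    fs = sr / d  # sampling rate after downsampling
    y = lowpass_decimate(x, sr, d)
    y = y * np.hamming(y.size)  # assumed analysis window
    e = inverse_filter(y)
    if e is None:
        return 0.0
    ac = np.correlate(e, e, mode="full")[e.size - 1:]
    lo = max(1, int(np.ceil(fs / fmax)))             # lag of 1/FMAX
    hi = min(int(np.floor(fs / fmin)), ac.size - 2)  # lag of 1/FMIN
    if hi <= lo or ac[0] <= 0.0:
        return 0.0
    k = lo + int(np.argmax(ac[lo:hi + 1]))
    if ac[k] / ac[0] < voicing_threshold:  # hypothetical voicing decision
        return 0.0
    # Parabola through (k-1, k, k+1); its vertex refines the peak location.
    denom = ac[k - 1] - 2.0 * ac[k] + ac[k + 1]
    delta = 0.5 * (ac[k - 1] - ac[k + 1]) / denom if denom < 0.0 else 0.0
    return fs / (k + delta)
```

A call such as `sift_frame(frame, 16000, 50, 400)` analyses one frame of samples. With SR = 16000 and FMAX = 400 the stated formulas give b = 2000 Hz and d = 4, so the peak search runs at a 4 kHz sampling rate.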
Notes:
To simulate the original SIFT algorithm described in "Linear Prediction of Speech", use the following parameter settings:
sampling rate | SR=10000 (10kHz) |
frequency range | FMIN=50, FMAX=250 (50..250Hz) |
tracking enabled | MODE=1 |