Study on Speaker Recognition

Keywords: speech recognition, speaker recognition, HTK, MATLAB, HMM, GMM, Mixture, Cepstra, Pitch

Reporter: Liu Bing

Abstract:

This research concerns speaker recognition in a two-speaker system. A Gaussian Mixture Model (GMM) is used to model each speaker. Two features, F0 and Linear Prediction Cepstral Coefficients (LPCEPSTRAL Coefficients), are extracted as the main characteristics of the speakers. After the models are formed, the dialogues are segmented into short sections, and each section is recognized against the speaker models.

Speaker recognition is the process of automatically recognizing who is speaking by using speaker-specific information contained in speech waves. There are two types of speaker recognition: Automatic Speaker Verification (ASV) and Automatic Speaker Identification (ASI). In ASV, speakers known to the system are customers, while unregistered speakers are impostors. ASI requires choosing which of N known voices best matches a test voice. This paper concerns ASV using GMMs.

Main tools: C language and HTK tools

HTK is a toolkit for building Hidden Markov Models (HMMs). HMMs can be used to model any time series.

Main Features: F0 and LPCEPSTRAL Coefficients

Recognizing speakers requires features extracted from the speech. Two main features are used: F0 and LPCEPSTRAL Coefficients. F0 is the fundamental frequency of vocal fold vibration, the physical aspect of speech corresponding to pitch.
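The report does not say which F0 algorithm was used. As an illustration only, the sketch below estimates F0 for one voiced frame by picking the autocorrelation peak, which is one common approach and not necessarily the author's. The function name `estimate_f0` and the 50 to 400 Hz search range are assumptions.

```python
import math


def estimate_f0(samples, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 in Hz of one voiced frame by autocorrelation peak picking.

    Assumption: the search range f0_min..f0_max is illustrative, not taken
    from the report. Returns 0.0 if no positive autocorrelation peak exists.
    """
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), len(samples) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        value = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    if best_lag is None:
        return 0.0
    return sample_rate / best_lag


# Usage: a synthetic 100 Hz tone sampled at 8 kHz (0.1 s frame).
tone = [math.sin(2.0 * math.pi * 100.0 * n / 8000.0) for n in range(800)]
f0 = estimate_f0(tone, 8000)
```

Real speech would normally need windowing and a voicing decision before this step; both are omitted here to keep the sketch short.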

Technique: GMM

The HMM is a popular statistical model in existing speaker recognition systems. The HMM method for automatic speech recognition (ASR) has been applied directly to ASI/V, replacing Markov models for words or phonemes with those for speakers. The Gaussian Mixture Model (GMM) is a special case of the continuous HMM in which the HMM consists of only one state and its output probability density function is a mixture of Gaussian densities. The HMM models not only the underlying speech sounds but also the temporal sequencing among these sounds. The individual component densities of a multivariate mixture density such as the GMM can model some underlying set of acoustic classes, and the mixture can form smooth approximations to arbitrarily shaped densities. For these reasons the GMM is used in this experiment.

Data: Six dialogues, each spoken by two different speakers, one male and one female. The dialogues are ST01, ST02, ST03, ST04, ST08 and ST12.
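To make the single-state view concrete, the following minimal sketch scores feature vectors under a GMM: the density is a weighted sum of Gaussian components, and a segment's score is the sum of per-frame log densities. Diagonal covariances, the function names, and the toy model values are assumptions made for illustration; the report does not state the covariance type or the number of mixtures.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log density of one feature vector under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # A diagonal Gaussian factorizes into independent 1-D densities.
        log_gauss = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(frame, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    # Log-sum-exp over components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def score_segment(frames, model):
    """Total log likelihood of a sequence of frames under one speaker model."""
    weights, means, variances = model
    return sum(gmm_log_likelihood(f, weights, means, variances)
               for f in frames)


# Usage with two hypothetical one-dimensional, two-component models.
model_a = ([0.5, 0.5], [[0.0], [1.0]], [[1.0], [1.0]])
model_b = ([0.5, 0.5], [[5.0], [6.0]], [[1.0], [1.0]])
frames = [[0.2], [0.7], [-0.1]]
best = max(("A", model_a), ("B", model_b),
           key=lambda item: score_segment(frames, item[1]))[0]
```

Because the model has a single state, the frame order does not affect the score, which is exactly the temporal information an HMM would add and a GMM discards.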

Data Preparation

1. Select the parts of each dialogue that contain the male's speech, and eliminate the unvoiced and noise parts, which are identified by pitch and energy.

2. Calculate the F0 of each utterance with a program that implements the F0 algorithm.

3. Get the LPCEPSTRAL Coefficients of each utterance, which can be calculated with HTK tools.

4. Combine the logarithm of F0 with the LPCEPSTRAL Coefficients and convert the result into LPCEPSTRAL form.
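Step 4 can be pictured as appending log F0 to each frame's cepstral vector. The sketch below assumes that reading, uses the natural logarithm, and drops frames whose F0 is not positive (treated as unvoiced); the function name `append_log_f0` and these choices are illustrative assumptions, and writing the result in HTK's file format is not shown.

```python
import math


def append_log_f0(cepstra, f0_values):
    """Append log(F0) to each frame's cepstral coefficients.

    cepstra: list of per-frame coefficient lists.
    f0_values: per-frame F0 in Hz; values <= 0 mark unvoiced frames,
    which are skipped (assumption, in line with step 1).
    """
    if len(cepstra) != len(f0_values):
        raise ValueError("cepstra and f0_values must have the same length")
    combined = []
    for coeffs, f0 in zip(cepstra, f0_values):
        if f0 <= 0.0:
            continue  # unvoiced frame: no defined log F0
        combined.append(list(coeffs) + [math.log(f0)])
    return combined


# Usage: two frames of 2 coefficients each; the second frame is unvoiced.
features = append_log_f0([[1.0, 2.0], [3.0, 4.0]], [100.0, 0.0])
```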

Form GMM Model, Training and Testing

The speakers whose models will be formed are ST01, ST02, ST03, ST04, ST08 and ST12.

1. Form GMM models of the speakers with HTK tools, using the feature files obtained above.

2. Build the recognizer.

Segmentation of the original dialogues

Each original dialogue was segmented into about 80 files by detecting pauses of 0.15 s whose energy is below 4% of the maximum energy of the whole dialogue. Each file then contains, approximately, speech from only one speaker. Each file is then recognized against the GMM speaker models formed earlier.
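The pause rule above can be sketched as follows, assuming the dialogue has already been reduced to one energy value per frame. The function name `segment_by_pauses`, the 10 ms frame step in the usage example, and the choice to trim trailing silence from each segment are assumptions; only the 0.15 s and 4% values come from the text.

```python
def segment_by_pauses(energies, frame_sec,
                      min_pause_sec=0.15, rel_threshold=0.04):
    """Split a dialogue into speech segments at sufficiently long pauses.

    Returns a list of (start, end) frame indices, end exclusive.
    A pause is a run of frames whose energy is below rel_threshold times
    the maximum energy and that lasts at least min_pause_sec.
    """
    if not energies:
        return []
    threshold = rel_threshold * max(energies)
    min_pause_frames = max(1, round(min_pause_sec / frame_sec))
    segments = []
    seg_start = None    # first frame of the segment being built
    last_speech = None  # last frame seen at or above the threshold
    quiet_run = 0       # length of the current low-energy run
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if seg_start is None:
                seg_start = i
            last_speech = i
            quiet_run = 0
        else:
            quiet_run += 1
            if seg_start is not None and quiet_run >= min_pause_frames:
                # The pause is long enough: close the current segment.
                segments.append((seg_start, last_speech + 1))
                seg_start = None
    if seg_start is not None:
        segments.append((seg_start, last_speech + 1))
    return segments


# Usage: 10 ms frames; a 0.2 s pause splits, a 0.03 s pause does not.
energies = [0] * 5 + [100] * 10 + [1] * 20 + [50] * 10 + [0] * 3 + [60] * 5
segments = segment_by_pauses(energies, 0.01)
```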

Recognition

For the male speaker, a recognition result looks like the following:

SP0136.rec

0 420000 ST01 -5561.198730 -16.551188

0 420000 ST03 -5595.554688 -16.653437

0 420000 ST04 -5599.363281 -16.664772

0 420000 ST02 -5604.058105 -16.678745

0 420000 ST08 -5611.431152 -16.700687

0 420000 ST12 -5646.538574 -16.805174

The recognition result is not good. To improve the accuracy, the ranking of the likelihoods will be considered. If the result is good, the method can be used to distinguish the customer from the operators. A common model of all the speakers will also be built to try to improve the accuracy.
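One way to look at the likelihood ranking is to parse the lines of a result file such as SP0136.rec and sort the models by score. The sketch below assumes the third column is the model name and the fourth is the total log likelihood; the meaning of the fifth column is not stated in the report, so it is ignored. The function name `rank_models` is hypothetical.

```python
def rank_models(rec_lines):
    """Return (model, score) pairs sorted from most to least likely.

    Assumes whitespace-separated fields: start, end, model name,
    total log likelihood, and possibly further columns.
    """
    scores = []
    for line in rec_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        scores.append((fields[2], float(fields[3])))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Usage with the lines listed for SP0136.rec.
rec_lines = [
    "0 420000 ST01 -5561.198730 -16.551188",
    "0 420000 ST03 -5595.554688 -16.653437",
    "0 420000 ST04 -5599.363281 -16.664772",
    "0 420000 ST02 -5604.058105 -16.678745",
    "0 420000 ST08 -5611.431152 -16.700687",
    "0 420000 ST12 -5646.538574 -16.805174",
]
ranking = rank_models(rec_lines)
# Gap between the best and second-best model, a rough confidence cue.
margin = ranking[0][1] - ranking[1][1]
```

A small margin between the top two models suggests the decision for that segment is weak, which is one way the ranking could be used.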

Conclusion

The speaker recognition system uses two features, LPCEPSTRAL Coefficients and F0. Combining them at the vector level kept the robustness and improved the recognition rate. A program was written to calculate F0 and perform the combination. Using these features, the speakers' models were formed and then used to recognize the dialogues, which had been segmented into files each spoken by only one speaker. Every file has now been recognized, but the result is not very good. I expect to obtain a better result in this speaker recognition work.