Speaker localisation and tracking ... take II.


As a continuation of the research carried out during my Ph.D., I have done some more work on the problem of localising and tracking a human speaker by means of a microphone array. One of the main flaws of currently existing algorithms for acoustic source tracking (including mine!) is that these methods are based on the assumption of a stationary speech signal. Typically, they do not account for the fact that the speech produced by a human speaker can contain significant periods of silence between the separate utterances. For a practical implementation based solely on the signals recorded at a series of microphones, this fact can be quite detrimental to the tracking performance.

Imagine for instance that the position of a speaker (e.g., a teacher in a classroom) is tracked using one of these methods. If the algorithm is accurate, everything goes well as long as the speaker emits speech. Now imagine that this speaker stops talking, walks another 2 or 3 metres while remaining silent, then resumes his/her presentation. Because the tracking method is based solely on audio signals, no position information is available during this extended period of silence. What will happen to the tracker during this period? How can it keep track of the silent "target"? This is indeed quite tricky for currently available tracking algorithms. Because the development of these methods does not account for these gaps in the speech signal, they will typically keep on tracking during silence gaps as if the speaker were still active. As a result of disturbances in the acoustic field (background noise and acoustic reverberation), the tracker will then be driven by noise (rather than a useful signal such as speech) and thus has a good chance of losing track of the target for an extended period of time, even after the speech signal resumes. That's basically why the existing speaker tracking literature presents algorithms that are always tested with speech signals having a minimal amount of silence in them (so as to avoid presenting bad results!).

The contribution made by our recent work on this topic is to use the output of a Voice Activity Detector (VAD) as part of the development of the source tracking method. Specifically, the VAD data is "fused" within the framework of the tracking algorithm, not simply used as a basic ON/OFF switch that would determine when the source location estimates are good/bad. As a result, the VAD data becomes an integral part of the tracking method. The details of this method are given in our paper [1].
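To make the difference between "fusing" and "switching" a bit more concrete, here is a minimal sketch of how a soft VAD output could enter the weight of a single particle. This is an illustration only and is not taken from [1]: the function names, the mixture form (Gaussian around the measured location, blended with a uniform density over the floor area) and the value of sigma are all assumptions made for the example.

```python
import math

# Floor area of a 3.5 m x 3.1 m room, in m^2 (used for the uniform density).
ROOM_AREA = 3.5 * 3.1


def gaussian_2d(particle, measurement, sigma):
    """Isotropic 2-D Gaussian density of the measurement around the particle."""
    dx = particle[0] - measurement[0]
    dy = particle[1] - measurement[1]
    norm = 2.0 * math.pi * sigma ** 2
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)) / norm


def fused_weight(particle, measurement, speech_prob, sigma=0.2):
    """Hypothetical particle weight with the VAD output fused in.

    speech_prob is an assumed soft VAD output in [0, 1]:
      1.0 -> the measurement is fully trusted (pure Gaussian likelihood),
      0.0 -> the measurement is ignored (uniform likelihood, every particle
             gets the same weight and only the motion model acts).
    Values in between blend the two, which is what distinguishes this
    from a hard ON/OFF switch.
    """
    uniform = 1.0 / ROOM_AREA
    return (speech_prob * gaussian_2d(particle, measurement, sigma)
            + (1.0 - speech_prob) * uniform)
```

With speech_prob close to 0, a spurious location estimate caused by noise or reverberation barely changes the particle weights, so the tracker is not dragged away from the silent speaker.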

We have also done some practical experiments using a real speaker moving randomly in a room fitted with an array of 8 microphones. An exact description of the environmental setup can be found in [2,3], but basically, the room dimensions were 3.5 m x 3.1 m x 2.2 m, the average reverberation time T60 was about 0.27 s, and the average SNR about 20 dB. Some movies have been recorded to demonstrate the results of this algorithm (these movies were created by Anders Johansson, who has also contributed to the development of this algorithm):
  1. Movie #1 pfvad1.avi (850kB)
  2. Movie #2 pfvad2.avi (850kB)
  3. Movie #3 pfvad3.avi (850kB)
  4. Movie #4 pfvad4.avi (900kB)
The black dot represents the true source position and the red star is the speaker position estimate delivered by the tracker. These movies show how the area of uncertainty (the ellipse drawn around the source location estimate) becomes bigger whenever the speaker is silent. This is a direct result of fusing the VAD output within the tracking method: it enables the algorithm to keep track of the silent source (by considering any potential movement that the target might make during the silence period) and then to resume tracking successfully when the speech signal starts again. (BTW, if you're wondering how we obtained the ground-truth data of the true speaker position, check out our paper [2]!)
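The growing and shrinking of the uncertainty region can be reproduced with a toy example. The sketch below is a deliberately simplified 1-D bootstrap particle filter, not the algorithm of [1]: the random-walk motion model, the Gaussian/uniform likelihood mixture, the function name track and all parameter values are assumptions chosen purely for illustration.

```python
import math
import random


def track(speech_probs, measurements, n=500, step_std=0.1, sigma=0.2,
          length=3.5, seed=0):
    """Toy 1-D particle filter; returns the particle spread (std) per frame.

    speech_probs: assumed soft VAD output in [0, 1] for each frame.
    measurements: one (possibly spurious) position estimate per frame, in m.
    """
    rng = random.Random(seed)
    particles = [rng.uniform(0.0, length) for _ in range(n)]
    spreads = []
    for q, z in zip(speech_probs, measurements):
        # Prediction: random-walk motion, clipped to the room boundaries.
        particles = [min(max(p + rng.gauss(0.0, step_std), 0.0), length)
                     for p in particles]
        # Update: Gaussian likelihood blended with a uniform one via the VAD.
        weights = [q * math.exp(-(p - z) ** 2 / (2.0 * sigma ** 2))
                   / (sigma * math.sqrt(2.0 * math.pi))
                   + (1.0 - q) / length
                   for p in particles]
        # Resampling: draw n particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n)
        mean = sum(particles) / n
        spreads.append(math.sqrt(sum((p - mean) ** 2 for p in particles) / n))
    return spreads


# 20 frames of speech at 1.0 m, 20 silent frames with a spurious estimate,
# then 20 frames of speech at 2.0 m (the speaker moved while silent).
speech_probs = [1.0] * 20 + [0.0] * 20 + [1.0] * 20
measurements = [1.0] * 20 + [3.0] * 20 + [2.0] * 20
spreads = track(speech_probs, measurements)
print("end of speech:   %.2f m" % spreads[19])
print("end of silence:  %.2f m" % spreads[39])
print("speech resumed:  %.2f m" % spreads[59])
```

During the silent frames the measurement carries no weight, so only the motion model acts and the particle cloud widens frame after frame; once speech resumes, the cloud is wide enough to still cover the new speaker position and contracts around it again.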

This method currently only works for a single speaker (single-target tracking). The aim is however to take this research further and, based on this first implementation, to develop a new algorithm using VAD data computed "locally" (i.e. in the direct vicinity of the speaker rather than globally across the entire room). This new approach could then be applied in a straightforward manner to the problem of multiple-source tracking, which is a fairly difficult engineering problem in general, and even more so for audio-only, speech-based applications.


References

[1] Eric A. Lehmann and Anders M. Johansson, Particle filter with integrated voice activity detection for acoustic source tracking, EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 50870, 11 pages, 2007.
[2] Eric A. Lehmann and Anders M. Johansson, Experimental Performance Assessment of a Particle Filter with Voice Activity Data Fusion for Acoustic Speaker Tracking, Proceedings of the IEEE Nordic Signal Processing Symposium (NORSIG'06), pp. 126-129, Reykjavik, Iceland, June 2006.
[3] Anders M. Johansson, Eric A. Lehmann, and Sven Nordholm, Real-time implementation of a particle filter with integrated voice activity detector for acoustic speaker tracking, Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS'06), pp. 1004-1007, Singapore, December 2006.

