Acoustic source localisation and tracking

This page presents some results from my work on particle filtering for acoustic speaker localisation and tracking using microphone arrays. It basically documents (some of) the work I carried out on this topic during my time as a Research Engineer with National ICT Australia Ltd (NICTA) in 2004, and during my postdoctoral research with the Western Australian Telecommunications Research Institute (WATRI) between 2005 and 2008.

Real-time source tracking with a 16-microphone array

As a result of my research on array-based speaker tracking in reverberant environments, a particle filter based on the concept of importance sampling was developed and implemented in real-time on a standard desktop computer, running in conjunction with a 16-microphone array. Below are a few examples of what this real-time acoustic source tracker can do. The audio signal in each movie corresponds to the data recorded with microphone no. 2, close to the bottom right-hand corner of the display. The green dot displays the estimated location of the acoustic source, and the red markers show the history of the source location estimates over the last few frames (the estimated source trajectory).
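For readers unfamiliar with this class of methods, the recursion underlying such a tracker is the standard sequential importance resampling (SIR) particle filter. Below is a minimal Python sketch of one filter iteration; the likelihood function is a placeholder (in the actual implementation it would be derived from the microphone signals, e.g. a steered-beamformer response evaluated at each particle), and the simple random-walk prediction model is illustrative only.

```python
import numpy as np

def sir_update(particles, weights, likelihood, motion_noise=0.05,
               rng=np.random.default_rng()):
    """One sequential importance resampling (SIR) step for 2-D source tracking.

    particles : (N, 2) array of candidate source positions (metres)
    weights   : (N,) normalised importance weights
    likelihood: function mapping an (N, 2) array of positions to non-negative
                scores; in practice this would be computed from the audio data.
    """
    # Prediction: propagate each particle through a simple random-walk model.
    particles = particles + rng.normal(scale=motion_noise, size=particles.shape)
    # Update: re-weight each particle by how well it explains the audio data.
    weights = weights * likelihood(particles)
    weights = weights / weights.sum()
    # Resampling: duplicate high-weight particles, discard low-weight ones.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    # Point estimate: mean of the (now uniformly weighted) particle cloud.
    return particles, weights, particles.mean(axis=0)
```

Run in a loop over audio frames, the weighted particle cloud approximates the posterior density of the source position, and its mean provides the location estimate shown as the green dot in the movies.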

The only information used by the algorithm to localise and track the acoustic source is the set of signals recorded by the 16 microphones located at known positions in the room. The dimensions of the enclosure are about 3.5m x 4.5m x 2.7m. Odd-numbered microphones are placed at a height of 1.9m, even-numbered ones at 1.3m. The reverberation time (T60) measured in the room is 0.5s, which is quite substantial for this type of application.

Please be aware that these movies require the proper Xvid codec to be displayed correctly under Windows/Mac. Otherwise, the video stream might freeze from time to time, and the video and audio streams might drift out of synchronisation. Open-source players (MPlayer, Xine, VLC) don't seem to have a problem with them though! The Xvid codec can be obtained from various places on the web.

Male speech examples:
Example #1 (261K): malespeech1.avi
Example #2 (248K): malespeech2.avi
Example #3 (250K): malespeech3.avi

Female speech examples:
Example #1 (137K): femalespeech1.avi
Example #2 (140K): femalespeech2.avi
Example #3 (199K): femalespeech3.avi
Example #4 (202K): femalespeech4.avi
Example #5 (202K): femalespeech5.avi

Pink noise examples:
Example #1 (266K): pinknoise1.avi
Example #2 (270K): pinknoise2.avi
Example #3 (252K): pinknoise3.avi

3-way conversation examples: one talker is mobile, the other two stationary
Example #1 (340K): conversation1.avi
Example #2 (348K): conversation2.avi

Integration of voice activity data

As a continuation of the research carried out during my Ph.D., I have done some more work on the problem of localising and tracking a human speaker by means of a microphone array. One of the main flaws of currently existing algorithms for acoustic source tracking (including mine, see above section!) is that these methods are based on the assumption of a stationary speech signal. Typically, they do not account for the fact that the speech produced by a human speaker can contain significant periods of silence between the separate utterances. For a practical implementation based solely on the signals recorded at a series of microphones, this fact can be quite detrimental to the tracking performance.

Imagine for instance that the position of a speaker (e.g., a teacher in a classroom) is tracked using the above method. If the algorithm is accurate, everything goes well as long as the speaker emits speech. Now imagine that this speaker stops talking, walks another 2 or 3 meters while remaining silent, then resumes his/her presentation. Because our tracking method is based solely on audio signals, tracking becomes impossible during this extended period of silence. What will happen to the tracker during this period? How can it keep track of the silent "target"? This is indeed quite tricky for currently available tracking algorithms. Because these methods do not account for such gaps in the speech signal, they will typically keep on tracking during silence gaps as if the speaker were still active. As a result of disturbances in the acoustic field (background noise and acoustic reverberation), the tracker will then be driven by noise (rather than a useful signal such as speech) and thus have a good chance of losing track of the target for an extended period of time, even after the speech signal resumes. That's basically why the existing speaker tracking literature presents algorithms that are always tested with speech signals having a minimal amount of silence in them (so as to avoid presenting bad results!).

The contribution made by our recent work on this topic is to use the output of a Voice Activity Detector (VAD) as part of the development of the source tracking method. Specifically, the VAD data is "fused" within the framework of the tracking algorithm, not simply used as a basic ON/OFF switch that would determine when the source location estimates are good/bad. As a result, the VAD data becomes an integral part of the tracking method. The details of this method are given in our paper [1].
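The exact formulation is given in [1]; conceptually, one way to picture this kind of fusion is a soft blending of the audio-derived particle likelihood with an uninformative one, controlled by the VAD's speech probability. The sketch below is purely illustrative (the function, its names, and the linear mixing rule are my simplified stand-in here, not the published method):

```python
import numpy as np

def fused_likelihood(audio_scores, p_speech):
    """Blend an audio-derived particle likelihood with a flat (uninformative)
    likelihood, according to the voice activity probability p_speech in [0, 1].

    audio_scores : (N,) non-negative scores computed from the microphone data
    p_speech     : scalar VAD output, 1 = speech present, 0 = silence
    """
    # A flat likelihood carries no positional information about the source.
    flat = np.full_like(audio_scores, audio_scores.mean())
    # During silence the audio scores are driven by noise and reverberation,
    # so they are progressively ignored; during speech they dominate.
    return p_speech * audio_scores + (1.0 - p_speech) * flat
```

With a continuous mixing of this kind, the tracker degrades gracefully as the VAD confidence drops, instead of being switched ON/OFF at some hard threshold.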

We have also done some practical experiments using a real speaker moving randomly in a room fitted with an array of 8 microphones. An exact description of the experimental setup can be found in [2,3], but basically, the room dimensions were 3.5m x 3.1m x 2.2m, the average reverberation time T60 was about 0.27s, and the average SNR about 20dB. Some movies have been recorded to demonstrate the results of this algorithm (these movies were created by Anders Johansson, who has also contributed to the development of this algorithm):
  1. Movie #1 pfvad1.avi (850kB)
  2. Movie #2 pfvad2.avi (850kB)
  3. Movie #3 pfvad3.avi (850kB)
  4. Movie #4 pfvad4.avi (900kB)
The black dot represents the true source position and the red star is the speaker position estimate delivered by the tracker. These multimedia files clearly show how the area of uncertainty (the ellipse drawn around the source location estimate) becomes bigger whenever the speaker is silent. This is a direct result of fusing the VAD output within the tracking method; it enables the algorithm to keep track of the silent source (it does so by considering any potential movement that the target might make during the silence period) and then resume tracking successfully when the speech signal starts again. (By the way, if you're wondering how we obtained the ground-truth data of the true speaker position, check out our paper [2]!)
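An uncertainty ellipse of this kind can be obtained directly from the spread of the weighted particle cloud. The sketch below shows one common way of doing this (an n-sigma ellipse from the weighted particle covariance); this is a generic construction I am assuming for illustration, not necessarily the exact computation used in the movies.

```python
import numpy as np

def uncertainty_ellipse(particles, weights, n_sigma=2.0):
    """Return the centre, semi-axis lengths and orientation (radians) of the
    n-sigma uncertainty ellipse of a weighted 2-D particle cloud."""
    mean = np.average(particles, weights=weights, axis=0)
    centred = particles - mean
    # Weighted sample covariance of the particle positions.
    cov = (weights[:, None] * centred).T @ centred
    # Eigen-decomposition: eigenvectors give the ellipse axes' directions,
    # eigenvalues give the variances along those directions.
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = n_sigma * np.sqrt(eigvals)                   # semi-axis lengths
    angle = np.arctan2(eigvecs[1, -1], eigvecs[0, -1])  # major-axis angle
    return mean, axes, angle
```

When the speaker falls silent and the particles diffuse under the dynamics model alone, the covariance grows, and so does the plotted ellipse.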

This method currently only works for a single speaker (single-target tracking). The aim is however to develop this research further and, based on this first implementation, to develop a new algorithm using VAD data computed "locally" (i.e. in the direct vicinity of the speaker, rather than globally across the entire room). This new approach could then be applied in a straightforward manner to the problem of multiple-source tracking, which is a fairly difficult engineering problem in general, and all the more so for audio-only, speech-based applications.

Improved dynamics model for target tracking

This section presents some acoustic source tracking examples obtained with an improved particle filter (PF) algorithm. Specifically, the dynamics model used in the algorithm implementation has been optimised to better represent the range of possible human motions, rather than using an all-purpose dynamics model with standard parameter settings. The details of this particular work can be found in [4,5].
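The optimised model itself is described in [4,5]; as background, a Langevin-type process is a common choice for the dynamics (prediction) step in acoustic speaker tracking, because its damping and excitation parameters can be matched to typical human motion. Below is a sketch of such a process for a single particle, with illustrative parameter values that are my own assumptions, not the optimised settings from [4,5].

```python
import numpy as np

def langevin_step(pos, vel, dt=0.05, beta=10.0, nu=0.2,
                  rng=np.random.default_rng()):
    """Propagate one particle's position/velocity through a discretised
    Langevin process: the velocity decays towards zero (damping beta) and is
    re-excited by random noise (steady-state speed parameter nu), producing
    the start/stop motion characteristic of a walking person.

    pos, vel : (2,) arrays, in metres and metres/second
    """
    a = np.exp(-beta * dt)        # velocity damping factor over one frame
    b = nu * np.sqrt(1.0 - a**2)  # noise gain giving steady-state speed nu
    vel = a * vel + b * rng.normal(size=2)
    pos = pos + vel * dt
    return pos, vel
```

Tuning beta (how quickly the velocity decorrelates) and nu (the typical speed) to measured human trajectories, rather than using generic all-purpose values, is the kind of optimisation this section refers to.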

The movies below have been generated from the tracking results obtained in a real office room. The dimensions of the environment are 3.36m x 4.43m x 2.6m, with eight microphones (represented as grey circles) located at a height of 1.55m. The frequency-averaged reverberation time measured in the room was T60 = 0.5s.
  1. Movie #1 (1.8MB)
  2. Movie #2 (1.9MB)
  3. Movie #3 (1.9MB)
In these movies, the star represents the true source position and the white circle is the speaker position estimate delivered by the particle filter. The dotted line shows the trajectory of the speaker, which was determined on the basis of the audio data itself using the high-accuracy beamforming approach described in [2]. The movies also show the area of uncertainty (ellipse), which becomes larger whenever the speaker is silent.

The main difference between these movies and the results obtained with our previous particle filtering implementations (as demonstrated on this page in the sections above, for instance) is in the evolution of the tracker's estimates during periods of silence. Whereas the estimates would simply appear "frozen" during such periods with previous implementations, the use of an optimised dynamics model here allows the tracker to keep following a silent speaker "blindly", to a certain extent, when no useful signal is available. This is demonstrated in the above movies as the white circle (PF estimate) tends to keep moving in the same general direction as the speaker during short breaks in the speech signal.


[1] Eric A. Lehmann and Anders M. Johansson, Particle filter with integrated voice activity detection for acoustic source tracking, EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 50870, 11 pages, 2007.
[2] Eric A. Lehmann and Anders M. Johansson, Experimental performance assessment of a particle filter with voice activity data fusion for acoustic speaker tracking, Proceedings of the IEEE Nordic Signal Processing Symposium (NORSIG'06), pp. 126-129, Reykjavik, Iceland, June 2006.
[3] Anders M. Johansson, Eric A. Lehmann, and Sven Nordholm, Real-time implementation of a particle filter with integrated voice activity detector for acoustic speaker tracking, Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS'06), pp. 1004-1007, Singapore, December 2006.
[4] Eric A. Lehmann, Anders M. Johansson, and Sven Nordholm, Modeling of motion dynamics and its influence on the performance of a particle filter for acoustic speaker tracking, Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'07), pp. 98-101, New Paltz, NY, USA, October 2007.
[5] Eric A. Lehmann and Anders M. Johansson, Dynamics models for acoustic speaker tracking: preliminary results, NICTA/WATRI Technical Report PRJ-NICTA-PM-023, Western Australian Telecommunications Research Institute, Perth, Australia, August 2007.