Acoustic source localisation and tracking
This page presents some results from my work on particle filtering for acoustic speaker localisation and tracking using microphone arrays. It basically documents (some of) the work I carried out on this topic during my time as a Research Engineer with National ICT Australia Ltd (NICTA) in 2004, and during my postdoctoral research with the Western Australian Telecommunications Research Institute (WATRI) between 2005 and 2008.
Real-time source tracking with a 16-microphone array
As a result of my research on array-based speaker tracking in reverberant environments, a particle filter using the concept of importance sampling was developed and implemented in real time on a standard desktop computer, running in conjunction with a 16-microphone array. Below are a few examples of what this real-time acoustic source tracker can do. The audio signal in each movie corresponds to the data recorded with microphone no. 2, close to the bottom right-hand corner of the display. The green dot shows the estimated location of the acoustic source, and the red markers show the history of the source location estimates over the last few frames (the estimated source trajectory).
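To give a flavour of how such a tracker operates, here is a minimal sketch of one bootstrap particle filter cycle (predict, importance-weight, resample) in Python. The random-walk motion model, the generic likelihood callable, and all names and parameters here are simplified illustrations of the general technique, not the actual implementation described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, likelihood, motion_std=0.05):
    """One predict/update/resample cycle of a bootstrap particle filter.

    particles : (N, 2) array of candidate source positions (x, y) in metres
    weights   : (N,) importance weights, summing to one
    likelihood: callable mapping an (N, 2) array of positions to (N,)
                pseudo-likelihood values derived from the microphone data
    """
    n = len(particles)
    # Predict: propagate particles through a simple random-walk motion model.
    particles = particles + rng.normal(scale=motion_std, size=particles.shape)
    # Update: re-weight each particle by how well it explains the data.
    weights = weights * likelihood(particles)
    weights /= weights.sum()
    # Resample (systematic) when the effective sample size collapses.
    if 1.0 / np.sum(weights**2) < n / 2:
        positions = (rng.random() + np.arange(n)) / n
        idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    # Point estimate: the weighted mean of the particle cloud (the "green dot").
    estimate = np.average(particles, weights=weights, axis=0)
    return particles, weights, estimate
```

In a real tracker the likelihood would be computed from the array signals; here any function peaking at the source position will make the cloud converge.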
The only information used by the algorithm to localise and track the acoustic source is the set of signals recorded by the 16 microphones, which are located at known positions in the room. The dimensions of the enclosure are about 3.5m x 4.5m x 2.7m. Odd-numbered microphones are placed at a height of 1.9m, even-numbered microphones at 1.3m. The reverberation time (T60) measured in the room is 0.5s, which is quite substantial for this type of application.
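The measurement front end is not detailed on this page, but a standard way of turning a pair of microphone signals into location evidence in a reverberant room is a time-delay estimate via the GCC-PHAT cross-correlation. The sketch below illustrates that technique only; the function name and parameters are mine and should not be read as the original implementation.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the time delay (in seconds) of `sig` relative to `ref`
    using the GCC-PHAT cross-correlation, which whitens the cross
    spectrum so the peak stays sharp under moderate reverberation."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.maximum(np.abs(cross), 1e-12)  # phase transform (PHAT)
    cc = np.fft.irfft(cross, n=n)
    # Re-centre the correlation so negative and positive lags are adjacent.
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

With microphone positions known, each pairwise delay constrains the source to a hyperboloid, and the intersection of several such constraints (or a likelihood built from them) localises the source.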
Please be aware that these movies require the proper Xvid codec to be displayed correctly under Windows/Mac. Otherwise, the video stream might freeze from time to time, and the video and audio streams might drift out of synchronisation. Open-source players (MPlayer, Xine, VLC) don't seem to have a problem with it though! The codec itself is freely available from various places on the web.
Male speech examples:
Example #1 (261K): malespeech1.avi
Example #2 (248K): malespeech2.avi
Example #3 (250K): malespeech3.avi
Female speech examples:
Example #1 (137K): femalespeech1.avi
Example #2 (140K): femalespeech2.avi
Example #3 (199K): femalespeech3.avi
Example #4 (202K): femalespeech4.avi
Example #5 (202K): femalepseech5.avi
Pink noise examples:
Example #1 (266K): pinknoise1.avi
Example #2 (270K): pinknoise2.avi
Example #3 (252K): pinknoise3.avi
3-way conversation examples (one talker is mobile, the other two stationary):
Example #1 (340K): conversation1.avi
Example #2 (348K): conversation2.avi
Speaker tracking with voice activity data
As a continuation of the research carried out during my Ph.D., I have done some more work on the problem of localising and tracking a human speaker by means of a microphone array. One of the main flaws of existing algorithms for acoustic source tracking (including mine, see the above section!) is that these methods assume a continuously active speech signal. Typically, they do not account for the fact that the speech produced by a human speaker can contain significant periods of silence between separate utterances. For a practical implementation based solely on the signals recorded at a series of microphones, this can be quite detrimental to the tracking performance.
Imagine for instance that the position of a speaker (e.g., a teacher in a classroom) is tracked using the above method. If the algorithm is accurate, everything goes well as long as the speaker emits speech. Now imagine that this speaker stops talking, walks another 2 or 3 metres while remaining silent, then resumes his/her presentation. Because our tracking method is based solely on audio signals, tracking becomes impossible during this extended period of silence. What will happen to the tracker during this period? How can it keep track of the silent "target"? This is indeed quite tricky for currently available tracking algorithms. Because these methods do not account for gaps in the speech signal, they will typically keep on tracking during silence gaps as if the speaker were still active. As a result of disturbances in the acoustic field (background noise and acoustic reverberation), the tracker will then be driven by noise rather than a useful signal such as speech, and thus has a good chance of losing track of the target for an extended period of time, even after the speech signal resumes. That's basically why the existing speaker tracking literature presents algorithms that are almost always tested with speech signals containing a minimal amount of silence (so as to avoid presenting bad results!).
The contribution made by our recent work on this topic is to use the output of a Voice Activity Detector (VAD) as part of the development of the source tracking method. Specifically, the VAD data is "fused" within the framework of the tracking algorithm, not simply used as a basic ON/OFF switch that would determine when the source location estimates are good/bad. As a result, the VAD data becomes an integral part of the tracking method. The details of this method are given in our paper [1].
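One simple way to picture such a fusion (a sketch of the general idea only, not the exact formulation from the paper) is a soft mixture in the particle weight update: when the VAD reports speech, the acoustic likelihood drives the weights; when it reports silence, the update tends towards uninformative weights, so the particle cloud coasts on its motion model instead of chasing noise.

```python
import numpy as np

def fused_weight_update(weights, audio_likelihood, p_active):
    """Blend the acoustic likelihood with VAD information.

    p_active is the VAD's probability that the speaker is currently
    active. Near 1, the acoustic likelihood dominates the update; near
    0, the blended term becomes flat (uninformative), leaving the
    weights essentially unchanged rather than letting reverberation
    and background noise drag the tracker away.
    """
    uninformative = np.ones_like(audio_likelihood)
    blended = p_active * audio_likelihood + (1.0 - p_active) * uninformative
    weights = weights * blended
    return weights / weights.sum()
```

This is the sense in which the VAD output is "fused" rather than used as an ON/OFF switch: it enters the update continuously, scaling how much the audio data is trusted at each frame.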
We have also done some practical experiments using a real speaker moving randomly in a room fitted with an array of 8 microphones. An exact description of the experimental setup can be found in [2,3], but basically, the room dimensions were 3.5m x 3.1m x 2.2m, the average reverberation time T60 was about 0.27s, and the average SNR about 20dB. Some movies were recorded to demonstrate the results of this algorithm (these movies were created by Anders Johansson, who also contributed to the development of this algorithm).
This method currently only works for a single speaker (single-target tracking). The aim is however to develop this research further and, based on this first implementation, to develop a new algorithm using VAD data computed "locally" (i.e., in the direct vicinity of the speaker rather than globally across the entire room). This new approach could then be applied in a straightforward manner to the problem of multiple-source tracking, which is a fairly difficult engineering problem in general, and even more so for audio-only, speech-based applications.
References
[1] Eric A. Lehmann and Anders M. Johansson, "Particle filter with integrated voice activity detection for acoustic source tracking," EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 50870, 11 pages, 2007.
[2] Eric A. Lehmann and Anders M. Johansson, "Experimental Performance Assessment of a Particle Filter with Voice Activity Data Fusion for Acoustic Speaker Tracking," Proceedings of the IEEE Nordic Signal Processing Symposium (NORSIG'06), pp. 126-129, Reykjavik, Iceland, June 2006.
[3] Anders M. Johansson, Eric A. Lehmann, and Sven Nordholm, "Real-time implementation of a particle filter with integrated voice activity detector for acoustic speaker tracking," Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS'06), pp. 1004-1007, Singapore, December 2006.
[4] Eric A. Lehmann, Anders M. Johansson, and Sven Nordholm, "Modeling of Motion Dynamics and its Influence on the Performance of a Particle Filter for Acoustic Speaker Tracking," Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'07), pp. 98-101, New Paltz, NY, USA, October 2007.
[5] Eric A. Lehmann and Anders M. Johansson, "Dynamics Models for Acoustic Speaker Tracking: Preliminary Results," NICTA/WATRI Technical Report PRJ-NICTA-PM-023, Western Australian Telecommunications Research Institute, Perth, Australia, August 2007.