As a continuation of the
research carried out during my Ph.D., I have done some more work on the
problem of localising and tracking a human speaker by means of a
microphone array. One of the main flaws of currently existing
algorithms for acoustic source tracking (including
mine!) is that these methods are based on the
assumption of a stationary speech signal. Typically, they do not
account for the fact that the speech produced by a human speaker can
contain significant periods of silence between the separate utterances.
For a practical implementation based solely on the signals recorded at
a series of microphones, this fact can be quite detrimental to the
tracking performance.
Imagine for instance that the position of a speaker (e.g., a teacher in
a classroom) is tracked using the above method. If the algorithm is
accurate, everything goes well as long as the speaker emits speech. Now
imagine that this speaker stops talking, walks another 2 or 3 meters
while remaining silent, then resumes his/her presentation. Because our
tracking method is based solely on audio signals, tracking becomes
impossible during this extended period of silence. What will happen to
the tracker during this period? How can it keep track of the silent
"target"? This is indeed quite tricky for currently available tracking
algorithms. Because the development of these methods does not account for
these gaps in the speech signal, they will typically keep on tracking
during silence gaps as if the speaker were still active. As a result of
disturbances in the acoustic field (background noise and acoustic
reverberation), the tracker will then be driven by noise (rather than a
useful signal such as speech) and thus have a good chance of losing
track of the target for an extended period of time, even after the
speech signal resumes. That's basically why the existing speaker
tracking literature presents algorithms that are always tested with
speech signals having a minimal amount of silence in them (so as to
avoid presenting bad results!).

The contribution made by our recent work on this topic is to use the
output of a Voice Activity Detector (VAD) as part of the development of
the source tracking method. Specifically, the VAD data is "fused"
within the framework of the tracking algorithm, not simply used as a
basic ON/OFF switch that would determine when the source location
estimates are good/bad. As a result, the VAD data becomes an integral
part of the tracking method. The details of this method are given in
our paper [1].
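
To make the difference between "fusing" and "switching" a little more concrete, here is a minimal Python sketch of one way a VAD output could enter the weight update of a particle filter. It is an illustration only, not the algorithm from [1]: the mixture form of the likelihood, the Gaussian measurement model, and the names `fused_likelihood`, `update_weights`, `speech_prob` and `sigma` are all assumptions made for this example.

```python
import math


def gaussian_likelihood(particle, measurement, sigma):
    """Isotropic 2-D Gaussian likelihood of a location measurement (assumed model)."""
    dx = measurement[0] - particle[0]
    dy = measurement[1] - particle[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)


def fused_likelihood(particle, measurement, speech_prob, sigma=0.2, room_area=3.5 * 3.1):
    """Blend an informative and an uninformative term according to the VAD output.

    speech_prob is a hypothetical soft VAD output in [0, 1]. The uniform term
    stands for "this measurement says nothing about the source position".
    """
    uniform = 1.0 / room_area
    informative = gaussian_likelihood(particle, measurement, sigma)
    return speech_prob * informative + (1.0 - speech_prob) * uniform


def update_weights(particles, weights, measurement, speech_prob):
    """Multiply each weight by the fused likelihood, then renormalise."""
    new = [w * fused_likelihood(p, measurement, speech_prob)
           for p, w in zip(particles, weights)]
    total = sum(new)
    if total == 0.0:
        # Every likelihood underflowed: keep the previous weights.
        return list(weights)
    return [w / total for w in new]


# Two particles, one of them sitting exactly on the measured location.
particles = [(1.0, 1.0), (3.0, 2.0)]
weights = [0.5, 0.5]
measurement = (1.0, 1.0)

silent = update_weights(particles, weights, measurement, speech_prob=0.0)
active = update_weights(particles, weights, measurement, speech_prob=1.0)
```

With `speech_prob` close to 1 the weights follow the location measurement; with `speech_prob` close to 0 the measurement is effectively ignored and the weights stay as they were. Everything in between is handled smoothly rather than by a hard ON/OFF decision.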
We have also done some practical experiments using a real speaker
moving randomly in a room fitted with an array of 8 microphones. An
exact description of the environmental setup can be found in [2,3], but
basically, the room dimensions were 3.5m x 3.1m x 2.2m, the average
reverberation time T60 was about 0.27s, and the average SNR about 20dB.
Some movies have been recorded to demonstrate the results of this
algorithm (these movies were created by Anders Johansson, who has also
contributed to the development of this algorithm):
- Movie #1 pfvad1.avi (850kB)
- Movie #2 pfvad2.avi (850kB)
- Movie #3 pfvad3.avi (850kB)
- Movie #4 pfvad4.avi (900kB)
The black dot represents the true source position and the red star is
the speaker position estimate delivered by the tracker. These
multi-media files typically show how the area of uncertainty (ellipse
drawn around the source location estimate) becomes bigger whenever the
speaker is silent. This is a direct result of fusing the VAD output
within the tracking method and enables the algorithm to keep track of
the silent source (it actually does that by considering any potential
movement that the target might make during this silence period) and
then resume tracking successfully when the speech signal starts again.
(BTW, if you're wondering how we obtained the ground-truth data of the
true speaker position, check out our paper [2]!).
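
The growing area of uncertainty can be reproduced with a toy example. The Python sketch below assumes a simple random-walk motion model and treats silence as "no informative measurement at all", so the particles are only propagated and never re-weighted. Both are simplifications chosen for illustration, and the names `predict`, `spread` and `step_std` are hypothetical rather than taken from our implementation.

```python
import random


def predict(particles, step_std, rng):
    """Random-walk motion model (assumed): diffuse each particle with Gaussian noise."""
    return [(x + rng.gauss(0.0, step_std), y + rng.gauss(0.0, step_std))
            for x, y in particles]


def spread(particles):
    """Root-mean-square distance of the particles from their mean position."""
    n = len(particles)
    mean_x = sum(x for x, _ in particles) / n
    mean_y = sum(y for _, y in particles) / n
    return (sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in particles) / n) ** 0.5


rng = random.Random(0)               # fixed seed so the run is repeatable
particles = [(1.75, 1.55)] * 500     # all particles start at the same point
history = []
for _ in range(20):                  # 20 silent frames: predict only, no update
    particles = predict(particles, step_std=0.05, rng=rng)
    history.append(spread(particles))
```

The values in `history` grow roughly with the square root of the number of silent frames, which is the numerical counterpart of the ellipse getting bigger in the movies. Once speech resumes, an informative weight update would concentrate the particles again.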
This method currently only works for a single speaker (single-target
tracking). The aim is however to develop this research further, and
based on this first implementation, to develop a new algorithm using
VAD data computed "locally" (i.e. in the direct vicinity of the speaker
rather than globally across the entire room). This new approach could
then be applied in a straightforward manner to the problem of
multiple-source tracking, which is a fairly difficult engineering
problem in general, and even more so in audio-only, speech-based
applications.
References
[1] Eric A. Lehmann and Anders M. Johansson, Particle filter with integrated voice activity detection for acoustic source tracking, EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 50870, 11 pages, 2007.
[2] Eric A. Lehmann and Anders M. Johansson, Experimental Performance Assessment of a Particle Filter with Voice Activity Data Fusion for Acoustic Speaker Tracking, Proceedings of the IEEE Nordic Signal Processing Symposium (NORSIG'06), pp. 126-129, Reykjavik, Iceland, June 2006.
[3] Anders M. Johansson, Eric A. Lehmann, and Sven Nordholm, Real-time implementation of a particle filter with integrated voice activity detector for acoustic speaker tracking, Proceedings of the IEEE Asia Pacific Conference on Circuits and Systems (APCCAS'06), pp. 1004-1007, Singapore, December 2006.