SAD model
Overview
ldc-bpcsad performs speech activity detection (SAD) as a byproduct of broad phonetic class (BPC) recognition [3, 7, 8, 9]. The speech signal is run through a GMM-HMM based recognizer trained to recognize five broad phonetic classes: vowel, stop/affricate, fricative, nasal, and glide/liquid. Each contiguous sequence of BPCs is merged into a single speech segment, and this segmentation is smoothed to eliminate spurious short pauses. Input features are 13-D PLP features plus first and second differences, extracted using a 20-channel filterbank covering 80 Hz to 4 kHz.
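The merge-and-smooth step is simple enough to sketch directly. The Python below is an illustrative reconstruction, not the tool's actual API: it assumes the recognizer output is available as time-ordered (onset, offset, label) tuples, and the `min_pause_dur` threshold is a placeholder value rather than the smoothing setting the tool actually uses.

```python
from dataclasses import dataclass

# Hypothetical label set; the real recognizer's label names may differ.
BPC_LABELS = {"vowel", "stop_affricate", "fricative", "nasal", "glide_liquid"}

@dataclass
class Segment:
    onset: float   # seconds
    offset: float  # seconds

def bpcs_to_speech(recognized, min_pause_dur=0.2):
    """Merge contiguous BPC runs into speech segments, then smooth.

    `recognized` is a time-ordered list of (onset, offset, label) tuples;
    `min_pause_dur` (seconds) is an assumed threshold below which pauses
    are absorbed into the surrounding speech.
    """
    # Step 1: merge each contiguous run of BPC labels into one segment.
    segs = []
    for onset, offset, label in recognized:
        if label not in BPC_LABELS:
            continue  # non-speech: breaks the current run
        if segs and abs(onset - segs[-1].offset) < 1e-6:
            segs[-1].offset = offset  # extends the previous run
        else:
            segs.append(Segment(onset, offset))
    # Step 2: bridge pauses shorter than min_pause_dur.
    smoothed = []
    for seg in segs:
        if smoothed and seg.onset - smoothed[-1].offset < min_pause_dur:
            smoothed[-1].offset = seg.offset
        else:
            smoothed.append(seg)
    return smoothed
```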
The system is implemented using the Hidden Markov Model Toolkit (HTK) [10].
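For concreteness, the front end described above could be written as an HTK feature-extraction configuration along the following lines. This is a plausible sketch based on the HTK Book [10], not the configuration shipped with ldc-bpcsad; the window size, frame rate, and pre-emphasis values in particular are assumptions.

```
# Hypothetical HCopy-style configuration matching the description above.
TARGETKIND  = PLP_0_D_A   # 13-D PLP (c0-c12) + deltas + accelerations
NUMCEPS     = 12          # cepstra c1-c12; c0 appended via the _0 qualifier
NUMCHANS    = 20          # 20-channel filterbank
LOFREQ      = 80.0        # filterbank lower cutoff (Hz)
HIFREQ      = 4000.0      # filterbank upper cutoff (Hz)
WINDOWSIZE  = 250000.0    # assumed 25 ms window (HTK units of 100 ns)
TARGETRATE  = 100000.0    # assumed 10 ms frame shift
USEHAMMING  = T
PREEMCOEF   = 0.97        # assumed pre-emphasis coefficient
```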
Training
The GMM and HMM transition parameters were trained on the phonetically transcribed portions of the Buckeye Corpus [4] with the following mapping from the Buckeye phone set to broad phonetic classes:
Vowel
: aa, aan, ae, aen, ah, ahn, ao, aon, aw, awn, ay, ayn, eh, ehn, ey, eyn, ih, ihn, iy, iyn, ow, own, oy, oyn, uh, uhn, uw, uwn

Stop/affricate
: p, t, k, tq, b, d, g, ch, jh, dx, nx

Fricative
: f, th, s, sh, v, dh, z, zh, hh

Nasal
: em, m, en, n, eng, ng

Glide/liquid
: el, l, er, r, w, y
All other sounds were mapped to non-speech. This includes silence and environmental noise as well as non-speech vocalizations such as laughter, breaths, and coughs.
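In code, this mapping amounts to a simple lookup table. The sketch below (the dict and class names are illustrative) returns a phone's broad phonetic class, or None for phones that map to non-speech:

```python
# Buckeye phone -> broad phonetic class, per the mapping above. Phones
# absent from this table (silence, noise, laughter, ...) are non-speech.
BUCKEYE_TO_BPC = {
    phone: bpc
    for bpc, phones in {
        "vowel": ("aa aan ae aen ah ahn ao aon aw awn ay ayn eh ehn ey eyn "
                  "ih ihn iy iyn ow own oy oyn uh uhn uw uwn"),
        "stop_affricate": "p t k tq b d g ch jh dx nx",
        "fricative": "f th s sh v dh z zh hh",
        "nasal": "em m en n eng ng",
        "glide_liquid": "el l er r w y",
    }.items()
    for phone in phones.split()
}

def phone_to_bpc(phone: str):
    """Return the broad phonetic class of a Buckeye phone, or None if non-speech."""
    return BUCKEYE_TO_BPC.get(phone.lower())
```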
Performance
Below, we present SAD performance on the DIHARD III [5, 6] eval set, both overall and by domain:
domain               accuracy   precision   recall   f1      der     dcf     fa rate   miss rate
-------------------  ---------  ----------  -------  ------  ------  ------  --------  ---------
audiobooks           96.02      96.99       97.92    97.45   5.12    4.20    10.55     2.08
broadcast_interview  94.18      96.49       95.98    96.23   7.52    6.01    11.99     4.02
clinical             88.23      90.85       90.01    90.43   19.05   11.15   14.63     9.99
court                94.68      96.33       97.31    96.82   6.40    6.58    18.25     2.69
cts                  94.16      97.83       95.57    96.69   6.55    7.65    17.30     4.43
maptask              91.87      91.39       96.46    93.86   12.63   6.77    16.43     3.54
meeting              81.53      98.52       78.71    87.51   22.47   17.33   5.46      21.29
restaurant           57.09      98.46       52.11    68.15   48.70   37.43   6.05      47.89
socio_field          86.54      97.73       84.61    90.70   17.35   13.24   6.79      15.39
socio_lab            90.58      96.37       91.03    93.62   12.40   9.44    10.84     8.97
webvideo             76.73      90.49       76.21    82.74   31.80   23.31   21.85     23.79
OVERALL              88.52      96.02       89.18    92.48   14.51   11.61   13.98     10.82
For domains with generally clean recording conditions, high SNR, and a low degree of speaker overlap, performance is good, with DER generally <10%. In the presence of substantial overlapped speech, low SNR, or challenging environmental conditions, performance degrades. This is particularly noticeable for YouTube recordings (the webvideo domain) and speech recorded in restaurants (the restaurant domain). In the latter environment, DER rises to nearly 50%. Across all domains, performance is worse than state-of-the-art for this test set, with deltas ranging from 1.88% DER (broadcast_interview) to 31.56% (restaurant).
Full explanation of table columns:
domain
– DIHARD III recording domain; overall results are reported under OVERALL

accuracy
– % of total duration correctly classified

precision
– % of detected speech that is speech according to the reference segmentation

recall
– % of speech in the reference segmentation that was detected

f1
– F1 score (computed from precision/recall)

der
– detection error rate (DER) [1]

dcf
– detection cost function (DCF) [2]; a weighted function of fa rate and miss rate

fa rate
– % of non-speech incorrectly detected as speech

miss rate
– % of speech that was not detected
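To make these definitions concrete, the sketch below computes all eight quantities from frame-level boolean labels. The DER formula (missed plus false-alarm duration over total reference speech duration) follows pyannote.metrics [1], and the DCF weighting of 0.75 × miss + 0.25 × false alarm follows the OpenSAT19 plan [2]; the frame-level formulation and frame step are assumptions about the scoring setup.

```python
import numpy as np

def sad_metrics(ref, hyp):
    """Frame-level SAD metrics from boolean labels (True = speech).

    `ref` and `hyp` are equal-length arrays with one entry per frame;
    with a fixed frame step, frame counts stand in for durations.
    Returns fractions; multiply by 100 for the percentages tabled above.
    """
    ref, hyp = np.asarray(ref, bool), np.asarray(hyp, bool)
    tp = np.sum(ref & hyp)    # speech correctly detected
    fp = np.sum(~ref & hyp)   # non-speech labeled speech (false alarm)
    fn = np.sum(ref & ~hyp)   # speech missed
    tn = np.sum(~ref & ~hyp)  # non-speech correctly rejected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # = 1 - miss rate
    fa_rate = fp / (fp + tn)
    miss_rate = fn / (tp + fn)
    return {
        "accuracy": (tp + tn) / ref.size,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # DER: missed + false-alarm duration over total reference speech.
        "der": (fn + fp) / (tp + fn),
        # DCF per OpenSAT19: 0.75 * miss + 0.25 * false alarm.
        "dcf": 0.75 * miss_rate + 0.25 * fa_rate,
        "fa rate": fa_rate,
        "miss rate": miss_rate,
    }
```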
References
1. Hervé Bredin. pyannote.metrics: a toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems. In INTERSPEECH, 3587–3591. 2017. URL: http://herve.niderb.fr/download/pdfs/Bredin2017a.pdf.
2. Frederick R. Byers. NIST Open Speech Analytic Technologies 2019 Evaluation Plan (OpenSAT19). 2018. URL: https://www.nist.gov/system/files/documents/2018/11/05/opensat19_evaluation_plan_v2_11-5-18.pdf.
3. Andrew K. Halberstadt and James R. Glass. Heterogeneous acoustic measurements for phonetic classification. In EUROSPEECH. 1997. URL: https://www.isca-speech.org/archive_v0/archive_papers/eurospeech_1997/e97_0401.pdf.
4. Mark Pitt, Laura Dilley, Keith Johnson, Scott Kiesling, William Raymond, Elizabeth Hume, and Eric Fosler-Lussier. Buckeye Corpus of Conversational Speech (2nd release). Department of Psychology, Ohio State University (Distributor), Columbus, OH, 2007. URL: www.buckeyecorpus.osu.edu.
5. Neville Ryant, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, and Mark Liberman. Third DIHARD Challenge Evaluation Plan. arXiv preprint arXiv:2006.05815, 2020. URL: https://arxiv.org/abs/2006.05815.
6. Neville Ryant, Prachi Singh, Venkat Krishnamohan, Rajat Varma, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, and Mark Liberman. The Third DIHARD Diarization Challenge. In INTERSPEECH. 2021. URL: https://arxiv.org/abs/2012.01477.
7. Tara N. Sainath, Dimitri Kanevsky, and Bhuvana Ramabhadran. Broad phonetic class recognition in a hidden Markov model framework using extended Baum-Welch transformations. In ASRU, 306–311. IEEE, 2007. URL: https://ieeexplore.ieee.org/abstract/document/4430129.
8. Tara N. Sainath and Victor Zue. A comparison of broad phonetic and acoustic units for noise robust segment-based phonetic recognition. In INTERSPEECH. 2008. URL: https://www.isca-speech.org/archive_v0/archive_papers/interspeech_2008/i08_2378.pdf.
9. Patricia Scanlon, Daniel P. W. Ellis, and Richard B. Reilly. Using broad phonetic group experts for improved speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(3):803–812, 2007. URL: https://ieeexplore.ieee.org/abstract/document/4100697.
10. Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, and others. The HTK Book. Cambridge University Engineering Department, Cambridge, UK, 2002. URL: https://ai.stanford.edu/~amaas/data/htkbook.pdf.