Command-line tool
Basic usage
The easiest way to perform speech activity detection (SAD) for a set of audio files is via the ldc-bpcsad command-line tool. To perform SAD for channel 1 of each of a set of audio files rec1.flac, rec2.flac, rec3.flac, … and output their segmentation as HTK label files under the directory label_dir:
ldc-bpcsad --channel 1 --output-dir label_dir rec1.flac rec2.flac rec3.flac ...
This will result in one label file per input file (e.g., rec1.lab, rec2.lab, …), each of the form:
0.00 1.05 nonspeech
1.05 3.55 speech
3.55 4.65 nonspeech
...
Script files
It is also possible to specify the audio files and channels to be processed using a script file specified via the --scp flag. The script file format is selected via the --scp-fmt flag. Currently, two script file formats are supported:
htk – HTK script file (default)
json – JSON script file
HTK script file
If --scp-fmt htk is specified, ldc-bpcsad will load the audio files to be segmented from an HTK script file. An HTK script file consists of a list of file paths, one path per line; e.g.:
/path/to/rec1.flac
/path/to/rec2.flac
/path/to/rec3.flac
For instance, if task.scp is the above HTK script file, then:
ldc-bpcsad --channel 1 --output-dir label_dir --scp-fmt htk --scp task.scp
is equivalent to:
ldc-bpcsad --channel 1 --output-dir label_dir /path/to/rec1.flac /path/to/rec2.flac /path/to/rec3.flac
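A script file like this can also be generated programmatically. The following is a minimal Python sketch that assumes only the one-path-per-line layout described above; the helper names write_htk_scp and read_htk_scp are hypothetical and are not part of ldc-bpcsad:

```python
import os
import tempfile


def write_htk_scp(audio_paths, scp_path):
    """Write one audio file path per line (hypothetical helper)."""
    with open(scp_path, "w", encoding="utf-8") as f:
        for path in audio_paths:
            f.write(f"{path}\n")


def read_htk_scp(scp_path):
    """Return the non-empty lines of a script file as a list of paths."""
    with open(scp_path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Round-trip demo in a temporary directory.
paths = ["/path/to/rec1.flac", "/path/to/rec2.flac", "/path/to/rec3.flac"]
with tempfile.TemporaryDirectory() as tmp_dir:
    scp_path = os.path.join(tmp_dir, "task.scp")
    write_htk_scp(paths, scp_path)
    loaded = read_htk_scp(scp_path)
print(loaded == paths)
```

A file written this way could then be passed to ldc-bpcsad via --scp with --scp-fmt htk.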
JSON script file
If --scp-fmt json is specified, ldc-bpcsad will load both the audio files and the channels to be segmented from a JSON file. The JSON file should consist of an array of JSON objects, each containing the following three key-value pairs:
audio_path – Path to audio file to perform SAD on.
channel – Channel number of audio file to perform SAD on (1-indexed).
channel_id – Basename for output file containing SAD result.
E.g.:
[
  {"audio_path": "/path/to/rec1.flac", "channel_id": "rec1_c1", "channel": 1},
  {"audio_path": "/path/to/rec1.flac", "channel_id": "rec1_c2", "channel": 2},
  {"audio_path": "/path/to/rec2.flac", "channel_id": "rec2_c1", "channel": 1}
]
For instance, if task.json is the above JSON file, then:
ldc-bpcsad --output-dir label_dir --scp-fmt json --scp task.json
will output the following three HTK label files to label_dir:
rec1_c1.lab – result of SAD for channel 1 of rec1.flac
rec1_c2.lab – result of SAD for channel 2 of rec1.flac
rec2_c1.lab – result of SAD for channel 1 of rec2.flac
Note
When using a JSON script file, the --channel flag has no effect.
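For recordings with several channels, writing the JSON script file by hand gets tedious. The sketch below builds one entry per (audio file, channel) pair using the three keys described above. It assumes the number of channels per file is already known, and the channel_id naming scheme simply mirrors the example; build_json_script is a hypothetical helper, not part of ldc-bpcsad:

```python
import json
from pathlib import PurePosixPath


def build_json_script(channel_counts):
    """Return one script entry per (audio file, channel) pair.

    channel_counts maps an audio path to its number of channels, which is
    assumed to be known in advance.
    """
    entries = []
    for audio_path, n_channels in channel_counts.items():
        stem = PurePosixPath(audio_path).stem
        for channel in range(1, n_channels + 1):  # channels are 1-indexed
            entries.append({
                "audio_path": audio_path,
                "channel_id": f"{stem}_c{channel}",  # naming is illustrative
                "channel": channel,
            })
    return entries


entries = build_json_script({"/path/to/rec1.flac": 2, "/path/to/rec2.flac": 1})
script_text = json.dumps(entries, indent=2)
print(script_text)
```

Saving script_text to task.json should give a file equivalent to the example above.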
Output formats
The output file format for SAD output can be specified via the --output-fmt flag. Currently, four options are available:
htk – HTK label file (default)
rttm – RTTM file
audacity – Audacity label file
textgrid – Praat TextGrid
HTK label file
If --output-fmt htk is specified, SAD output will be stored as HTK label files. Each label file contains one segment per line, each line having the form:
<ONSET>\t<OFFSET>\t<LABEL>
where:
ONSET – onset of segment in seconds from beginning of recording
OFFSET – offset of segment in seconds from beginning of recording
LABEL – segment label; either “speech” or “nonspeech”
The segments are stored in order with the following guarantees:
the onset of the first segment is always 0
the offset of the final segment is always equal to the recording duration
the offset of segment n equals the onset of segment n+1
E.g.:
0.00 1.05 nonspeech
1.05 3.55 speech
3.55 4.65 nonspeech
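Downstream scripts often need to read these label files back in. The following Python sketch parses the line format described above, checks the contiguity guarantees, and sums the speech time. The function names are hypothetical and the tolerance value is an assumption; this is not code from ldc-bpcsad itself:

```python
def parse_htk_labels(text):
    """Parse label text into a list of (onset, offset, label) tuples."""
    segments = []
    for line in text.splitlines():
        fields = line.split()  # accepts tabs or spaces
        if not fields:
            continue
        onset, offset, label = fields
        segments.append((float(onset), float(offset), label))
    return segments


def is_contiguous(segments, tol=1e-6):
    """Check that segments start at 0 and each offset meets the next onset."""
    if not segments or abs(segments[0][0]) > tol:
        return False
    return all(abs(cur[1] - nxt[0]) <= tol
               for cur, nxt in zip(segments, segments[1:]))


def total_speech(segments):
    """Sum the durations (in seconds) of all segments labeled "speech"."""
    return sum(off - on for on, off, label in segments if label == "speech")


example = "0.00\t1.05\tnonspeech\n1.05\t3.55\tspeech\n3.55\t4.65\tnonspeech\n"
segments = parse_htk_labels(example)
print(is_contiguous(segments), round(total_speech(segments), 2))
```

The check on the final offset (equal to the recording duration) is left out because it requires the audio file itself.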
RTTM file
If --output-fmt rttm is specified, SAD output will be stored as Rich Transcription Time Marked (RTTM) files. Each RTTM file contains one speech segment per line, with each line having the form:
SPEAKER <FILE-ID> <CHANNEL> <ONSET> <DURATION> <NA> <NA> speaker <NA> <NA>
where:
FILE-ID – file name; the basename of the audio file that the turn is on, minus extension (e.g., rec1_a)
CHANNEL – the channel number of the turn on the audio file (1-indexed)
ONSET – onset of turn in seconds from beginning of recording
DURATION – duration of turn in seconds
E.g.:
SPEAKER rec1 1 1.05 2.50 <NA> <NA> speaker <NA> <NA>
SPEAKER rec1 1 4.00 3.31 <NA> <NA> speaker <NA> <NA>
SPEAKER rec1 1 10.11 4.15 <NA> <NA> speaker <NA> <NA>
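Because RTTM stores onset and duration rather than onset and offset, a small conversion step is usually needed before comparing against the other formats. A minimal Python sketch, assuming the ten-field line layout shown above (parse_rttm is a hypothetical helper name):

```python
def parse_rttm(text):
    """Parse RTTM text into (file_id, channel, onset, offset) tuples."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 10 or fields[0] != "SPEAKER":
            raise ValueError(f"unexpected RTTM line: {line!r}")
        onset = float(fields[3])
        duration = float(fields[4])
        # Convert onset/duration to onset/offset.
        turns.append((fields[1], int(fields[2]), onset, onset + duration))
    return turns


rttm_text = (
    "SPEAKER rec1 1 1.05 2.50 <NA> <NA> speaker <NA> <NA>\n"
    "SPEAKER rec1 1 4.00 3.31 <NA> <NA> speaker <NA> <NA>\n"
    "SPEAKER rec1 1 10.11 4.15 <NA> <NA> speaker <NA> <NA>\n"
)
turns = parse_rttm(rttm_text)
speech_seconds = sum(offset - onset for _, _, onset, offset in turns)
print(len(turns), round(speech_seconds, 2))
```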
Audacity label file
If --output-fmt audacity is specified, SAD output will be stored as Audacity label files. As none of the optional features of this file format (e.g., frequency ranges) are used, the resulting files are identical to the HTK label files described previously; this option is functionally an alias for --output-fmt htk, except with a different file extension (HTK: .lab, Audacity: .txt).
Praat TextGrid
If --output-fmt textgrid is specified, SAD output will be stored as Praat TextGrid files. Each TextGrid file will contain a single IntervalTier named sad, consisting of a sequence of intervals whose attributes should be interpreted as follows:
xmin – onset of segment in seconds from beginning of recording
xmax – offset of segment in seconds from beginning of recording
text – segment label; either “speech” or “nonspeech”
E.g.:
File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 4.65
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "sad"
        xmin = 0
        xmax = 5.0
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 1.05
            text = "non-speech"
        intervals [2]:
            xmin = 1.05
            xmax = 3.55
            text = "speech"
        intervals [3]:
            xmin = 3.55
            xmax = 4.65
            text = "non-speech"
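If Praat itself is not available, the intervals can be pulled out of such a file with a regular expression. This is a rough Python sketch that assumes the long text layout shown in the example and labels without embedded quotes; it does not handle Praat's short text format, and parse_textgrid_intervals is a hypothetical helper name:

```python
import re

# Matches one "intervals [i]:" entry with its xmin, xmax, and text fields.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*xmin = (\S+)\s*xmax = (\S+)\s*text = "([^"]*)"'
)


def parse_textgrid_intervals(text):
    """Return (xmin, xmax, label) tuples for every interval in the text."""
    return [(float(xmin), float(xmax), label)
            for xmin, xmax, label in INTERVAL_RE.findall(text)]


textgrid_text = '''
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 1.05
            text = "non-speech"
        intervals [2]:
            xmin = 1.05
            xmax = 3.55
            text = "speech"
'''
intervals = parse_textgrid_intervals(textgrid_text)
print(intervals)
```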
Postprocessing
By default, ldc-bpcsad postprocesses its output to eliminate speech segments shorter than 500 ms and nonspeech segments shorter than 300 ms. While these defaults are suitable for SAD that is being done as a precursor to transcription by human annotators, they may be overly restrictive for other uses. If necessary, the minimum speech and nonspeech segment durations may be changed via the --speech and --nonspeech flags. For instance, to instead use minimum durations of 250 ms for speech and 100 ms for nonspeech:
ldc-bpcsad --channel 1 --output-dir label_dir --speech 0.250 --nonspeech 0.100 rec1.flac rec2.flac rec3.flac
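To make the effect of the two thresholds concrete, here is a small Python sketch of this kind of smoothing applied to a list of speech segments. It is an illustration only: the order of operations (merge across short gaps first, then drop short speech segments) is an assumption, and smooth_segments is a hypothetical function, not the implementation used by ldc-bpcsad:

```python
def smooth_segments(speech_segs, min_speech=0.5, min_nonspeech=0.3):
    """Smooth a list of (onset, offset) speech segments, in seconds.

    Assumed order: first merge speech segments separated by a gap shorter
    than min_nonspeech, then drop speech segments shorter than min_speech.
    """
    merged = []
    for onset, offset in sorted(speech_segs):
        if merged and onset - merged[-1][1] < min_nonspeech:
            # Gap is too short to count as nonspeech; extend previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return [(on, off) for on, off in merged if off - on >= min_speech]


raw = [(1.0, 1.2), (1.3, 2.0), (5.0, 5.1), (8.0, 9.0)]
smoothed = smooth_segments(raw)
print(smoothed)
```

With the default thresholds, the first two segments are joined across their 100 ms gap and the isolated 100 ms segment starting at 5.0 is removed.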
Audio formats
This section describes the default supported input audio file formats. As audio IO is handled by the soundfile Python package, additional formats may be supported depending on your installed version of soundfile. To see if additional formats are supported, run:
ldc-bpcsad -h
and check the audio file formats list at the end of the help message.
Supported formats:
.aiff, .aif – AIFF (Apple/SGI)
.au, .snd – AU (Sun/NeXT)
.avr – AVR (Audio Visual Research)
.caf – CAF (Apple Core Audio File)
.flac – FLAC (Free Lossless Audio Codec)
.htk – HTK (HMM Tool Kit)
.iff – IFF (Amiga IFF/SVX8/SV16)
.mat, .mat4, .mat5 – Matlab 4.2/5.0 (GNU Octave 2.0/2.1)
.mpc – Musepack MPC (Akai MPC 2k)
.ogg, .vorbis – OGG Vorbis compressed audio
.paf, .fap – Ensoniq PARIS file format
.pvf – PVF (Portable Voice Format)
.rf64 – EBU RF64 enhancement of MBWF
.sd2 – Sound Designer 2 format
.sds – MIDI Sample Dump Standard
.sf – IRCAM SDIF (Institut de Recherche et Coordination Acoustique/Musique Sound Description Interchange Format)
.sph, .nist, .wav – NIST SPHERE format; SHORTEN compression is not supported
.voc – Sound Blaster VOC files
.w64 – Sonic Foundry 64-bit RIFF/WAV format
.wav – Microsoft .WAV RIFF format
.wve – Psion 8-bit A-law
.xi – Fasttracker 2 Extended Instrument format
Debugging
When ldc-bpcsad has a problem segmenting a file (e.g., bad file path, unsupported format, HTK error), it will log to STDERR that a problem was encountered and skip the file:
ldc-bpcsad --channel 1 --output-dir label_dir rec1.flac rec2.sph
SAD failed for channel 1 of "rec2.sph". Skipping. For more details rerun with the --debug flag.
To troubleshoot precisely why a file was skipped, rerun with the --debug flag. This will enable debug mode, which produces much more verbose output including, among other items:
whether or not the audio file exists and is in an understood format
whether or not the selected channel exists on the audio file
basic properties of audio file (e.g., format, number of channels, duration)
any exceptions that arose during decoding
E.g.:
ldc-bpcsad --debug --channel 1 --output-dir label_dir rec1.flac rec2.sph
SAD failed for channel 1 of "rec1.flac". Skipping. For more details rerun with the --debug flag.
DEBUG: COMMAND LINE CALL: /usr/local/bin/ldc-bpcsad --debug --channel 1 --output-dir label_dir rec1.flac rec2.sph
DEBUG: Flag "--n-jobs" is ignored for debug mode. Using single-threaded implementation.
DEBUG: Progress bar is disabled for debug mode.
DEBUG:
DEBUG: ########################################################################
DEBUG: Attempting SAD.
DEBUG: ########################################################################
DEBUG: Source audio file: rec1.flac
samplerate: 16000 Hz
channels: 1
duration: 1e+01:4.000 min
format: FLAC (Free Lossless Audio Codec) [FLAC]
subtype: Signed 16 bit PCM [PCM_16]
endian: FILE
sections: 1
frames: 9664000
extra_info: """
    File : rec1.flac
    Length : 6680195
    FLAC Stream Metadata
    Channels : 1
    Sample rate : 16000
    Frames : 9664000
    Bit width : 16
    Vorbis Comment Metadata
    comment : Processed by SoX
    End
    """
DEBUG:
DEBUG: Source channel: 1.
DEBUG:
DEBUG: Decoding chunk: CHUNK_ONSET: 0.000, CHUNK_OFFSET: 604.000, CHUNK_DUR: 604.000
DEBUG: Saving SAD to "label_dir/rec1.lab".
DEBUG: Output file format: htk.
DEBUG: ########################################################################
DEBUG: Attempting SAD.
DEBUG: ########################################################################
DEBUG: Error opening 'rec2.sph': File contains data in an unimplemented format.
DEBUG: To see supported formats, run:
DEBUG:
DEBUG: ldc-bpcsad --help
WARNING: SAD failed for channel 1 of "rec2.sph". Skipping. For more details rerun with the --debug flag.
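When processing large batches, it can be handy to collect the skipped files from a captured STDERR log so that only those are rerun in debug mode. The Python sketch below matches the failure message as it appears in the examples above; the exact wording is assumed from those examples and may differ between versions, and find_skipped is a hypothetical helper:

```python
import re

# Pattern based on the failure message shown in the examples above.
SKIP_RE = re.compile(r'SAD failed for channel (\d+) of "([^"]+)"\. Skipping\.')


def find_skipped(stderr_text):
    """Return unique (audio_path, channel) pairs reported as skipped."""
    skipped = []
    for match in SKIP_RE.finditer(stderr_text):
        item = (match.group(2), int(match.group(1)))
        if item not in skipped:
            skipped.append(item)
    return skipped


log_text = (
    'SAD failed for channel 1 of "rec2.sph". Skipping. '
    "For more details rerun with the --debug flag.\n"
)
skipped = find_skipped(log_text)
print(skipped)
```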
Usage
Perform speech activity detection on audio files.
usage: ldc-bpcsad [-h] [--channel CHAN] [--scp SCP] [--scp-fmt SCP-FMT]
                  [--output-dir OUTPUT-DIR] [--output-fmt OUTPUT-FMT]
                  [--speech SPEECH-DUR] [--nonspeech NONSPEECH-DUR]
                  [--speech-scale-factor SPEECH-SCALE] [--disable-progress]
                  [--debug] [--n-jobs INT] [--version]
                  [audio-path ...]
Positional Arguments
- audio-path
audio files to be processed
Named Arguments
- --channel
channel (1-indexed) to process on each audio file (Default: 1)
- --scp
path to script file (Default: None)
- --scp-fmt
Possible choices: htk, json
script file format (Default: “htk”)
- --output-dir
output segmentations to OUTPUT-DIR (Default: current directory)
- --output-fmt
Possible choices: audacity, htk, rttm, textgrid
output file format (Default: “htk”)
- --speech
filter speech segments shorter than SPEECH-DUR seconds (Default: 0.5)
- --nonspeech
merge speech segments separated by less than NONSPEECH-DUR seconds (Default: 0.3)
- --speech-scale-factor
post-multiply speech model acoustic likelihoods by SPEECH-SCALE (Default: 1.0)
- --disable-progress
disable progress bar
Default: False
- --debug
enable DEBUG mode
Default: False
- --n-jobs, -j
set num threads to use (Default: 1)
- --version
show program’s version number and exit
audio file formats: AIFF (Apple/SGI), AU (Sun/NeXT), AVR (Audio Visual Research), CAF (Apple Core Audio File), FLAC (Free Lossless Audio Codec), HTK (HMM Tool Kit), IFF (Amiga IFF/SVX8/SV16), MAT4 (GNU Octave 2.0 / Matlab 4.2), MAT5 (GNU Octave 2.1 / Matlab 5.0), MPC (Akai MPC 2k), OGG (OGG Container format), PAF (Ensoniq PARIS), PVF (Portable Voice Format), RAW (header-less), RF64 (RIFF 64), SD2 (Sound Designer II), SDS (Midi Sample Dump Standard), SF (Berkeley/IRCAM/CARL), VOC (Creative Labs), W64 (SoundFoundry WAVE 64), WAV (Microsoft), WAV (NIST Sphere), WAVEX (Microsoft), WVE (Psion Series 3), XI (FastTracker 2)