Command-line tool 

Basic usage 

The easiest way to perform speech activity detection (SAD) for a set of audio files is via the ldc-bpcsad command line tool. To perform SAD for channel 1 of each of a set of audio files rec1.flac, rec2.flac, rec3.flac, … and output the resulting segmentations as HTK label files under the directory label_dir, run:

ldc-bpcsad --channel 1 --output-dir label_dir rec1.flac rec2.flac rec3.flac ...

This will result in one label file per input file (e.g., rec1.lab, rec2.lab, …), each of the form:

0.00 1.05 nonspeech
1.05 3.55 speech
3.55 4.65 nonspeech
.
.
.
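
When the tool is driven from a Python script, the call above can be assembled programmatically. The following is a minimal sketch, not part of ldc-bpcsad itself: build_command and expected_label_path are hypothetical helper names, and the .lab naming assumes the default htk output format described in this section.

```python
from pathlib import Path


def build_command(audio_paths, output_dir, channel=1):
    """Return the argument list for an ldc-bpcsad call using the default HTK output."""
    return [
        "ldc-bpcsad",
        "--channel", str(channel),
        "--output-dir", str(output_dir),
        *[str(p) for p in audio_paths],
    ]


def expected_label_path(audio_path, output_dir):
    """Path of the .lab file expected for one input file (htk format only)."""
    return Path(output_dir) / (Path(audio_path).stem + ".lab")
```

The returned list can be passed to subprocess.run, and expected_label_path can then be used to check which label files were actually written.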

Script files 

It is also possible to specify the audio files and channels to be processed using a script file specified via the --scp flag. Currently, two script file formats are supported:

htk – HTK script file (default)
json – JSON script file

HTK script file

If --scp-fmt htk is specified, ldc-bpcsad will load the audio files to be segmented from an HTK script file. An HTK script file consists of a list of file paths, one path per line; e.g.:

/path/to/rec1.flac
/path/to/rec2.flac
/path/to/rec3.flac

For instance, if task.scp is the above HTK script file, then:

ldc-bpcsad --channel 1 --output-dir label_dir --scp-fmt htk --scp task.scp

is equivalent to:

ldc-bpcsad --channel 1 --output-dir label_dir /path/to/rec1.flac /path/to/rec2.flac /path/to/rec3.flac
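
For larger collections, the script file can be generated rather than written by hand. The sketch below assumes the recordings live under a single directory and share a file name pattern; write_htk_script is a hypothetical helper, not part of ldc-bpcsad.

```python
from pathlib import Path


def write_htk_script(audio_dir, scp_path, pattern="*.flac"):
    """Write the absolute path of every matching file under audio_dir, one per line."""
    paths = sorted(Path(audio_dir).resolve().rglob(pattern))
    Path(scp_path).write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return paths
```

The resulting file can then be passed to ldc-bpcsad via --scp with --scp-fmt htk.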

JSON script file

If --scp-fmt json is specified, ldc-bpcsad will load the audio files AND channels to be segmented from a JSON file. The JSON file should consist of a JSON array of objects, each containing the following three key-value pairs:

audio_path – Path to audio file to perform SAD on.
channel – Channel number of audio file to perform SAD on (1-indexed).
channel_id – Basename for output file containing SAD result.

E.g.:

[{
    "audio_path": "/path/to/rec1.flac",
    "channel_id": "rec1_c1",
    "channel": 1
}, {
    "audio_path": "/path/to/rec1.flac",
    "channel_id": "rec1_c2",
    "channel": 2
}, {
    "audio_path": "/path/to/rec2.flac",
    "channel_id": "rec2_c1",
    "channel": 1
}]

For instance, if task.json is the above JSON file, then:

ldc-bpcsad --output-dir label_dir --scp-fmt json --scp task.json

will output the following three HTK label files to label_dir:

rec1_c1.lab – result of SAD for channel 1 of rec1.flac
rec1_c2.lab – result of SAD for channel 2 of rec1.flac
rec2_c1.lab – result of SAD for channel 1 of rec2.flac

Note

When using a JSON script file, the --channel flag has no effect.
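
A JSON script file covering every channel of a set of recordings can be produced with a few lines of Python. This is a sketch under stated assumptions: build_json_script is a hypothetical helper, the number of channels per file is supplied by the caller, and the channel_id naming (stem plus _c and the channel number) simply mirrors the example above.

```python
import json
from pathlib import PurePosixPath


def build_json_script(recordings):
    """Build script entries from a mapping of audio path -> number of channels."""
    entries = []
    for audio_path, n_channels in recordings.items():
        stem = PurePosixPath(audio_path).stem
        for channel in range(1, n_channels + 1):  # channels are 1-indexed
            entries.append({
                "audio_path": audio_path,
                "channel_id": f"{stem}_c{channel}",
                "channel": channel,
            })
    return entries


def dump_json_script(recordings):
    """Serialize the entries as the JSON array expected by --scp-fmt json."""
    return json.dumps(build_json_script(recordings), indent=4)
```

Writing the string returned by dump_json_script to task.json gives a file that can be passed via --scp task.json --scp-fmt json.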

Output formats 

The output file format for SAD output can be specified via the --output-fmt flag. Currently, four options are available:

htk – HTK label file (default)
rttm – Rich Transcription Time Marked (RTTM) file
audacity – Audacity label file
textgrid – Praat TextGrid

HTK label file

If --output-fmt htk is specified, SAD output will be stored as HTK label files. Each label file contains one segment per line, each line having the form:

<ONSET>\t<OFFSET>\t<LABEL>

where:

ONSET – onset of segment in seconds from beginning of recording
OFFSET – offset of segment in seconds from beginning of recording
LABEL – segment label; either “speech” or “nonspeech”

The segments are stored in order with the following guarantees:

the onset of the first segment is always 0
the offset of the final segment is always equal to the recording duration
the offset of segment n equals the onset of segment n+1.

E.g.:

0.00 1.05 nonspeech
1.05 3.55 speech
3.55 4.65 nonspeech
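
The format and its ordering guarantees are easy to check from Python. The following sketch uses hypothetical helper names (parse_htk_labels, is_contiguous) and splits on any whitespace so that both tab- and space-separated lines are accepted.

```python
def parse_htk_labels(text):
    """Parse label file contents into (onset, offset, label) tuples."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        onset, offset, label = line.split()
        segments.append((float(onset), float(offset), label))
    return segments


def is_contiguous(segments, tol=1e-6):
    """Check that the first onset is 0 and each offset equals the next onset."""
    if not segments or abs(segments[0][0]) > tol:
        return False
    return all(
        abs(prev[1] - nxt[0]) <= tol
        for prev, nxt in zip(segments, segments[1:])
    )
```

The third guarantee, that the final offset equals the recording duration, needs the duration from the audio file itself and is therefore not checked here.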

RTTM file

If --output-fmt rttm is specified, SAD output will be stored as Rich Transcription Time Marked (RTTM) files. Each RTTM file contains one speech segment per line, with each line having the form:

SPEAKER <FILE-ID> <CHANNEL> <ONSET> <DURATION> <NA> <NA> speaker <NA> <NA>

where:

FILE-ID – file name; the basename of the audio file that the turn is on, minus extension (e.g., rec1_a)
CHANNEL – the channel number of the turn on the audio file (1-indexed)
ONSET – onset of turn in seconds from beginning of recording
DURATION – duration of turn in seconds

E.g.:

SPEAKER rec1 1 1.05 2.50 <NA> <NA> speaker <NA> <NA>
SPEAKER rec1 1 4.00 3.31 <NA> <NA> speaker <NA> <NA>
SPEAKER rec1 1 10.11 4.15 <NA> <NA> speaker <NA> <NA>
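
Since each RTTM line has a fixed field order, the speech turns can be recovered by splitting on whitespace. A minimal sketch, with parse_rttm as a hypothetical helper name:

```python
def parse_rttm(text):
    """Return (file_id, channel, onset, duration) for every SPEAKER line."""
    turns = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip blank lines and any other record types
        file_id = fields[1]
        channel = int(fields[2])
        onset = float(fields[3])
        duration = float(fields[4])
        turns.append((file_id, channel, onset, duration))
    return turns
```

The offset of a turn is its onset plus its duration. Unlike the HTK label files, nonspeech segments are not listed and must be inferred from the gaps between turns.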

Audacity label file

If --output-fmt audacity is specified, SAD output will be stored as Audacity label files. Because none of the optional features of this file format (e.g., frequency ranges) are used, the resulting files are identical to the HTK label files described above, and this option is functionally an alias for --output-fmt htk with a different file extension (HTK: .lab, Audacity: .txt).

Praat TextGrid

If --output-fmt textgrid is specified, SAD output will be stored as Praat TextGrid files. Each TextGrid file will contain a single IntervalTier named sad, consisting of a sequence of intervals whose attributes should be interpreted as follows:

xmin – onset of segment in seconds from beginning of recording
xmax – offset of segment in seconds from beginning of recording
text – segment label; either “speech” or “nonspeech”

E.g.:

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 4.65
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "sad"
        xmin = 0
        xmax = 4.65
        intervals: size = 3
        intervals [1]:
            xmin = 0
            xmax = 1.05
            text = "nonspeech"
        intervals [2]:
            xmin = 1.05
            xmax = 3.55
            text = "speech"
        intervals [3]:
            xmin = 3.55
            xmax = 4.65
            text = "nonspeech"
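
For quick inspection, the intervals can be pulled out of such a file with a regular expression. This is only a rough sketch that assumes the long TextGrid layout shown above (xmin, xmax, and text in that order under each intervals [n] entry); read_intervals is a hypothetical helper, and a dedicated TextGrid library is a better choice for anything beyond simple checks.

```python
import re

# Matches one "intervals [n]:" entry followed by its xmin, xmax, and text fields.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (\S+)\s*'
    r'xmax = (\S+)\s*'
    r'text = "([^"]*)"'
)


def read_intervals(textgrid_text):
    """Return (xmin, xmax, text) tuples for each interval in the file contents."""
    return [
        (float(xmin), float(xmax), text)
        for xmin, xmax, text in INTERVAL_RE.findall(textgrid_text)
    ]
```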

Postprocessing 

By default, ldc-bpcsad postprocesses its output to eliminate speech segments less than 500 ms in duration and nonspeech segments less than 300 ms in duration. While these defaults are suitable for SAD that is being done as a precursor to transcription by human annotators, they may be overly restrictive for other uses. If necessary, the minimum speech and nonspeech segment durations may be changed via the --speech and --nonspeech flags. For instance, to instead use minimum durations of 250 ms for speech and 100 ms for nonspeech:

ldc-bpcsad --channel 1 --output-dir label_dir --speech 0.250 --nonspeech 0.100 rec1.flac rec2.flac rec3.flac
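
To get a feel for how the two thresholds interact, the following sketch applies the same kind of duration filtering to a list of speech segments. It is an illustration only, not the tool's implementation: the order of operations (merge across short gaps first, then drop short speech) is an assumption, and postprocess is a hypothetical function name.

```python
def postprocess(speech_segs, min_speech=0.5, min_nonspeech=0.3):
    """Filter sorted (onset, offset) speech segments, all times in seconds.

    Assumed order: merge segments separated by short gaps, then drop
    segments that are still shorter than min_speech.
    """
    merged = []
    for onset, offset in speech_segs:
        if merged and onset - merged[-1][1] < min_nonspeech:
            # The nonspeech gap is too short, so extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return [(on, off) for on, off in merged if off - on >= min_speech]
```

With the default thresholds, segments at 0.0-0.2 s and 0.4-1.0 s are merged into a single 0.0-1.0 s segment, while an isolated 0.25 s segment is dropped; with min_speech=0.25 and min_nonspeech=0.1 the isolated segment is kept and the first two are no longer merged.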

Audio formats 

This section describes the input audio file formats supported by default. As audio IO is handled by the soundfile Python package, additional formats may be supported depending on your installed version of soundfile. To see if additional formats are supported, run:

ldc-bpcsad -h

and check the audio file formats list at the end of the help message.

Supported formats:

.aiff, .aif – AIFF (Apple/SGI)
.au, .snd – AU (Sun/NeXT)
.avr – AVR (Audio Visual Research)
.caf – CAF (Apple Core Audio File)
.flac – FLAC (Free Lossless Audio Codec)
.htk – HTK (HMM Tool Kit)
.iff – IFF (Amiga IFF/SVX8/SV16)
.mat, .mat4, .mat5 – Matlab 4.2/5.0 (GNU Octave 2.0/2.1)
.mpc – Musepack MPC (Akai MPC 2k)
.ogg, .vorbis – OGG Vorbis compressed audio
.paf, .fap – Ensoniq PARIS file format
.pvf – PVF (Portable Voice Format)
.rf64 – EBU RF64 enhancement of MBWF
.sd2 – Sound Designer 2 format
.sds – MIDI Sample Dump Standard
.sf – IRCAM SDIF (Institut de Recherche et Coordination Acoustique/Musique Sound Description Interchange Format)
.sph, .nist, .wav – NIST SPHERE format; SHORTEN compression is not supported
.voc – Sound Blaster VOC files
.w64 – Sonic Foundry 64-bit RIFF/WAV format
.wav – Microsoft .WAV RIFF format
.wve – Psion 8-bit A-law
.xi – Fasttracker 2 Extended Instrument format.

Debugging 

When ldc-bpcsad has a problem segmenting a file (e.g., bad file path, unsupported format, HTK error), it will log to STDERR that a problem was encountered and skip the file:

ldc-bpcsad --channel 1 --output-dir label_dir rec1.flac rec2.sph
SAD failed for channel 1 of "rec2.sph". Skipping. For more details rerun with the --debug flag.

To troubleshoot precisely why a file was skipped, rerun with the --debug flag. This will enable debug mode, which produces MUCH more voluminous output including, among other items:

whether or not the audio file exists and is in an understood format
whether or not the selected channel exists on the audio file
basic properties of audio file (e.g., format, number of channels, duration)
any exceptions that arose during decoding

E.g.:

ldc-bpcsad --debug --channel 1 --output-dir label_dir rec1.flac rec2.sph

SAD failed for channel 1 of "rec1.flac". Skipping. For more details rerun with the --debug flag.

DEBUG: COMMAND LINE CALL: /usr/local/bin/ldc-bpcsad --debug --channel 1 --output-dir label_dir rec1.flac rec2.sph
DEBUG: Flag "--n-jobs" is ignored for debug mode. Using single-threaded implementation.
DEBUG: Progress bar is disabled for debug mode.
DEBUG:
DEBUG: ########################################################################
DEBUG: Attempting SAD.
DEBUG: ########################################################################
DEBUG: Source audio file: rec1.flac
samplerate: 16000 Hz
channels: 1
duration: 1e+01:4.000 min
format: FLAC (Free Lossless Audio Codec) [FLAC]
subtype: Signed 16 bit PCM [PCM_16]
endian: FILE
sections: 1
frames: 9664000
extra_info: """
    File : rec1.flac
    Length : 6680195
    FLAC Stream Metadata
      Channels    : 1
      Sample rate : 16000
      Frames      : 9664000
      Bit width   : 16
      Vorbis Comment Metadata
        comment      : Processed by SoX
      End
      """
DEBUG:
DEBUG: Source channel: 1.
DEBUG:
DEBUG: Decoding chunk: CHUNK_ONSET: 0.000, CHUNK_OFFSET: 604.000, CHUNK_DUR: 604.000
DEBUG: Saving SAD to "label_dir/rec1.lab".
DEBUG: Output file format: htk.
DEBUG: ########################################################################
DEBUG: Attempting SAD.
DEBUG: ########################################################################
DEBUG: Error opening 'rec2.sph': File contains data in an unimplemented format.
DEBUG: To see supported formats, run:
DEBUG:
DEBUG:     ldc-bpcsad --help
WARNING: SAD failed for channel 1 of "rec2.sph". Skipping. For more details rerun with the --debug flag.
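
When processing large batches, it can be convenient to collect the skipped files automatically so that only those are rerun with --debug. The sketch below assumes the skip message keeps the wording shown in the examples above; find_failures and run_and_collect are hypothetical helper names.

```python
import re
import subprocess

# Assumption: the skip message has the wording shown in the example output.
FAILED_RE = re.compile(r'SAD failed for channel (\d+) of "([^"]+)"')


def find_failures(stderr_text):
    """Return (audio_path, channel) pairs for each skipped file, without repeats."""
    failures = []
    for channel, path in FAILED_RE.findall(stderr_text):
        item = (path, int(channel))
        if item not in failures:
            failures.append(item)
    return failures


def run_and_collect(args):
    """Run ldc-bpcsad with the given argument list and return the skipped files."""
    proc = subprocess.run(["ldc-bpcsad", *args], capture_output=True, text=True)
    return find_failures(proc.stderr)
```

For example, run_and_collect(["--channel", "1", "--output-dir", "label_dir", "rec1.flac", "rec2.sph"]) would return the files reported as skipped on STDERR.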

Usage 

Perform speech activity detection on audio files.

usage: ldc-bpcsad [-h] [--channel CHAN] [--scp SCP] [--scp-fmt SCP-FMT]
                  [--output-dir OUTPUT-DIR] [--output-fmt OUTPUT-FMT]
                  [--speech SPEECH-DUR] [--nonspeech NONSPEECH-DUR]
                  [--speech-scale-factor SPEECH-SCALE] [--disable-progress]
                  [--debug] [--n-jobs INT] [--version]
                  [audio-path ...]

Positional Arguments

audio-path: audio files to be processed

Named Arguments

--channel

channel (1-indexed) to process on each audio file (Default: 1)

--scp

path to script file (Default: None)

--scp-fmt

Possible choices: htk, json

script file format (Default: “htk”)

--output-dir

output segmentations to OUTPUT-DIR (Default: current directory)

--output-fmt

Possible choices: audacity, htk, rttm, textgrid

output file format (Default: “htk”)

--speech

filter speech segments shorter than SPEECH-DUR seconds (Default: 0.5)

--nonspeech

merge speech segments separated by less than NONSPEECH-DUR seconds (Default: 0.3)

--speech-scale-factor

post-multiply speech model acoustic likelihoods by SPEECH-SCALE (Default: 1.0)

--disable-progress

disable progress bar

Default: False

--debug

enable DEBUG mode

Default: False

--n-jobs, -j

set num threads to use (Default: 1)

--version

show program’s version number and exit

audio file formats: AIFF (Apple/SGI), AU (Sun/NeXT), AVR (Audio Visual Research), CAF (Apple Core Audio File), FLAC (Free Lossless Audio Codec), HTK (HMM Tool Kit), IFF (Amiga IFF/SVX8/SV16), MAT4 (GNU Octave 2.0 / Matlab 4.2), MAT5 (GNU Octave 2.1 / Matlab 5.0), MPC (Akai MPC 2k), OGG (OGG Container format), PAF (Ensoniq PARIS), PVF (Portable Voice Format), RAW (header-less), RF64 (RIFF 64), SD2 (Sound Designer II), SDS (Midi Sample Dump Standard), SF (Berkeley/IRCAM/CARL), VOC (Creative Labs), W64 (SoundFoundry WAVE 64), WAV (Microsoft), WAV (NIST Sphere), WAVEX (Microsoft), WVE (Psion Series 3), XI (FastTracker 2)