aaltoasr-rec / aaltoasr-align User's Guide

Introduction

aaltoasr-rec and aaltoasr-align are two command-line scripts for using the AaltoASR tools for simple speech recognition and segmentation (forced-alignment) tasks. aaltoasr-adapt provides additional rudimentary speaker adaptation support.

To use the tools, load them in your path with module load aaltoasr. (On the Aalto SPA system, use module load aaltoasr-rec instead.)

Notes and disclaimers

The segmentation (and recognition) work best for input files of moderate length; preferably, a single utterance. It is possible to use longer files, but the results may vary. Unfortunately, there is no support for automatically splitting a longer audio file with aaltoasr-align, as there would be no way to do the corresponding splitting on the input transcript. For aaltoasr-rec, the -s argument can be used to automatically split the file into segments of the desired size, in order to take advantage of parallel processing.

The acoustic models use cross-word triphone models (i.e., each phoneme can have different models depending on the surrounding context), trained using the conventional maximum-likelihood scheme. As the training does not (deliberately) focus on segmentation ability, depending on the context, the phoneme boundaries could be very far indeed from their "linguistically correct" positions.

Possible future add-on project: discriminative training of a monophone model with a segmentation-related criterion.

Similarly, the statistical morphemes (generated with the Morfessor method) are inspired by the MDL principle, and make no pretense of being any sort of linguistic construct.

The speed/accuracy tradeoff of the recognition can be controlled by various parameters. The default values favour accuracy over speed, so (depending on the input signal) recognition can easily take up to 20 times the duration of the input audio.

Usage examples

Recognize the speech in the audio file speech.wav:

aaltoasr-rec speech.wav

Recognize the speech of a long audio file speech.wav; speed up the process by splitting the file into approximately 5-minute segments and running the recognition on four cores in parallel:

aaltoasr-rec -s 300 -n 4 speech.wav

Recognize the speech in speech.wav, but also generate all supported levels of segmentation; write output in speech.txt and the segmentations additionally in TextGrid format to speech.textgrid:

aaltoasr-rec -m trans,segword,segmorph,segphone \
  -o speech.txt -T speech.textgrid speech.wav

Given the speech recording speech.wav and a corresponding plaintext transcription in speech.txt, produce a TextGrid alignment in speech.textgrid (along with a word-level alignment to standard output):

aaltoasr-align -T speech.textgrid -t speech.txt speech.wav
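The TextGrid files use Praat's text format. As a hedged illustration (the function name is mine, and the sketch assumes the usual long-format TextGrid layout rather than anything verified against aaltoasr's exact output), the word intervals can be pulled out with a few lines of Python:

```python
import re

def read_intervals(textgrid_text):
    """Collect (xmin, xmax, text) triples from a long-format TextGrid string."""
    intervals = []
    cur = None  # the interval currently being collected, or None
    for raw in textgrid_text.splitlines():
        line = raw.strip()
        # An "intervals [k]:" header starts a new interval record.
        if re.match(r'intervals \[\d+\]:', line):
            cur = {}
            continue
        if cur is None:
            continue  # skip file- and tier-level xmin/xmax lines
        m = re.match(r'(xmin|xmax|text) = (.*)', line)
        if m:
            key, val = m.groups()
            cur[key] = val.strip('"') if key == 'text' else float(val)
            if len(cur) == 3:
                intervals.append((cur['xmin'], cur['xmax'], cur['text']))
                cur = None
    return intervals
```

Each returned triple gives the interval's start time, end time, and label in seconds, in file order.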

Recognize or align multiple files at once, on (up to) 4 cores:

aaltoasr-rec -n 4 speech1.wav speech2.wav
aaltoasr-align -n 4 -t speech1.txt -t speech2.txt speech1.wav speech2.wav
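When aligning a larger batch, typing out the paired -t arguments by hand gets tedious. A minimal sketch (the helper name is mine; it assumes each transcript sits beside its audio file with a .txt suffix, as in the example above) that builds the same command line programmatically:

```python
from pathlib import Path

def align_args(wavs, cores=4):
    """Build an aaltoasr-align command line for a batch of wav files,
    pairing each file with a same-named .txt transcript."""
    cmd = ['aaltoasr-align', '-n', str(cores)]
    for wav in wavs:
        cmd += ['-t', str(Path(wav).with_suffix('.txt'))]
    return cmd + list(wavs)
```

The resulting list can be passed directly to subprocess.run.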

See the section "Adaptation" for examples of aaltoasr-adapt use.

Option reference

The aaltoasr-rec and aaltoasr-align tools, for the most part, share the same command line arguments. Differences have been noted in the descriptions of individual arguments.

The overall command lines have the form:

aaltoasr-rec [options] input [input ...]
aaltoasr-align [options] -t transcript [-t transcript ...] input [input ...]

The input file can be in any format accepted by the sox utility; type sox -h | grep "FILE FORMATS" (after loading the aaltoasr module) for a list.
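If an input needs pre-conversion anyway, a conservative target is mono wav. A hypothetical helper (the function name and the 16 kHz rate are my assumptions, not requirements stated by this guide; the -r and -c flags are standard sox output options) that builds such a conversion command:

```python
from pathlib import Path

def sox_to_wav(path, rate=16000):
    """Build a sox command converting the given file to a mono wav
    at the given sample rate, next to the original."""
    out = Path(path).with_suffix('.wav')
    return ['sox', str(path), '-r', str(rate), '-c', '1', str(out)]
```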

The following options are available:

Adaptation

The aaltoasr scripts support a limited form of acoustic model adaptation, which can be helpful if the speaker (or recording environment) differs considerably from what the models expect. To use the adaptation, first train a profile with aaltoasr-adapt, then pass it to aaltoasr-rec or aaltoasr-align as follows:

aaltoasr-adapt -o speaker.conf -t training.txt training.wav
aaltoasr-rec -a speaker.conf test.wav

For a single test file with unknown contents, it is also possible to do unsupervised adaptation with a two-pass recognition process:

aaltoasr-adapt -o speaker.conf test.wav
aaltoasr-rec -a speaker.conf test.wav

Similarly, it is possible to do a two-pass alignment as follows:

aaltoasr-adapt -o speaker.conf -t test.txt test.wav
aaltoasr-align -a speaker.conf -t test.txt test.wav

Finally, a full three-pass alignment (for details, see below) can be done with:

aaltoasr-adapt -o pass1.conf -t test.txt test.wav
aaltoasr-adapt -p pass1.conf -m -o pass2.conf -t test.txt test.wav
aaltoasr-align -a pass2.conf -t test.txt test.wav

Two adaptation styles are supported: a single, global feature-domain CMLLR transformation (used by default), and a more detailed model-based regression-tree Gaussian CMLLR adaptation (enabled with the -m flag). The former needs only a few seconds of adaptation data, while the latter can produce a better-adapted model if sufficient data is available. Both styles can be applied in supervised mode (the transcript of the adaptation data is known) or unsupervised mode (the contents of the adaptation data are unknown).

Supervised adaptation is implemented by aligning the transcript and the audio with aaltoasr-align, while unsupervised adaptation uses aaltoasr-rec instead. Whether unsupervised adaptation is beneficial at all depends on the quality of this initial recognition step.

Multiple adaptation input files can be used in a manner consistent with aaltoasr-rec/aaltoasr-align. If a transcript is provided with the -t parameter to aaltoasr-adapt, supervised adaptation will be done; if the -t parameter is not used, the adaptation is unsupervised.

You can also perform multipass adaptation by providing a previously generated speaker.conf to aaltoasr-adapt with the -p flag. The specified adaptation will then be used when aligning or recognizing the adaptation data. A typical use was shown in the last example above: first train a single global transformation, then use it to realign the adaptation data and train a more detailed model-based adaptation.

The aaltoasr-adapt script knows of the following options:

Technical details

The individual executables of the AaltoASR tools support a number of features not covered by this guide. They can be found in the bin directory next to the scripts directory; type which aaltoasr-rec to locate them. The GitHub page will hopefully at some point contain detailed documentation of them.

A reasonable way to use custom acoustic models is to make a local copy of the (very short) aaltoasr-rec (or aaltoasr-align) script, and have it modify the aaltoasr.models list before initializing the AaltoASR object. Absolute paths can be used.