Speech, natural language understanding, and dialog resources

The resources below have been collected by members of the AVIOS board for those who would like to build speech applications or learn more about speech recognition. If you know of other resources that should be included, please contact board member Marie Meteer. Note that inclusion is not an endorsement by AVIOS.

Speech Recognition in the browser

Three Strategies

  • Speech manually captured and processed in the browser (Firefox and Chrome, using pocketsphinx.js)

  • Speech automatically captured by the browser and processed in the cloud (Chrome Web Speech API)

  • Speech manually captured by the browser and processed in the cloud (wit.ai, api.ai, IBM, Nuance NDEV)

 

pocketsphinx.js

  • local processing in the browser

  • based on CMU Pocketsphinx

  • available in browsers that support Web Audio

  • grammar only (JSGF format); no dictation

  • English and Mandarin

Web Speech API

  • Web Speech API in Chrome, since Chrome version 31

  • most Chrome platforms are supported now

  • dictation only (no grammars)

  • find out about support at Can I Use?
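As a rough illustration of the dictation-only style described above, the following TypeScript sketch starts one recognition pass and hands each final transcript to a callback. It assumes Chrome's vendor-prefixed `webkitSpeechRecognition` constructor (falling back to an unprefixed `SpeechRecognition` if present). The helper names `findRecognizer` and `startDictation`, the `scope` parameter, and the small interfaces are illustrative inventions for this sketch, not part of the API.

```typescript
// Minimal structural types for the parts of the Web Speech API used here.
// They are declared locally (an assumption of this sketch) so the code does
// not depend on any particular DOM typings being installed.
interface RecognitionAlternative { transcript: string; confidence: number; }
interface RecognitionResult {
  isFinal: boolean;
  length: number;
  [index: number]: RecognitionAlternative;
}
interface RecognitionEvent {
  resultIndex: number;
  results: { length: number; [index: number]: RecognitionResult };
}
interface Recognizer {
  lang: string;
  continuous: boolean;
  interimResults: boolean;
  onresult: ((event: RecognitionEvent) => void) | null;
  onerror: ((event: { error: string }) => void) | null;
  start(): void;
}
type RecognizerCtor = new () => Recognizer;

// Look up the recognizer constructor; returns null where it is unsupported.
function findRecognizer(scope: any = globalThis): RecognizerCtor | null {
  return scope.SpeechRecognition ?? scope.webkitSpeechRecognition ?? null;
}

// Start a single dictation pass. Returns the recognizer, or null if the
// browser does not expose the API.
function startDictation(
  onFinal: (text: string) => void,
  lang: string = "en-US",
  scope: any = globalThis,
  onError?: (code: string) => void,
): Recognizer | null {
  const Ctor = findRecognizer(scope);
  if (!Ctor) return null;

  const rec = new Ctor();
  rec.lang = lang;
  rec.continuous = false;      // stop after the first utterance
  rec.interimResults = false;  // deliver only final hypotheses
  rec.onresult = (event) => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      if (result.isFinal) onFinal(result[0].transcript);
    }
  };
  rec.onerror = (event) => { if (onError) onError(event.error); };
  rec.start();
  return rec;
}
```

In a page, this would typically be called from a click handler, for example `startDictation((text) => { /* use text */ })`. A `null` return value signals that the page should fall back to one of the other strategies.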

Some resources:

Manual Server-based Speech Recognition

“Manual” means that the developer is responsible for capturing the audio and sending it to a server for processing.

 

To capture speech in the browser:

  • HTML5: getUserMedia()

  • WebRTC is a free, open project that provides browsers and mobile applications with Real-Time Communications (RTC) capabilities via simple APIs.
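The capture half of the "manual" approach might look roughly like the TypeScript sketch below: request the microphone with `getUserMedia()`, record for a fixed duration with `MediaRecorder`, then release the microphone. The `CaptureEnv` indirection, the `*Like` interfaces, and the names `browserEnv` and `recordChunks` are hypothetical, introduced only so the sketch is self-contained. Uploading the result is deliberately left out, because endpoints and audio formats differ between providers.

```typescript
// Minimal structural types (assumptions of this sketch) so the code does not
// depend on DOM typings.
interface TrackLike { stop(): void; }
interface StreamLike { getTracks(): TrackLike[]; }
interface RecorderLike {
  ondataavailable: ((event: { data: unknown }) => void) | null;
  onstop: (() => void) | null;
  start(): void;
  stop(): void;
}
interface CaptureEnv {
  getUserMedia(constraints: { audio: boolean }): Promise<StreamLike>;
  createRecorder(stream: StreamLike): RecorderLike;
}

// In a browser, wire the sketch to the real APIs.
function browserEnv(): CaptureEnv {
  const g = globalThis as any;
  return {
    getUserMedia: (constraints) => g.navigator.mediaDevices.getUserMedia(constraints),
    createRecorder: (stream) => new g.MediaRecorder(stream),
  };
}

// Record from the microphone for durationMs and return the recorded chunks
// (Blob objects in a real browser).
async function recordChunks(
  durationMs: number,
  env: CaptureEnv = browserEnv(),
): Promise<unknown[]> {
  const stream = await env.getUserMedia({ audio: true }); // prompts the user
  const recorder = env.createRecorder(stream);
  const chunks: unknown[] = [];

  recorder.ondataavailable = (event) => { chunks.push(event.data); };
  const stopped = new Promise<void>((resolve) => {
    recorder.onstop = () => resolve();
  });

  recorder.start();
  setTimeout(() => recorder.stop(), durationMs);
  await stopped;

  // Release the microphone so the browser's recording indicator turns off.
  stream.getTracks().forEach((track) => track.stop());
  return chunks;
}
```

In a browser, the returned chunks can be combined into a single `Blob` and sent to whichever recognition service is in use. The request format is specific to each service and is not shown here.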

 

Text-to-Speech (TTS)

TTS in the browser

Server-based TTS

Natural language understanding

 

Combinations

Native (non-browser) Mobile Speech Recognition and TTS

Many of the browser-based tools described above also include versions for native operating systems.

 

Dialog Processing

OpenDial

 

CMU Ravenclaw

 

Speech Recognition Development

Language Modeling Toolkits

SRI Toolkit

 

Open Source Speech Recognizers

KALDI
Kaldi is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0. Kaldi is intended for use by speech recognition researchers.

CMU Sphinx
CMUSphinx represents over 20 years of CMU research, with state-of-the-art speech recognition algorithms for efficient speech recognition. CMUSphinx tools are designed specifically for low-resource platforms.

 

Evaluating speech recognition

NIST SCLite
Asclite is a multi-dimensional extension of the dynamic programming solution to Levenshtein edit distance calculations, capable of evaluating STT and SASTT systems during periods of overlapping, simultaneous speech.
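For readers new to this kind of scoring, the one-dimensional case that such tools build on can be sketched in a few lines of TypeScript: a word-level Levenshtein distance between a reference transcript and a recognizer hypothesis, divided by the reference length to give a word error rate. This is a simplified illustration, not the NIST implementation. The function name `wordErrorRate` is an invention for this sketch, and it assumes whitespace tokenization, no text normalization, and equal costs for substitutions, insertions, and deletions.

```typescript
// Word error rate via dynamic-programming edit distance over words.
function wordErrorRate(reference: string, hypothesis: string): number {
  const tokenize = (s: string): string[] =>
    s.trim().split(/\s+/).filter((w) => w.length > 0);
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) {
    throw new Error("reference must contain at least one word");
  }

  // prev[j] holds the distance between the first i-1 reference words and
  // the first j hypothesis words; row 0 is j insertions.
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);

  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i]; // i deletions against an empty hypothesis
    for (let j = 1; j <= hyp.length; j++) {
      const substitution = prev[j - 1] + (ref[i - 1] === hyp[j - 1] ? 0 : 1);
      const deletion = prev[j] + 1;      // reference word missing from hypothesis
      const insertion = curr[j - 1] + 1; // extra word in hypothesis
      curr.push(Math.min(substitution, deletion, insertion));
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}
```

For example, scoring the hypothesis "the cat sat on mat" against the reference "the cat sat on the mat" yields one deletion over six reference words, a word error rate of about 0.167.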

 

Other tools

Sox: The “Swiss Army Knife” of audio processing

LDC’s original Transcriber
