Author's note September 1999: In addition to The Circuit Rider (the magazine of the United States Court Reporters Association), this article also appeared in the September/October 1995 issue of the Stenomask Reporter. Although this article was written four years ago, I believe that all of my conclusions are still accurate, including the one that mask reporters would be using speech recognition systems by now.
Author's note May 2004: Mask reporters (now called "voice writers") are using speech recognition systems today. There's a whole chapter about it in my book, The Closed Captioning Handbook.
The popular media often picks up tales from speech recognition technology firms of how court reporting will be an obsolete profession any day now. To those not up-to-speed on the technology, this is a frightening thought. Prospective students are worried about whether starting court reporting school is a good idea. Existing students are wondering whether to stay.
Just how good IS speech recognition these days?
I attended ASAT '95 in San Francisco the first week of April. It's a Speech Technology (both recognition and synthesis) show. I visited every booth they had and really checked out the state of the art from some of the companies developing the underlying technologies for tomorrow's speech recognition systems - folks like Philips and AT&T.
The products and technologies fall into the following categories:
SPEECH SYNTHESIS is the creation of speech electronically. In other words, making a computer talk. This isn't of much interest to the court reporting profession, where the objective is to turn speech to print rather than vice-versa.
SPEECH RECOGNITION is computer comprehension of speech. There are several broad classifications of speech recognition, including discrete speech vs. continuous speech, speaker-dependent vs. speaker-independent, and context-sensitive vs. context-insensitive.
DISCRETE SPEECH RECOGNITION requires that each word be an individually identifiable unit. Obviously, this isn't the way we talk. During normal speech, words are run together, and even slurred (as in "gonna" for "going to"). To make speech recognition easier, many systems require a pause between words. A typical requirement is 100 milliseconds (one tenth of a second). Given that it takes about 2/10 of a second to say a typical word, each word plus its pause takes about 3/10 of a second, which puts a theoretical maximum of 200 words per minute on discrete speech recognition. Needless to say, it will never work in a courtroom!
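The 200-words-per-minute ceiling follows directly from the two durations quoted above. A minimal sketch of the arithmetic, with a function name chosen here purely for illustration:

```python
def max_words_per_minute(word_seconds: float, pause_seconds: float) -> float:
    """Upper bound on speaking rate when every word must be followed by a pause."""
    return 60.0 / (word_seconds + pause_seconds)


# Figures from the article: about 0.2 s per word, 0.1 s required pause.
rate = max_words_per_minute(word_seconds=0.2, pause_seconds=0.1)
print(f"Theoretical maximum: {rate:.0f} words per minute")
```

Shortening the required pause raises the ceiling, but as long as any pause is required, the speaker still cannot talk the way people normally do.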
CONTINUOUS SPEECH RECOGNITION is the hook everyone's hanging their hat on. This is the ability to recognize words exactly as they're spoken, slurs and all. The system uses a technology known as the "Hidden Markov Model" to separate the words into phonemes (individual sounds), and then reassembles them into words.
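To give a feel for what a Hidden Markov Model does, here is a toy sketch of the Viterbi algorithm, one standard way of finding the most likely sequence of hidden states (here, phonemes) behind a series of observed sounds. The vendors at the show did not say which decoding method they use, and the phoneme names, sound labels, and probabilities below are all made up for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely sequence of hidden states for the observations."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (prev[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)

    # Start from the most probable final state and follow the back-pointers.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical two-phoneme model for the word "go".
states = ["g", "oh"]
start_p = {"g": 0.8, "oh": 0.2}
trans_p = {"g": {"g": 0.3, "oh": 0.7}, "oh": {"g": 0.1, "oh": 0.9}}
emit_p = {
    "g": {"burst": 0.9, "vowel": 0.1},
    "oh": {"burst": 0.2, "vowel": 0.8},
}

phonemes = viterbi(["burst", "vowel", "vowel"], states, start_p, trans_p, emit_p)
print(phonemes)  # ['g', 'oh', 'oh']
```

A real recognizer works with thousands of states and with probabilities learned from recorded speech, but the idea is the same: pick the phoneme sequence that best explains the sounds, then look up the words those phonemes spell.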
SPEAKER-DEPENDENT systems are trained for a single voice. This is the technology used by stenomaskers. The system is trained to understand their pronunciations, inflections, and accents, and can run much more efficiently and accurately because it is tailored to the speaker. This is analogous to the way a CAT system is trained to a specific court reporter using dictionaries, phonics tables, theory sheets, and include files.
SPEAKER-INDEPENDENT systems are designed to deal with anyone, as long as they're speaking English. To do this, the scientists had to figure out what parts of speech are generic, and which ones vary from person to person. A spin-off of this speech recognition technology is that the speaker-dependent parts have now been programmed into security systems which respond only to a given individual's voice, as shown in movies like "Sneakers."
CONTEXT-SENSITIVE systems increase their accuracy by anticipating or limiting what can be said at any given time. For example, a speech-recognition-based hotel wake-up call system might ask you what time you'd like to be awakened. It can then assume that whatever you say will represent a time of day. If you say anything else, it will not be able to recognize it. Context-sensitive systems may actually have large vocabularies, but only a small portion of that vocabulary will be activated at a time.
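The wake-up call example can be sketched in a few lines. This is only an illustration of the idea of an "active" vocabulary, not how any particular product works; the word lists and the function name are hypothetical.

```python
# Hypothetical word lists. A real system would hold far more words.
TIME_WORDS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "fifteen", "thirty", "forty-five",
    "o'clock", "a.m.", "p.m.",
}
FULL_VOCABULARY = TIME_WORDS | {"room", "service", "checkout", "please", "taxi"}


def recognize(words, active_vocabulary):
    """Accept the utterance only if every word is in the active vocabulary."""
    if all(word in active_vocabulary for word in words):
        return words
    return None  # The system cannot recognize anything outside the context.


# After asking "What time would you like to be awakened?", only time words
# are active, even though the full vocabulary is larger.
print(recognize(["six", "thirty", "a.m."], TIME_WORDS))  # ['six', 'thirty', 'a.m.']
print(recognize(["room", "service"], TIME_WORDS))        # None
```

With fewer words to choose among at each point, the system has fewer ways to guess wrong, which is where the accuracy gain comes from.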
CONTEXT-INSENSITIVE systems allow you to say anything, any time. Typically, they have dictionaries in the neighborhood of 20,000 words.
Now, for some observations from the show:
Overall, I would rate the tools available for speech-based editing as "marginally useful." I would rate full-speed realtime speaker-independent speech recognition as a long way from being able to compete with court reporting technology. I asked one vendor (who claims to be the leader in the core technology, as several of them do) when we'd be able to plug a system into a television newscast and have it create captions in realtime. He said "two years." I asked how long before it could be done on a PC-type computer (they're using $20,000 Silicon Graphics workstations). He said we'd have to wait until PCs were far more powerful than today's Pentiums and PowerPCs. I asked how long before it could get ALL of the words in the newscast, including live interviews and remotes. He said "never."
When the court reporting industry does see competition from these technologies (which it will, sooner or later), it will probably come from mask reporters first. These folks are trained to speak clearly and distinctly already, and my guess is that they will probably be able to achieve reasonable results within the next few years. By "reasonable," I'm speaking of error levels similar to what it takes to pass the CRR (4% or less).