Does CAT define the difference between steno and voice?
Ten years ago, there was a clear difference between the output of a machine reporter and that of a mask reporter. Machine reporters had CAT systems that could produce realtime. Mask reporters didn't. Now, as the debate about admitting mask reporters as members rages through NCRA, things have changed.
Mask reporting was born in the 1940s, when Horace Webb, a Gregg shorthand writer from Chicago, built the first usable silenced microphone. Back then, court reporters (both pen and machine) often took down the proceedings and then dictated them into a tape recorder to be transcribed. The concept behind mask reporting is a simple one: eliminate the shorthand and do the dictation during the proceedings. The problem Webb faced was that he needed to speak clearly enough to produce a clean tape, while being quiet enough to be unnoticed.
The mask is simply a set of sound baffles that fit around the microphone and sit snugly against the mask reporter's face. The baffles prevent sound from echoing inside the mask, which would distort the recorded sound, and prevent people nearby from hearing the dictation. The basic design of the mask has not changed since the '40s.
First, let's clarify some terminology. Computers that interpret or transcribe human speech are performing speech recognition. Computers that identify people by their voice patterns are performing voice recognition. The two terms are not interchangeable, and voice recognition technology is not currently used in the captioning business.
Stenotype reporting works by breaking speech down into chunks called "strokes." Each stroke roughly corresponds to a syllable, and the system is basically phonetic. The translation engine in a CAT system converts strokes into either text or commands, so a court reporter can define strokes to perform editing functions or insert include files into the text.
Speech recognition works by breaking speech down into chunks called "phonemes." Each phoneme corresponds to a sound, and the system is strictly phonetic. The speech engine in a speech recognition system converts phonemes into text or commands, just like a machine writer's CAT system.
Early speech recognition systems required discrete input. Each word had to be spoken separately for the system to work. Slurring two words together, as English speakers do, made the speech incomprehensible to the program. People can be trained to speak...with...a...pause...between...words, but it feels unnatural, and it places serious limitations on the input speed. It's virtually impossible to separate words when speaking at 150 words per minute or more.
Continuous speech recognition, on the other hand, works on speech the way we actually talk. It requires massive amounts of computer power by the standards of the 1980s, but we can buy that kind of computer power at any corner electronics store for $1,000 these days. While discrete recognition falls apart at higher speeds, continuous recognition actually improves at normal speaking rates, making it the obvious candidate for mask reporters.
The Holy Grail of speech recognition is speaker independence: a system that functions no matter who is speaking or whether the system has been trained for them or not. With such a system, a microphone (or series of microphones) could be dropped into a courtroom or deposition suite, and text would flow out of the computer with no human intervention.
There is no such beast today. Add the requirement that the text be not only accurate, but correctly punctuated with reliable speaker identification, and the engineers I've interviewed almost unanimously agree that it is decades away at least, and may not happen in our lifetimes.
What we have is speaker-dependent speech recognition: systems that must be "trained" for one individual speaker. Again, the obvious candidate for a mask reporter's CAT system.
With the advent of usable speech recognition systems, the mask reporting profession felt it was time for an identity change. According to the National Verbatim Reporters Association Web site (www.nvra.org), "voice writing" describes "the method of court reporting...formerly called 'stenomask.'" Although the change in name was prompted by the new tools, even mask reporters using tape recorders are now called voice writers.
For this article, however, we'll focus strictly on voice writers who use speech recognition systems, and not on simple recording and playback systems.
Machine reporters work with CAT software that accepts input from a stenotype keyboard and translates it into text. Voice writers work with CAT software that accepts input from a microphone and translates it into text.
Tape recorders and digital recorders are not a required part of either system, although both machine reporters and voice writers have the ability to record ambient sound as a backup and synchronize it with their record.
Both types of CAT software can be trained to handle inputs as more than just words. To start a new paragraph with speaker ID for a specific attorney, a machine writer might write STKPWHRAO, and a voice writer might say "Spee-one." Anything a machine writer defines with strokes, a voice writer defines with phonemes.
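The parallel can be sketched in a few lines of Python. This is only an illustration, not how any particular CAT product stores its dictionary: the Entry type, the translate function, the speaker table, and every sample entry other than STKPWHRAO and "Spee-one" are hypothetical, and the voice writer's spoken trigger is assumed to arrive already recognized as text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    kind: str   # "text" or "command"
    value: str  # the text to emit, or the name of a command


# Hypothetical dictionaries. Both use the same Entry shape; only the
# keys differ: steno strokes on one side, spoken triggers on the other.
STENO_DICT = {
    "STKPWHRAO": Entry("command", "speaker_1"),
    "KAT": Entry("text", "cat"),
}
VOICE_DICT = {
    "Spee-one": Entry("command", "speaker_1"),
    "cat": Entry("text", "cat"),
}

# Assumed speaker table; a real system would fill this in per job.
SPEAKERS = {"speaker_1": "MR. SMITH"}


def translate(units, dictionary):
    """Turn a sequence of input units (strokes or spoken triggers) into text."""
    out = []
    for unit in units:
        entry = dictionary.get(unit)
        if entry is None:
            out.append(f"[{unit}] ")  # leave untranslated input visible
        elif entry.kind == "command":
            out.append(f"\n{SPEAKERS[entry.value]}: ")  # new paragraph with speaker ID
        else:
            out.append(entry.value + " ")
    return "".join(out)


steno_text = translate(["STKPWHRAO", "KAT"], STENO_DICT)
voice_text = translate(["Spee-one", "cat"], VOICE_DICT)
```

Under these assumptions, both calls produce the same transcript fragment. Only the keys of the dictionary change.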
When comparing machine-based CAT with voice-based CAT, we naturally tend to look for the differences rather than the similarities. The systems have identical goals: accept input from the court reporter and produce a transcript. One uses a microphone and the other uses a stenotype machine. Is that all that's different? No.
The machine shorthand flowing from a steno machine is purely digital. One FPLT is just like every other FPLT. Either you pushed that key or you didn't. Either you correctly stroked that word or you didn't. When a machine reporter defines a stroke, that stroke will translate correctly every single time the reporter writes it. Only when a stroke is not recognized will the CAT system's automated features step in and attempt to fix the input.
The sound waves flowing from a voice writer's microphone to the computer are analog. Speech-based CAT software doesn't deal with certainties. It deals with probabilities. A speech engine might spit out a given piece of text only when it is at least 87% sure that's what the reporter said. It's not really that simple, and that specific percentage was just used as an example. The point, however, is that a machine writer's CAT software is always 100% sure of what the reporter wrote. A voice writer's CAT software is always interpreting and basing decisions on probabilities.
People do the same thing subconsciously all the time. If we're not sure what we heard, we pick out the most likely words (what we thought we heard) and plug in the one that fits the context best. It's similar to automatic conflict resolution in machine CAT, but far more sophisticated.
Because analog input adds this extra layer of complexity, and because CAT got a late start in the mask reporting field, machine writers' CAT systems have been several years ahead in refinements and in adopting new concepts. Today, however, the difference between them has blurred.
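To make the contrast concrete, here is a rough Python sketch of the two decision styles. It is a toy under stated assumptions, not a description of any real speech engine: the function names, the scoring scheme (acoustic score multiplied by a context weight, then normalized), and the sample numbers are all invented for illustration, and the 0.87 threshold simply echoes the example figure used above.

```python
# Steno side: an exact lookup. A defined stroke gives the same result every time.
STENO_DICT = {"FPLT": "."}  # hypothetical one-entry dictionary


def steno_translate(stroke):
    """Return the defined text for a stroke, or None if it is undefined."""
    return STENO_DICT.get(stroke)


# Speech side: weigh the candidates and commit only when confident enough.
THRESHOLD = 0.87  # example figure only


def speech_translate(candidates, context_weight):
    """Pick the most likely word, or None if confidence is below THRESHOLD.

    candidates:     {word: acoustic score}, how much the audio sounded like it
    context_weight: {word: weight}, how well the word fits the surrounding text
    Returns a (word_or_None, confidence) pair.
    """
    combined = {
        word: score * context_weight.get(word, 1.0)
        for word, score in candidates.items()
    }
    total = sum(combined.values())
    if total == 0:
        return None, 0.0  # nothing usable to choose from
    best = max(combined, key=combined.get)
    confidence = combined[best] / total
    if confidence >= THRESHOLD:
        return best, confidence
    return None, confidence


# "their" and "there" sound identical, so the audio alone cannot decide.
sounds_like = {"their": 0.5, "there": 0.5}
no_context = speech_translate(sounds_like, {})
with_context = speech_translate(sounds_like, {"there": 9.0})
```

With no context the toy engine is only half sure and declines to commit. Once the context weight favors "there," its confidence clears the threshold. The steno lookup never has to make that kind of call.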
Flip through this issue of the JCR. You'll see ads from a familiar collection of CAT companies that have been advertising here for years. Machine reporters have had a variety of choices in CAT software. In the end, they'll all produce the same transcript, but they have different ways of getting there. Each vendor has a loyal following, and each steno-based CAT program has its advantages and disadvantages.
In the voice writing world, choices have been more limited. In 2002, a voice writer could choose between AudioScribe and StenoScribe. That was it, although voice writers interested in realtime captioning also had the ISIS software from Voice to Text, LLC, and a variety of new programs coming out from offline captioning companies.
In 2003, things changed significantly with the announcement that two machine CAT vendors were adding speech recognition capabilities to their systems. Today, voice writers can buy the ProCAT software or the Eclipse software (which uses the SpeechGATE interface from AudioScribe), two of the same packages that have been popular with machine reporters for well over a decade. Other vendors of steno-based CAT software are likely to release voice interfaces soon as well.
The input mechanism is different. The editing software is the same. The final transcripts are the same. But is realtime accuracy the same?
There is one fundamental difference between machine CAT and speech CAT that dramatically affects accuracy. A machine writer can switch to a different system and achieve equivalent accuracy immediately. Just save your translation dictionary in RTF/CRE format from the old system, load it into your new system, tweak it for a few hours, and you're off and running.
Voice writers must spend large amounts of time training their systems, and moving to a different software package requires starting over. Changing CAT vendors is not a casual decision for a voice writer. Even the most accomplished writers will produce error-laden output the first time they sit down with a new system. After a training period where the voice writer and the CAT system get used to each other, accuracy improves markedly. The longer the software is trained, the better the accuracy gets. This is analogous to developing a CAT dictionary, but the voice writer's "dictionary" isn't portable.
Let's compare apples and apples, though.
Consider two fictional reporters. Bob is a voice writer who has passed the RVR exam (NVRA's realtime certification) and has been working with the same voice-based CAT system for two years. Sue is a machine writer who has passed the CRR exam and has been on her CAT software for two years. Both are skilled at their jobs. Can Bob produce realtime text equivalent in accuracy and quality to what Sue produces?
A few years ago, the answer would have been determined by their technology. Speech-based CAT software just wasn't up to the level of machine-based CAT software. Speech recognition software hadn't been "tweaked in" enough, and the average notebook computer wasn't fast enough. No matter how good Bob was, he'd be hard-pressed to match Sue's realtime.
Today, the answer depends on Bob and Sue. Requirements for the RVR and the CRR are almost identical. Speech-based CAT systems have matured dramatically. If Bob's dictation is clear enough, smooth enough, and fast enough, his software will allow him to match Sue's quality. If his realtime isn't as good as Sue's, he can't blame his CAT system.
As I write this article in February 2004, voice writers are producing realtime in courts and in depositions. They can send their output to LiveNote, Summation, or CaseView. The technology is real.
Voice writers doing captioning work have an additional advantage. They can work with a headset microphone rather than a mask. Since that's what today's speech engines were designed for, it boosts quality while decreasing fatigue. Oddly, during the research for my new captioning book, I was unable to find a single television show anywhere in the country that's regularly captioned by a voice writer. Realtime captioning by voice writers was limited to trial runs, demonstrations, and a few single shows.
CAT systems for voice writers are not a panacea. A voice writer has to be able to speak clearly and insert punctuation and speaker IDs. Just as a fumble-fingered individual will never become a machine writer, someone who mumbles, stutters, or slurs won't become a voice writer.
Twenty years ago, a machine reporter had to be extraordinarily talented to produce clean realtime. A reporter fresh out of school had to struggle against software that was crude by today's standards, and had probably been taught a reporting theory laden with conflicts and word-boundary problems. Today, mundane formatting tasks are handled automatically by the software. The program can compensate, to some extent, for the reporter's mistakes. Editing that used to take several strokes can be done with one, or even built into a dictionary entry so it takes no extra effort at all.
CAT systems for voice writers are still catching up. A new voice writer of moderate skill will produce only moderate realtime. Fewer than a dozen voice writers have passed the RVR exam, largely because it takes a high degree of skill to pull top-notch realtime out of the speech recognition systems. Hand them a new system today, and they'll have to spend six months training it before they're back at their current level.
But the way has been paved. Software developers are taking their experience from machine CAT and applying it to voice CAT. Improvements in the technology are happening faster than many working reporters can keep up with.
Will there continue to be a difference between what steno reporters do and what voice writers do? Absolutely.
Will it be the CAT software that defines that difference? Not anymore.