Computer Code Voice Transcription
Sean W Hennessy
The Scripps Research Institute
Writing computer code by speaking has long been considered by many people to be a subject of science fiction. The ability to code by speaking instead of typing is very attractive, especially for people who have repetitive stress injuries caused by the computer keyboard. We present here the design of the first system that enables programming by voice rather than by hand. One implementation of the system, which we call Happy Hands, is a commercial product under the name The Happy Hands Java Speech Editor [1].
Prior work in the field of computer document transcription via speech recognition consists of simple applications for writing common English text documents, such as the SpeakPad [2] that comes with ViaVoice. These systems use the entire English language as the recognition target. In contrast, Happy Hands uses specialized speech grammars derived from the context of the computer code as the recognition target. Additionally, these prior systems output results as unformatted strings of words, whereas the output of Happy Hands is properly formatted and punctuated computer code. Computer code dictation in particular has been considered in theory. For example, “Spoken-word direction of computer program synthesis” by Alvin J. Surkan [5] presents ways to set up vocabularies and “agents” that direct the creation of computer code. Similarly, “Programming by Voice, VocalProgramming” by Stephen C. Arnold [6] touches on the idea of code insertion, but not to the level of development of the system described here.
The development of Happy Hands was guided by five main goals. The first and foremost was to reduce repetitive stress injuries caused by the computer keyboard. Toward this end, a number of related goals were pursued. The second goal was to raise the user's level of thought from the detailed syntax of the code to a higher-level idea of what he wishes to accomplish. The third goal was that the system be context sensitive. We wanted to be able to speak every element of the computer code without having to edit it afterwards by hand. This means that symbols such as method names needed to be recognizable from speech and needed to come into the resulting code directly. The fourth goal was to make the editing of existing code as easy as creating new code. This includes operations such as removing a specified block of code, performing a copy and paste operation via voice, and replacing a sub-expression with new code. The fifth goal was to allow code to be entered with the keyboard. A few editing patterns seem to lie more in the domain of the text buffer than in the logical meaning of the code; therefore the system had to resolve both speech input and keyboard input to a common state, smoothly and unobtrusively. This proved to be one of the most challenging parts of the design.
Design
Happy Hands is organized into units that combine to make the large scale data flow (Fig 1).
(1) Speech recognizer
The speech recognizer translates voice input into individual tokens. Happy Hands makes use of off-the-shelf speech recognition engines. Both the ViaVoice recognizer from IBM [3] and the Microsoft recognizer version 4 [4] work well.
Present day speech recognition engines use two systems for identifying words. In one system, the recognizer assumes the user is speaking in a conversational style and matches segments of audio data against any word in the entire English language, giving a list of words as a result. In the other system, the recognizer is configured with a more limited grammar consisting of a tree of rule alternatives, rule sequences, rule counts and tokens. (This is similar to the widely known regular expressions for pattern identification in a document.) The rule grammar target is a better choice for speech coding because of the formal nature of computer code, in which only declared symbols, keywords, operators, and punctuation may appear. Assembling a grammar composed of these symbols, keywords, and operators defines a language enabling one to code by speaking.
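The rule grammar idea can be illustrated with a minimal Java sketch. All class names here (Rule, Token, Sequence, Alternatives, Repeat) are hypothetical and are not taken from Happy Hands or from any speech engine API; a real recognizer matches audio against a compiled grammar, whereas this sketch only checks an already recognized word sequence against a tree of alternatives, sequences, counts and tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical rule-grammar node types, for illustration only.
abstract class Rule {
    // Returns every word position reachable after matching this rule at `pos`.
    abstract List<Integer> match(List<String> words, int pos);

    // A phrase is accepted when some match consumes all of its words.
    boolean accepts(String phrase) {
        List<String> words = Arrays.asList(phrase.trim().split("\\s+"));
        return match(words, 0).contains(words.size());
    }
}

class Token extends Rule {
    final String word;
    Token(String word) { this.word = word; }
    List<Integer> match(List<String> words, int pos) {
        List<Integer> out = new ArrayList<>();
        if (pos < words.size() && words.get(pos).equals(word)) out.add(pos + 1);
        return out;
    }
}

class Sequence extends Rule {
    final Rule[] parts;
    Sequence(Rule... parts) { this.parts = parts; }
    List<Integer> match(List<String> words, int pos) {
        List<Integer> current = new ArrayList<>();
        current.add(pos);
        for (Rule part : parts) {
            List<Integer> next = new ArrayList<>();
            for (int p : current) next.addAll(part.match(words, p));
            current = next;
        }
        return current;
    }
}

class Alternatives extends Rule {
    final Rule[] options;
    Alternatives(Rule... options) { this.options = options; }
    List<Integer> match(List<String> words, int pos) {
        List<Integer> out = new ArrayList<>();
        for (Rule option : options) out.addAll(option.match(words, pos));
        return out;
    }
}

// Rule count: the inner rule may appear between min and max times.
class Repeat extends Rule {
    final Rule inner;
    final int min, max;
    Repeat(Rule inner, int min, int max) { this.inner = inner; this.min = min; this.max = max; }
    List<Integer> match(List<String> words, int pos) {
        List<Integer> out = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        current.add(pos);
        for (int n = 0; n <= max; n++) {
            if (n >= min) out.addAll(current);
            if (n == max) break;
            List<Integer> next = new ArrayList<>();
            for (int p : current) next.addAll(inner.match(words, p));
            current = next;
            if (current.isEmpty()) break;
        }
        return out;
    }
}

class GrammarDemo {
    // ("append" | "insert") ["a"] "while" "loop"
    static Rule whileLoopGrammar() {
        return new Sequence(
            new Alternatives(new Token("append"), new Token("insert")),
            new Repeat(new Token("a"), 0, 1),
            new Token("while"), new Token("loop"));
    }

    public static void main(String[] args) {
        Rule grammar = whileLoopGrammar();
        System.out.println(grammar.accepts("append while loop"));
        System.out.println(grammar.accepts("delete while loop"));
    }
}
```

Because only phrases that fit the tree are accepted, anything outside the declared vocabulary is simply not a candidate, which is the property the paragraph above relies on.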
(2) Syntax tree
Happy Hands uses a syntax tree as the central data structure that keeps the state of the user’s work. The two principal advantages of the syntax tree over an ordinary text buffer are:
(a) The syntax tree enables the transcriber units to deal with the data at the level of its meaning instead of its representation in human readable form.
(b) Every syntax tree node has a location and size in the document, making it a convenient unit of selection for editing operations such as replacing the selected node with another, inserting a node before or after the selected node, and making changes to attributes associated with the selected node. Changes are reflected in the node’s text buffer rendition.
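As an illustration of point (b), the following Java sketch shows a node that carries its location and size and can therefore serve as a unit of selection. The class SpanNode and its fields are hypothetical names chosen for this example; the source does not describe the actual node layout used by Happy Hands.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical syntax tree node that knows where its text sits in the document.
class SpanNode {
    final String kind;
    final int offset;   // start position in the document
    final int length;   // number of characters covered
    final List<SpanNode> children = new ArrayList<>();

    SpanNode(String kind, int offset, int length) {
        this.kind = kind;
        this.offset = offset;
        this.length = length;
    }

    SpanNode add(SpanNode child) { children.add(child); return this; }

    boolean contains(int pos) { return pos >= offset && pos < offset + length; }

    // Returns the smallest node whose span contains `pos`, or null if `pos` is outside.
    SpanNode smallestAt(int pos) {
        if (!contains(pos)) return null;
        for (SpanNode child : children) {
            SpanNode hit = child.smallestAt(pos);
            if (hit != null) return hit;
        }
        return this;
    }
}

class SpanDemo {
    // Tree for the ten-character text "x = a + b;".
    static SpanNode sample() {
        SpanNode sum = new SpanNode("binary", 4, 5)
            .add(new SpanNode("identifier", 4, 1))
            .add(new SpanNode("identifier", 8, 1));
        return new SpanNode("statement", 0, 10)
            .add(new SpanNode("identifier", 0, 1))
            .add(sum);
    }

    public static void main(String[] args) {
        SpanNode root = sample();
        System.out.println(root.smallestAt(8).kind);  // position of "b"
        System.out.println(root.smallestAt(6).kind);  // position of "+"
    }
}
```

A pointer position maps to the innermost enclosing node, so selection always lands on a whole syntactic element rather than on an arbitrary character range.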
(3) Text buffer
Happy Hands uses a traditional text buffer unit to store the text form of the computer code. The text buffer is kept in addition to the syntax tree to enable the user to write code using the keyboard. The keyboard is required for entering new symbol names and string literals, and it is comforting to be able to fall back to the traditional input method at any time.
(4) Text generator
The text generator transforms the syntax tree into the computer code text. Because changes to the syntax tree are made continuously as the user speaks, the text is generated from a small sub-tree rather than the entire syntax tree. When it is asked to update the text buffer, the text generator finds the common parent of all the changed syntax tree nodes. Starting at this node it recursively descends into the syntax tree, emitting text into a temporary buffer. Upon completion, the text generator removes the old text associated with the changed tree node and inserts in its place the newly generated text.
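The update procedure described above can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the Happy Hands implementation: GenNode and TextGenerator are hypothetical names, leaves hold literal text (including whitespace), and the bookkeeping that shifts the spans of later nodes is one possible way to keep positions consistent after the splice.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical tree node; leaves hold literal text, inner nodes only group children.
class GenNode {
    String text;                  // leaf text, or null for an inner node
    GenNode parent;
    final List<GenNode> children = new ArrayList<>();
    int offset, length;           // span of this node's text in the buffer

    GenNode(String text) { this.text = text; }

    GenNode add(GenNode child) { child.parent = this; children.add(child); return this; }

    // Recursively emits text into `out` and records spans; `base` is the
    // buffer position that corresponds to index 0 of `out`.
    void emit(StringBuilder out, int base) {
        offset = base + out.length();
        if (text != null) out.append(text);
        for (GenNode child : children) child.emit(out, base);
        length = base + out.length() - offset;
    }

    void shift(int delta) {
        offset += delta;
        for (GenNode child : children) child.shift(delta);
    }
}

class TextGenerator {
    // Lowest node that is an ancestor of (or equal to) every changed node.
    // Assumes all changed nodes belong to the same tree.
    static GenNode commonParent(List<GenNode> changed) {
        GenNode candidate = changed.get(0);
        for (GenNode node : changed) {
            Set<GenNode> ancestors = new HashSet<>();
            for (GenNode a = candidate; a != null; a = a.parent) ancestors.add(a);
            GenNode b = node;
            while (!ancestors.contains(b)) b = b.parent;
            candidate = b;
        }
        return candidate;
    }

    // Regenerates only the text under the common parent and splices it into the buffer.
    static void update(StringBuilder buffer, List<GenNode> changed) {
        GenNode top = commonParent(changed);
        int oldOffset = top.offset, oldLength = top.length;
        StringBuilder temp = new StringBuilder();
        top.emit(temp, oldOffset);
        buffer.replace(oldOffset, oldOffset + oldLength, temp.toString());
        // Keep spans outside the regenerated subtree consistent with the new text.
        int delta = top.length - oldLength;
        for (GenNode child = top; child.parent != null; child = child.parent) {
            GenNode p = child.parent;
            p.length += delta;
            int i = p.children.indexOf(child);
            for (GenNode sibling : p.children.subList(i + 1, p.children.size())) sibling.shift(delta);
        }
    }
}

class TextGeneratorDemo {
    public static void main(String[] args) {
        GenNode a = new GenNode("a"), b = new GenNode("b"), semi = new GenNode(";");
        GenNode sum = new GenNode(null).add(a).add(new GenNode(" + ")).add(b);
        GenNode stmt = new GenNode(null)
            .add(new GenNode("x")).add(new GenNode(" = ")).add(sum).add(semi);
        StringBuilder buffer = new StringBuilder();
        stmt.emit(buffer, 0);          // initial full generation: "x = a + b;"
        a.text = "apple";              // two changes made to the tree
        b.text = "banana";
        TextGenerator.update(buffer, List.of(a, b));
        System.out.println(buffer);
    }
}
```

In the demo only the sub-expression under the common parent of the two changed leaves is regenerated; the rest of the buffer is left untouched.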
(5) Text parser
The text parser unit is responsible for transforming the computer code text into the syntax tree. Happy Hands quickly derives the syntax tree from the text by incrementally reparsing small sections of the text document instead of always parsing the entire document, which allows the user to type without interruption. The text parser unit is also invoked when a new file is opened from disk.
As the user types between reparse computations, Happy Hands tracks changes to the syntax tree node sizes. When it comes time to regenerate the text from changed syntax tree nodes, the nodes therefore hold their correct sizes, and the correct segment of the text buffer is removed and replaced. The parser is invoked when the user appears to have finished typing for the moment.
The text parser keeps the sizes and positions of the elements it finds in the text, with a convention matching that of the text generator. The small syntax tree produced by a reparse replaces the old subtree enclosing the reparsed text, overlaying the text positions and sizes perfectly. Optionally, the text generator can be invoked on the small syntax tree to ensure consistent formatting, although this behavior can disturb the user’s typing.
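The size tracking between reparses might look like the following Java sketch. The names SizedNode and TypingTracker are hypothetical, and the boundary convention (text typed at a node's start pushes the node to the right, text typed inside it or at its end makes it grow) is an assumption made for this example; the source does not say which convention Happy Hands uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node whose span is kept up to date while the user types.
class SizedNode {
    final String kind;
    int offset, length;
    final List<SizedNode> children = new ArrayList<>();

    SizedNode(String kind, int offset, int length) {
        this.kind = kind;
        this.offset = offset;
        this.length = length;
    }

    SizedNode add(SizedNode child) { children.add(child); return this; }

    void shift(int delta) {
        offset += delta;
        for (SizedNode child : children) child.shift(delta);
    }

    // `count` characters were typed at document position `pos`.
    void textInserted(int pos, int count) {
        if (pos <= offset) {
            shift(count);                       // typed before this node: move it
        } else if (pos <= offset + length) {
            length += count;                    // typed inside or at the end: grow it
            for (SizedNode child : children) child.textInserted(pos, count);
        }
        // Otherwise the text was typed after this node and nothing changes.
    }
}

class TypingTracker {
    // The root covers the whole document, so it grows on every insertion.
    static void documentInsert(SizedNode root, int pos, int count) {
        root.length += count;
        for (SizedNode child : root.children) child.textInserted(pos, count);
    }

    public static void main(String[] args) {
        // Spans for "x = a + b;"; the user types "bc" right after "a".
        SizedNode a = new SizedNode("identifier", 4, 1);
        SizedNode b = new SizedNode("identifier", 8, 1);
        SizedNode sum = new SizedNode("binary", 4, 5).add(a).add(b);
        SizedNode stmt = new SizedNode("statement", 0, 10)
            .add(new SizedNode("identifier", 0, 1)).add(sum);
        documentInsert(stmt, 5, 2);
        System.out.println(a.length + " " + b.offset + " " + stmt.length);
    }
}
```

With spans maintained this way, a later regeneration can still remove exactly the text that belongs to a node even though no reparse has happened yet.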
(6) Context analyzer
The context analyzer builds lists of symbol names by analyzing the syntax tree of the source code file with which the user is working. It also analyzes source code and compiled files to which the edited file refers. For example, during the transcription of the JAVA programming language, the context analyzer keeps lists of field names, variable names, method names, type names, and package names. It caches its findings and monitors changes to the symbol lists to decide when speech grammars need to be regenerated.
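The caching and change monitoring can be illustrated with a small Java sketch. SymbolCache is a hypothetical class, and comparing the new symbol set with the cached one is only one plausible way to decide that a grammar is out of date.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical cache of symbol lists, keyed by category ("variable", "field", ...).
class SymbolCache {
    private final Map<String, Set<String>> lists = new HashMap<>();

    // Stores the latest list for a category. Returns true when it differs from
    // the cached list, meaning grammars built from it need to be regenerated.
    boolean update(String category, Collection<String> names) {
        Set<String> fresh = new TreeSet<>(names);
        Set<String> old = lists.put(category, fresh);
        return !fresh.equals(old);
    }

    Set<String> get(String category) {
        return lists.getOrDefault(category, Collections.emptySet());
    }
}

class SymbolCacheDemo {
    public static void main(String[] args) {
        SymbolCache cache = new SymbolCache();
        System.out.println(cache.update("variable", java.util.Arrays.asList("apple", "orange")));
        System.out.println(cache.update("variable", java.util.Arrays.asList("orange", "apple")));
    }
}
```

Because the comparison ignores order, reanalyzing an unchanged file does not trigger an unnecessary grammar rebuild.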
The transcriber units set up the speech recognizer engine to listen for and respond to speech recognition events that identify their specialized grammars. Other units of similar design handle tasks not related to emitting computer code, such as listening to the user’s request to open files and running the compiler.
The transcriber units are the main user interface to Happy Hands. There is a different type of transcriber unit for the transcription of every significant type of computer code. A different transcriber unit is chosen depending on the type of computer code element the user is editing. The active transcriber is changed in response to the addition of a new code element, or the selection of an existing element. A few of the many types of transcribers in Happy Hands are: a transcriber for writing class definitions, a transcriber for writing method definitions, a transcriber for writing for loops, and a transcriber for writing expressions.
A transcriber’s first job is to generate grammars for the speech recognizer to match against incoming audio data. The collective set of these grammars defines the spoken language that Happy Hands understands. The recognizer will respond only to speech that fits the pattern of one of these grammars. The grammars are designed to resemble English sentences. Each transcriber sets up the recognizer with grammars specialized for the type of code it transcribes. The grammars have an essentially constant overlying structure with one or more variable sections that are populated by symbol names found within the current scope of the computer code. The context analyzer unit is consulted to return symbol name lists and some common sub-grammars, called grammar rules.
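A grammar with a constant structure and variable sections can be sketched as a simple text template. The sketch below emits a rule in the style of the Java Speech Grammar Format (JSGF); whether Happy Hands uses JSGF is not stated in the source, so the output format, the class GrammarBuilder, and the rule name <identifier> are all assumptions. The “variable” and “field” prefixes follow the spoken forms shown in Fig 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical builder that fills the variable sections of a fixed grammar template.
class GrammarBuilder {
    // Variable section: one alternative per spoken symbol name.
    static String alternatives(List<String> names) {
        return "( " + String.join(" | ", names) + " )";
    }

    // Constant structure: a prefix word followed by one of the names in scope.
    static String identifierRule(List<String> variables, List<String> fields) {
        List<String> branches = new ArrayList<>();
        if (!variables.isEmpty()) branches.add("variable " + alternatives(variables));
        if (!fields.isEmpty()) branches.add("field " + alternatives(fields));
        if (branches.isEmpty()) return "";   // nothing to listen for in this scope
        return "public <identifier> = " + String.join(" | ", branches) + ";";
    }

    public static void main(String[] args) {
        System.out.println(identifierRule(
            Arrays.asList("apple", "orange"), Arrays.asList("count")));
    }
}
```

When the scope changes, only the name lists differ, so the same template can be reused to produce the updated grammar.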
The speech recognizer sends speech recognition events to the transcribers in the form of token sequences. The transcribers respond by making changes to the syntax tree. A transcriber analyzes a token sequence to determine what the tokens mean. It then inserts, removes, or changes elements of the syntax tree according to the pattern matched in the token sequence. A change to the syntax tree triggers the text generator unit to update the text buffer unit.
Upon changes to the document via voice, via keyboard, or a change in insertion point, the grammars that have become out of date are updated to follow the new document. Grammars are updated asynchronously, meaning grammars that become invalid are not immediately recomputed, but instead are queued and updated after the controlling logic determines that the document context has become stable. In the case of the JAVA programming language, typical changes requiring grammar updates include addition of declared variables, member fields, member methods, import statements, or a change of scope.
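The queued, deferred update can be sketched in Java as below. The source says only that the controlling logic waits until the document context has become stable; treating “stable” as “no change for a fixed quiet period” is an assumption of this sketch, as are the class name GrammarUpdateQueue and the explicit time parameter, which is passed in so that the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical queue of out-of-date grammars, flushed once the document is quiet.
class GrammarUpdateQueue {
    private final Set<String> stale = new LinkedHashSet<>();
    private final long quietMillis;
    private long lastChange;

    GrammarUpdateQueue(long quietMillis) { this.quietMillis = quietMillis; }

    // Marks a grammar as invalid; it is not recomputed immediately.
    void invalidate(String grammarName, long now) {
        stale.add(grammarName);
        lastChange = now;
    }

    // Returns the grammars to regenerate if the document has been stable for
    // the quiet period, otherwise an empty list.
    List<String> poll(long now) {
        if (stale.isEmpty() || now - lastChange < quietMillis) return new ArrayList<>();
        List<String> ready = new ArrayList<>(stale);
        stale.clear();
        return ready;
    }

    public static void main(String[] args) {
        GrammarUpdateQueue queue = new GrammarUpdateQueue(500);
        queue.invalidate("variables", 0);
        queue.invalidate("methods", 100);
        queue.invalidate("variables", 200);   // repeated invalidation is queued once
        System.out.println(queue.poll(600));  // still inside the quiet period
        System.out.println(queue.poll(700));  // stable: both grammars are returned
    }
}
```

Queuing by name means that a burst of edits affecting the same grammar costs a single regeneration.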
Symbol names found within the computer code are often renamed within a grammar so that they can be spoken more easily. Conversions include breaking compound words into multiple words and changing non-English words into sequences of independent letters. Upon recognition, these modified names are mapped back to the original symbol names found within the code.
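The renaming and remapping can be sketched in Java as follows. The splitting rule (underscores and lower-to-upper case boundaries) and the use of a word set to decide what counts as an English word are assumptions made for illustration; SpokenNames is a hypothetical class, and the source does not describe the actual conversion rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical conversion between symbol names and their spoken forms.
class SpokenNames {
    // Breaks a compound symbol into words; parts that are not in the
    // dictionary are spelled out as independent letters.
    static String toSpoken(String symbol, Set<String> dictionary) {
        List<String> words = new ArrayList<>();
        for (String part : symbol.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
            if (part.isEmpty()) continue;
            String lower = part.toLowerCase(Locale.ROOT);
            if (dictionary.contains(lower)) {
                words.add(lower);
            } else {
                for (char c : lower.toCharArray()) words.add(String.valueOf(c));
            }
        }
        return String.join(" ", words);
    }

    // Table used after recognition to recover the original symbol name.
    static Map<String, String> buildRemap(List<String> symbols, Set<String> dictionary) {
        Map<String, String> remap = new HashMap<>();
        for (String symbol : symbols) remap.put(toSpoken(symbol, dictionary), symbol);
        return remap;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("get", "user", "name", "count"));
        System.out.println(toSpoken("getUserName", dictionary));
        System.out.println(toSpoken("tmpCount", dictionary));
    }
}
```

Two different symbols could in principle share a spoken form under this scheme; the sketch keeps the last one and does not attempt to resolve such collisions.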
Happy Hands has a graphical text editor for presenting the computer code and for editing via the keyboard. This component also represents the syntax tree by overlaying on the code rectangles that mark the bounds of syntax tree elements. These rectangles indicate which element of the syntax tree has the focus for speech input and which element has the focus for copy operations.
The computer’s mouse (or other pointing device) is used to select different elements in the syntax tree. The selected elements are then highlighted with a surrounding rectangle and become the target of code transcription. Happy Hands has one special feature in this regard: in many cases it is not necessary to press the mouse button to cause a selection change, which is useful because repetitive presses of mouse buttons can aggravate hand problems. If the syntax tree node under the mouse position is of the same type as the syntax tree node selected for transcription, changing the selection requires only moving the mouse, which is painless. Otherwise, the user must press the mouse button to change the active transcriber type. In addition, the independently selected element that defines the anchor for copy/paste and new element insertion is always changed with a simple mouse move.
Creation and Editing
Happy Hands is designed to facilitate both the creation of new code and the changing of existing code. One element of the syntax tree is always maintained as the currently selected element. This element is the target for speech recognition. Some spoken commands are interpreted to replace the current selection with a replacement element of comparable type. Other commands insert or append code elements in a position with respect to the current selection. Four methods are used for coding by speech input:
(a) The current selection can be replaced with a new section of code.
(b) The current selection can be modified. This method is used when speech input dictates such things as a type modifier or access modifier of a variable declaration, or the addition of an argument within a method arguments list.
(c) New code can be inserted before, after, around or within the current selection.
(d) Cut and paste operations.
This approach unifies the addition of new code and the editing of existing code.
Happy Hands uses a rectangle cursor that has both a position and a character range to indicate the selected element. It is rendered graphically as a rectangle surrounding the selected element (syntax tree node). Figs 2 and 3 show the cursor as a pair of square brackets, [ ].
Elements that are direct children of a block are inserted into the code with respect to the smallest direct block child that encompasses the currently selected element. For example, as shown in Fig 3, a while loop can be added by saying “append while loop” or “insert while loop”. The selection then changes to the “test expression element” of the “while loop element”, enabling the user to specify the expression. In general, when adding a statement to a block, the current selection is automatically changed to a sub element, which is then ready to be modified as needed.
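The placement rule can be sketched in Java as follows. CodeNode and BlockEditor are hypothetical names, and the reading that “append” places the new statement after the enclosing statement is an assumption of this sketch; the source does not spell out the difference between “append” and “insert”.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, minimal syntax tree node with a parent link.
class CodeNode {
    final String kind;
    CodeNode parent;
    final List<CodeNode> children = new ArrayList<>();

    CodeNode(String kind) { this.kind = kind; }

    CodeNode add(CodeNode child) { child.parent = this; children.add(child); return this; }
}

class BlockEditor {
    // Smallest node that contains `selection` and is a direct child of a block.
    static CodeNode directBlockChild(CodeNode selection) {
        CodeNode node = selection;
        while (node.parent != null && !node.parent.kind.equals("block")) node = node.parent;
        if (node.parent == null) throw new IllegalArgumentException("selection is not inside a block");
        return node;
    }

    // "append while loop": adds the loop after the enclosing statement and
    // returns the new selection, the loop's test expression.
    static CodeNode appendWhileLoop(CodeNode selection) {
        CodeNode anchor = directBlockChild(selection);
        CodeNode block = anchor.parent;
        CodeNode test = new CodeNode("no_value");
        CodeNode loop = new CodeNode("while").add(test).add(new CodeNode("block"));
        loop.parent = block;
        block.children.add(block.children.indexOf(anchor) + 1, loop);
        return test;
    }

    public static void main(String[] args) {
        CodeNode identifier = new CodeNode("identifier");
        CodeNode first = new CodeNode("statement").add(new CodeNode("expression").add(identifier));
        CodeNode block = new CodeNode("block").add(first).add(new CodeNode("statement"));
        CodeNode selection = appendWhileLoop(identifier);
        System.out.println(block.children.get(1).kind + " / " + selection.kind);
    }
}
```

Even though the selection in the demo is a deeply nested identifier, the loop is added at statement level, directly after the statement that contains it.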
Expressions are assembled by replacing the current selection with the sub expression given by the spoken phrase. The default expression starts out as a single identifier called “no_value”. The expression grows from here when a phrase indicating a binary operator is spoken. For example, saying “addition” will replace “no_value” with “no_value + no_value”. Speaking a phrase indicating an identifier name will replace one of these “no_value” placeholders with a valid variable name, field name or method invocation. The current selection is moved automatically to unspecified expression elements, enabling the user to string expression generation phrases together. The selection cursor dances down the expression as the expression is spoken. This process is described in Fig 2.
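The placeholder mechanism can be sketched in Java as below. This is an interpretation, not the Happy Hands implementation: Expr and ExpressionTranscriber are hypothetical names, a binary operator is assumed to take the selected element as its left operand (which reproduces both the “addition” example above and the “assignment” step in Fig 2), and operator precedence and parentheses are ignored when rendering.

```java
// Hypothetical expression node: either a leaf identifier or a binary operation.
class Expr {
    static final String NO_VALUE = "no_value";
    String leaf;          // identifier text, or null for a binary node
    String op;
    Expr left, right;

    Expr(String leaf) { this.leaf = leaf; }

    String render() {
        return leaf != null ? leaf : left.render() + " " + op + " " + right.render();
    }

    // First unspecified element in left-to-right order, or null if none remain.
    Expr firstPlaceholder() {
        if (leaf != null) return NO_VALUE.equals(leaf) ? this : null;
        Expr found = left.firstPlaceholder();
        return found != null ? found : right.firstPlaceholder();
    }
}

class ExpressionTranscriber {
    final Expr root = new Expr(Expr.NO_VALUE);
    Expr selection = root;   // always a leaf in this sketch

    // Spoken identifier: the selected element becomes that name.
    void identifier(String name) {
        selection.leaf = name;
        advance();
    }

    // Spoken operator: the selected element becomes the left operand of a new
    // binary node whose right operand is still unspecified.
    void operator(String symbol) {
        Expr leftOperand = new Expr(selection.leaf);
        selection.leaf = null;
        selection.op = symbol;
        selection.left = leftOperand;
        selection.right = new Expr(Expr.NO_VALUE);
        advance();
    }

    // Moves the selection to the next unspecified element, if there is one.
    private void advance() {
        Expr next = root.firstPlaceholder();
        if (next != null) selection = next;
    }

    String render() { return root.render(); }

    public static void main(String[] args) {
        ExpressionTranscriber t = new ExpressionTranscriber();
        t.identifier("apple");     // "variable apple"
        t.operator("=");           // "assignment"
        t.identifier("orange");    // "variable orange"
        t.operator("+");           // "addition"
        t.identifier("grape");     // "variable grape"
        System.out.println(t.render());
    }
}
```

Because the selection advances on its own after every phrase, the whole expression in the demo is entered without any explicit cursor movement.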
Performance Evaluation
We have analyzed the speed and reliability of Happy Hands. The program was tested by the program’s author himself. The test assumes the speech recognizer has been trained, and that the user is well practiced in the use of Happy Hands.
To test the reliability, we tracked the recognition accuracy during the dictation of a short pre-written program of 80 lines. A data point was taken after every spoken phrase. Recognition accuracy results are divided into four categories:
1) the recognizer understood the spoken phrase perfectly, which is the best case
2) the recognizer didn't understand with sufficient confidence to respond, so that the user needs to repeat the statement
3) the recognizer misunderstood the spoken phrase but responded, resulting in unintended code
4) the recognizer partially misunderstood the spoken phrase, which usually results in the matching of some imagined tokens at the end, leading to partially unintended code
In the first test, 125 phrases were spoken during the course of the dictation. Ninety-nine were recognized perfectly, 24 were ignored, and 2 were misrecognized. We think these results are more than good enough to establish user confidence in the system. The results reflect both the quality of the speech recognition engine and the design of the spoken language. Our spoken language is designed to force the user to say rather verbose phrases. This increases the recognition rate by giving the recognizer a longer, more featured audio profile to analyze. Another aspect of the design that affects recognition accuracy is the number of grammars that are available at any one time and the size of these grammars. Happy Hands keeps several grammars available for use at any time, whereas others are turned on and off according to the current insert point in the code. This system strikes a good balance between recognition accuracy and convenience for the user.
| Test | Perfect match | No result | All wrong | Trailing |
|---|---|---|---|---|
| New code, ViaVoice | 99 events | 24 events | 2 events | 0 events |
Table 1. Recognition reliability
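The event counts in Table 1 can be converted to rates over the 125 spoken phrases with a few lines of Java. The class and method names are hypothetical; the counts are those reported in the table.

```java
import java.util.Locale;

// Converts the event counts of Table 1 into percentages of all spoken phrases.
class RecognitionRates {
    static String percent(int events, int total) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * events / total);
    }

    public static void main(String[] args) {
        int perfect = 99, noResult = 24, allWrong = 2, trailing = 0;
        int total = perfect + noResult + allWrong + trailing;   // 125 phrases
        System.out.println("Perfect match: " + percent(perfect, total));
        System.out.println("No result:     " + percent(noResult, total));
        System.out.println("All wrong:     " + percent(allWrong, total));
        System.out.println("Trailing:      " + percent(trailing, total));
    }
}
```

This gives 79.2% perfect matches, 19.2% phrases that had to be repeated, and 1.6% phrases that produced unintended code.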
To compare the relative speeds of coding by speech recognition versus coding by typing, we measured the overall time required to copy a JAVA source file. The file may be downloaded from [7] for verification purposes. Coding by speech recognition took about twice as long as coding by typing (25 minutes vs. 12 minutes). However, these results do not tell the whole story. A programmer rarely copies files during coding. Instead, much of a programmer’s time is spent planning the code, with short intermittent periods of inputting the text. Our sample code actually took an hour to design and write the first time. Realistically, the total time of coding via speech recognition vs. coding by typing would be about 70 minutes vs. 60 minutes. In actual practice, using the system is a fairly relaxing and satisfying task of speaking a few lines of code, testing them, speaking a few more lines, and editing some existing lines. In our experience, coding via speech does not seem to slow a programmer down appreciably. At the end of the day, his code is still written, beautifully formatted, and tested, and his hands aren’t any the worse for wear.
| Task | Time with speech | Time by typing |
|---|---|---|
| Copying file, 80 lines | 25 minutes | 12 minutes |
Table 2: Coding speed
References
[1] The Happy Hands Java Speech Editor, http://www.h-dm.com/HappyHands
[2] IBM Speak Pad, IBM Corporation
[3] ViaVoice, IBM Corporation
[4] Microsoft Speech Recognizer Engine, Microsoft Corporation
[5] Alvin J. Surkan, Spoken-word direction of computer program synthesis. Proceedings of the APL00 Conference
[6] Stephen C. Arnold, Programming by Voice, VocalProgramming. Proceedings of the fourth international ACM conference on Assistive technologies
[7] Speed test code, http://www.h-dm.com/HappyHands/testing/acm_speed_test.java
Fig 1: Large scale data flow
Speech recognition input and keyboard input are resolved to a single data structure – a syntax tree - that holds the state of the user’s file. Speech input results in a modification of the syntax tree, and then a modification of the text buffer. Keyboard input results in a modification of the text buffer, and then a modification of the syntax tree. Speech grammars are developed by analyzing the syntax tree and given to the recognizer.

Fig 2: Expression synthesis
Expressions are built with Happy Hands by speaking symbol names and operator names in any order. Variable and field names are prefixed with “variable” and “field” to make recognition more reliable. This sequence assumes the variables apple, orange, avocado and grape have already been declared, as Happy Hands understands only formally declared symbol names. The expression begins as [no_value]. “Variable apple” causes the expression transcriber to replace [no_value] with [apple], then “assignment” causes [apple] to be replaced with “apple = [no_value]”. Generally, the selected element is replaced with the code given by the spoken phrase. The entire phrase is processed and rendered to the text buffer at once.

Fig 3: Block element synthesis
The while loop is created with “append while loop”, “insert while loop”, or “enclose with while loop”. The selection is changed automatically to the test expression. The expression is spoken in the manner described in Fig 2.