Computer Code Voice Transcription

 

Sean W Hennessy

The Scripps Research Institute

 

 

Writing computer code by speaking has long been regarded by many people as a subject of science fiction.  The ability to code by speaking instead of typing is very attractive, especially for people who have repetitive stress injuries caused by the computer keyboard.  We present here the design of the first system that enables programming by voice rather than by hand.  One implementation of the system, which we call Happy Hands, is available as a commercial product under the name The Happy Hands Java Speech Editor [1].

            Prior work in the field of computer document transcription via speech recognition consists of simple applications for writing common English text documents, such as the SpeakPad [2] that comes with ViaVoice.  These systems use the entire English language as the recognition target.  In contrast, Happy Hands uses specialized speech grammars derived from the context of the computer code as the recognition target.  Additionally, these prior systems output their results as unformatted strings of words, whereas the output of Happy Hands is properly formatted and punctuated computer code.  Computer code dictation in particular has been considered in theory.  For example, “Spoken-word direction of computer program synthesis” by Alvin J. Surkan [5] presents ways to set up vocabularies and “agents” that direct the creation of computer code.  Second, “Programming by Voice, VocalProgramming” by Stephen C. Arnold [6] touches on the idea of code insertion, but not to the level of development of the system described here.

            The development of Happy Hands was guided by five main goals.  The first and foremost was to reduce repetitive stress injuries caused by the computer keyboard.  Towards this end, a number of related goals were pursued.  The second goal was to raise the user's level of thought from the detailed syntax of the code to a higher level idea of what he wishes to accomplish.  The third goal was that the system be context sensitive.  We wanted to be able to speak every element of the computer code, without having to edit it afterwards by hand.  This means that symbols such as method names needed to be recognizable from speech and needed to come into the resulting code directly.  The fourth goal was to make the editing of existing code as easy as creating new code.  This includes operations such as removing a specified block of code, performing a copy and paste operation via voice, and replacing a sub-expression with new code.  The fifth goal was to allow code to be entered with the keyboard.  A few editing patterns seem to lie more in the domain of the text buffer than in the logical meaning of the code, so the system had to resolve both speech input and keyboard input to a common state, smoothly and unobtrusively.  This proved to be one of the most challenging parts of the design.

 

Design

 

            Happy Hands is organized into units that together form the large scale data flow (Fig 1).

 

(1) Speech recognizer

            The speech recognizer translates voice input into individual tokens.  Happy Hands makes use of off-the-shelf speech recognition engines.  Both the ViaVoice recognizer from IBM [3] and the Microsoft recognizer version 4 [4] work well. 

            Present day speech recognition engines use two systems for identifying words.  In one system, the recognizer assumes the user is speaking in a conversational style and matches segments of audio data against any word in the entire English language, giving a list of words as a result.  In the other system, the recognizer is configured with a more limited grammar consisting of a tree of rule alternatives, rule sequences, rule counts and tokens.  (This is similar to the widely known regular expressions for pattern identification in a document.)  The rule grammar target is a better choice for speech coding because of the formal nature of computer code, in which only declared symbols, keywords, operators, and punctuation may appear.  Assembling a grammar composed of these symbols, keywords, and operators defines a language enabling one to code by speaking.
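As a minimal sketch of such a rule grammar, the Java snippet below assembles grammar text in the style of the Java Speech Grammar Format (JSGF) from a list of declared variable names.  The class name, rule names, and spoken phrases are illustrative assumptions; they are not the grammars Happy Hands actually uses.

```java
import java.util.List;

// Hypothetical sketch: builds a small JSGF-style rule grammar from declared symbols.
class GrammarSketch {

    // Joins alternatives with '|', the JSGF notation for a choice between tokens.
    static String alternatives(List<String> words) {
        return String.join(" | ", words);
    }

    // Produces grammar text with a fixed operator rule and a variable rule
    // that is populated from the symbols currently in scope.
    static String build(List<String> variableNames) {
        StringBuilder sb = new StringBuilder();
        sb.append("#JSGF V1.0;\n");
        sb.append("grammar expression;\n");
        sb.append("<operator> = addition | subtraction | assignment;\n");
        sb.append("<variable> = variable (").append(alternatives(variableNames)).append(");\n");
        sb.append("public <phrase> = (<variable> | <operator>)+;\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(build(List.of("apple", "orange", "grape")));
    }
}
```

Only the variable rule changes when a new symbol is declared; the overall structure of the grammar stays constant.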

 

 (2) Syntax tree

Happy Hands uses a syntax tree as the central data structure that keeps the state of the user’s work.  The two principal advantages of the syntax tree over an ordinary text buffer are:

(a) The syntax tree enables the transcriber units to deal with the data at the level of its meaning instead of its representation in human readable form.

(b) Every syntax tree node has a location and size in the document, making it a convenient unit of selection for editing operations such as replacing the selected node with another, inserting a node before or after the selected node, and making changes to attributes associated with the selected node.  Changes are reflected in the node’s text buffer rendition.
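The following Java sketch illustrates point (b) under simple assumptions: each node stores a character offset and length, so a buffer position can be mapped to the smallest enclosing node, and a node can be replaced in place.  The class and field names are hypothetical and are not taken from the Happy Hands source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type: every node records where its text lives in the buffer.
class TreeNode {
    final String kind;      // e.g. "block", "expression", "identifier"
    int offset;             // start position in the text buffer
    int length;             // number of characters covered
    TreeNode parent;
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String kind, int offset, int length) {
        this.kind = kind;
        this.offset = offset;
        this.length = length;
    }

    TreeNode add(TreeNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    // Replaces oldChild with newChild in place; returns false if oldChild is not a child.
    boolean replaceChild(TreeNode oldChild, TreeNode newChild) {
        int i = children.indexOf(oldChild);
        if (i < 0) return false;
        newChild.parent = this;
        oldChild.parent = null;
        children.set(i, newChild);
        return true;
    }

    // Finds the smallest node whose range contains the given buffer position,
    // or null if the position lies outside this node.
    TreeNode nodeAt(int position) {
        if (position < offset || position >= offset + length) return null;
        for (TreeNode c : children) {
            TreeNode hit = c.nodeAt(position);
            if (hit != null) return hit;
        }
        return this;
    }
}
```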

 

 

(3) Text buffer

Happy Hands uses a traditional text buffer unit to store the text form of the computer code.  The text buffer is kept in addition to the syntax tree to enable the user to write code using the keyboard.  The keyboard is required for entering new symbol names and string literals, and it is comforting to be able to fall back to the traditional input method any time. 

 

(4) Text generator

The text generator transforms the syntax tree into the computer code text.  Because changes to the syntax tree are made continuously as the user speaks, the text is generated from a small sub-tree rather than the entire syntax tree.  When it is asked to update the text buffer, the text generator finds the common parent of all the changed syntax tree nodes.  Starting at this node it recursively descends into the syntax tree, emitting text into a temporary buffer.  Upon completion, the text generator removes the old text associated with the changed tree node and inserts in its place the newly generated text.
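The update step described above can be sketched as follows.  This is a simplified illustration, not the actual Happy Hands generator: the node type, the single-space separator between children, and all names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node: leaves carry text, inner nodes only group their children.
class GenNode {
    String text;                 // leaf text; null for inner nodes
    int offset, length;          // range this node occupies in the text buffer
    GenNode parent;
    final List<GenNode> children = new ArrayList<>();

    GenNode(String text) { this.text = text; }

    GenNode add(GenNode child) { child.parent = this; children.add(child); return this; }

    int depth() {
        int d = 0;
        for (GenNode n = parent; n != null; n = n.parent) d++;
        return d;
    }
}

class TextGeneratorSketch {
    // Lowest node that contains both a and b (possibly a or b itself).
    static GenNode commonParent(GenNode a, GenNode b) {
        int da = a.depth(), db = b.depth();
        while (da > db) { a = a.parent; da--; }
        while (db > da) { b = b.parent; db--; }
        while (a != b) { a = a.parent; b = b.parent; }
        return a;
    }

    // Recursively emits a subtree into out and records each node's new range.
    // base is the buffer position at which the contents of out will be placed.
    static void emit(GenNode n, StringBuilder out, int base) {
        n.offset = base + out.length();
        if (n.text != null) {
            out.append(n.text);
        } else {
            for (int i = 0; i < n.children.size(); i++) {
                if (i > 0) out.append(' ');
                emit(n.children.get(i), out, base);
            }
        }
        n.length = base + out.length() - n.offset;
    }

    static void shift(GenNode n, int delta) {
        n.offset += delta;
        for (GenNode c : n.children) shift(c, delta);
    }

    // Regenerates only the subtree under the common parent of the changed nodes
    // and splices the new text over the old range in the buffer.
    static void update(StringBuilder buffer, List<GenNode> changed) {
        GenNode top = changed.get(0);
        for (GenNode n : changed) top = commonParent(top, n);
        int oldOffset = top.offset, oldLength = top.length;
        StringBuilder temp = new StringBuilder();
        emit(top, temp, oldOffset);
        buffer.replace(oldOffset, oldOffset + oldLength, temp.toString());
        // Ancestors grow or shrink, and nodes that follow the splice move accordingly.
        int delta = temp.length() - oldLength;
        for (GenNode child = top, p = top.parent; p != null; child = p, p = p.parent) {
            p.length += delta;
            for (int j = p.children.indexOf(child) + 1; j < p.children.size(); j++) {
                shift(p.children.get(j), delta);
            }
        }
    }
}
```

Because only the subtree under the common parent is re-emitted, the cost of an update depends on the size of the change rather than the size of the file.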

(5) Text parser

The text parser unit is responsible for transforming the computer code text into the syntax tree.  Happy Hands derives the syntax tree from the text quickly by incrementally reparsing small sections of the text document instead of always parsing the entire document, so that the user can continue to type without interruption.  The text parser unit is also invoked when a new file is opened from disk.

As the user types between reparse computations, Happy Hands tracks the resulting changes to the syntax tree node sizes.  When the time comes to regenerate text from changed syntax tree nodes, each node therefore holds its correct size, and the correct segment of the text buffer is removed and replaced.  The parser is invoked when the user appears to have finished typing for the moment.
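One way to picture this bookkeeping is the Java sketch below, which adjusts node ranges when characters are typed.  It is an assumption-laden illustration: the names are hypothetical, only insertions are handled, and the boundary convention (typing at the end of a node extends it) is our own choice for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node that keeps its buffer range up to date while the user types.
class SizedNode {
    int offset, length;
    final List<SizedNode> children = new ArrayList<>();

    SizedNode(int offset, int length) { this.offset = offset; this.length = length; }

    SizedNode add(SizedNode child) { children.add(child); return this; }

    // Moves this node and its whole subtree by delta characters.
    void shift(int delta) {
        offset += delta;
        for (SizedNode c : children) c.shift(delta);
    }

    // Records that count characters were typed at position. Call this on the root.
    // The node grows; children starting at or after the position move right;
    // a child that contains the position, or ends exactly at it, grows in turn.
    void insertText(int position, int count) {
        length += count;
        for (SizedNode c : children) {
            if (c.offset >= position) {
                c.shift(count);
            } else if (position <= c.offset + c.length) {
                c.insertText(position, count);
            }
        }
    }
}
```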

The text parser keeps the sizes and positions of the elements it finds in the text, with a convention matching that of the text generator.  The small syntax tree produced by an incremental reparse replaces the old subtree that enclosed the reparsed text, overlaying the text positions and sizes perfectly.  Optionally, the text generator can be invoked on the small syntax tree to ensure consistent formatting, although this behavior can disturb the user’s typing.

 

(6) Context analyzer

            The context analyzer builds lists of symbol names by analyzing the syntax tree of the source code file with which the user is working.  It also analyzes source code and compiled files to which the edited file refers.  For example, during the transcription of the Java programming language, the context analyzer keeps lists of field names, variable names, method names, type names, and package names.  It caches its findings and monitors changes to the symbol lists in order to decide when speech grammars need to be regenerated.
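A minimal sketch of the caching and change detection might look like the following.  The class name and the category strings are assumptions for illustration; the real analyzer also inspects referenced source and compiled files, which is omitted here.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical cache of symbol lists, keyed by category ("variable", "method", ...).
class ContextAnalyzerSketch {
    private final Map<String, Set<String>> cache = new HashMap<>();

    // Stores the latest names for a category and reports whether they differ
    // from the cached list, i.e. whether dependent grammars are now out of date.
    boolean update(String category, Collection<String> names) {
        Set<String> fresh = new TreeSet<>(names);
        Set<String> old = cache.put(category, fresh);
        return !fresh.equals(old);
    }

    Set<String> names(String category) {
        return cache.getOrDefault(category, Collections.emptySet());
    }
}
```

A caller would regenerate a grammar only when update returns true for one of the categories that grammar depends on.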

 

(7) Transcribers

The transcriber units set up the speech recognizer engine to listen for and respond to speech recognition events that identify their specialized grammars.  Other units of similar design handle tasks not related to emitting computer code, such as listening for the user’s requests to open files and to run the compiler.

            The transcriber units are the main user interface to Happy Hands.  There is a different type of transcriber unit for the transcription of every significant type of computer code.  A different transcriber unit is chosen depending on the type of computer code element the user is editing.  The active transcriber is changed in response to the addition of a new code element, or the selection of an existing element.  A few of the many types of transcribers in Happy Hands are: a transcriber for writing class definitions, a transcriber for writing method definitions, a transcriber for writing for loops, and a transcriber for writing expressions.

A transcriber’s first job is to generate grammars for the speech recognizer to match against incoming audio data.  The collective set of these grammars defines the spoken language that Happy Hands understands.  The recognizer will respond only to speech that fits the pattern of one of these grammars.  The grammars are designed to resemble English sentences.  Each transcriber sets up the recognizer with grammars specialized for the type of code it transcribes.  The grammars have an essentially constant overall structure with one or more variable sections that are populated by symbol names found within the current scope of the computer code.  The context analyzer unit is consulted to return symbol name lists and some common sub-grammars, called grammar rules.

            The speech recognizer sends speech recognition events to the transcribers in the form of token sequences.  The transcribers respond by making changes to the syntax tree.  A transcriber analyzes a token sequence to determine what the tokens mean.  It then inserts, removes, or changes elements of the syntax tree according to the pattern matched in the token sequence.  A change to the syntax tree triggers the text generator unit to update the text buffer unit.
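The dispatch from a recognition event to the active transcriber can be sketched in Java as below.  This is a simplified assumption of how such routing might work: the element kinds, the phrases, and the returned descriptions of tree edits are invented for the example, and a real transcriber would modify the syntax tree rather than return a string.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher: one transcriber per kind of selected code element.
class TranscriberDispatchSketch {
    private final Map<String, Function<List<String>, String>> transcribers = new HashMap<>();

    TranscriberDispatchSketch() {
        // Transcriber for blocks: understands a statement insertion phrase.
        transcribers.put("block", tokens ->
            tokens.equals(List.of("append", "while", "loop"))
                ? "insert while loop after selection" : "ignored");
        // Transcriber for expressions: understands "variable <name>".
        transcribers.put("expression", tokens ->
            tokens.size() == 2 && tokens.get(0).equals("variable")
                ? "replace selection with identifier " + tokens.get(1) : "ignored");
    }

    // Routes a token sequence to the transcriber for the selected element kind
    // and returns a description of the resulting syntax tree edit.
    String onRecognition(String selectedKind, List<String> tokens) {
        Function<List<String>, String> t = transcribers.get(selectedKind);
        return t == null ? "no transcriber" : t.apply(tokens);
    }

    public static void main(String[] args) {
        TranscriberDispatchSketch d = new TranscriberDispatchSketch();
        System.out.println(d.onRecognition("expression", List.of("variable", "apple")));
    }
}
```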

Upon changes to the document via voice, via keyboard, or a change in insertion point, the grammars that have become out of date are updated to follow the new document.  Grammars are updated asynchronously, meaning grammars that become invalid are not immediately recomputed, but instead are queued and updated after the controlling logic determines that the document context has become stable.  In the case of the Java programming language, typical changes requiring grammar updates include addition of declared variables, member fields, member methods, import statements, or a change of scope.
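The queueing behavior can be sketched as follows.  The stability test used here, a fixed quiet period since the last change, is an assumption made for the example; the text does not specify how the controlling logic decides that the document has become stable.  Timestamps are passed in explicitly so that the sketch is deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical queue of grammars waiting to be regenerated.
class GrammarUpdateQueue {
    private final Set<String> pending = new LinkedHashSet<>();
    private final long quietMillis;
    private long lastChange;

    GrammarUpdateQueue(long quietMillis) { this.quietMillis = quietMillis; }

    // Records that a grammar is out of date; repeated requests collapse into one entry.
    void invalidate(String grammarName, long now) {
        pending.add(grammarName);
        lastChange = now;
    }

    // Returns the grammars to rebuild once no change has arrived for quietMillis,
    // otherwise an empty list. Returned grammars are removed from the queue.
    List<String> drainIfStable(long now) {
        if (pending.isEmpty() || now - lastChange < quietMillis) {
            return Collections.emptyList();
        }
        List<String> ready = new ArrayList<>(pending);
        pending.clear();
        return ready;
    }
}
```

Collapsing repeated invalidations means that a burst of typing leads to one regeneration per grammar rather than one per keystroke.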

Symbol names found within the computer code are often renamed within a grammar so that they can be spoken more easily.  Conversions include breaking compound words into multiple words, and changing non-English words into sequences of independent letters.  Upon recognition, these modified names are mapped back to the original symbol names found within the code.
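A rough Java sketch of this renaming is given below.  It assumes that compound names follow camelCase or underscore conventions and that a small word list stands in for an English dictionary; both are simplifications, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical converter between code symbols and speakable phrases.
class SpokenNames {
    private final Set<String> dictionary;                          // known English words
    private final Map<String, String> spokenToSymbol = new HashMap<>();

    SpokenNames(Set<String> dictionary) { this.dictionary = dictionary; }

    // Converts a symbol into a speakable phrase and remembers the mapping.
    String toSpoken(String symbol) {
        List<String> parts = new ArrayList<>();
        // Split at underscores and at lower-to-upper case boundaries.
        for (String word : symbol.split("_|(?<=[a-z0-9])(?=[A-Z])")) {
            if (word.isEmpty()) continue;
            String lower = word.toLowerCase(Locale.ROOT);
            if (dictionary.contains(lower)) {
                parts.add(lower);
            } else {
                // Not an English word: spell it out letter by letter.
                parts.add(String.join(" ", lower.split("")));
            }
        }
        String spoken = String.join(" ", parts);
        spokenToSymbol.put(spoken, symbol);
        return spoken;
    }

    // Maps a recognized phrase back to the original symbol, or null if unknown.
    String toSymbol(String spoken) {
        return spokenToSymbol.get(spoken);
    }
}
```

With a dictionary containing “get”, “file”, “name” and “max”, the symbol getFileName becomes “get file name” and maxIdx becomes “max i d x”; both phrases map back to the original symbols after recognition.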

(8) Text editor

Happy Hands has a graphical text editor to present the computer code and to allow editing via the keyboard.  This component also represents the syntax tree by overlaying on the code rectangles that mark the bounds of syntax tree elements.  These rectangles indicate which element of the syntax tree has the focus for speech input and which element has the focus for copy operations. 

The computer’s mouse (or other pointing device) is used to select different elements in the syntax tree.  The selected elements are then highlighted with a surrounding rectangle and become the target of code transcription.  Happy Hands has one special feature in this regard: in many cases it is not necessary to press the mouse button to cause a selection change, which is useful because repetitive presses of mouse buttons can aggravate hand problems.  If the syntax tree node under the mouse position is the same type as the syntax tree node selected for transcription, changing the transcription target requires only moving the mouse.  Otherwise, the user must press the mouse button to change the active transcriber type.  In addition, an independently selected element that defines the anchor for copy/paste and new element insertion is always changed with a simple mouse move.

 

 

Creation and Editing

 

Happy Hands is designed to facilitate both the creation of new code and the changing of existing code.  One element of the syntax tree is always maintained as the currently selected element.  This element is the target for speech recognition.  Some spoken commands are interpreted to replace the current selection with a replacement element of comparable type.  Other commands insert or append code elements in a position with respect to the current selection.  Four methods are used for coding by speech input:

(a)     The current selection can be replaced with a new section of code. 

(b)    The current selection can be modified.  This method is used when speech input dictates such things as a type modifier or access modifier of a variable declaration, or the addition of an argument within a method arguments list.

(c)     New code can be inserted before, after, around or within the current selection.

(d)    Cut and paste operations.

This approach unifies the addition of new code and the editing of existing code.

            Happy Hands uses a rectangle cursor that has both a position and a character range to indicate the selected element.  It is rendered graphically as a rectangle surrounding the selected element (syntax tree node).  Figs 2 and 3 show the cursor as square brackets, [].

            Elements that are direct children of a block are inserted into the code with respect to the smallest direct block child that encompasses the currently selected element.  For example, a while loop can be added by saying “append while loop” or “insert while loop”.  The selection then changes to the “test expression element” of the “while loop element”, enabling the user to specify the expression, as shown in Fig 3.  In general, when adding a statement to a block, the current selection is automatically changed to a sub element, which is then ready to be modified as needed.
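The anchoring rule can be sketched in Java as follows.  The element type, the kind strings, and the method names are assumptions for illustration only; the sketch covers just the “append” case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of inserting a statement relative to the current selection.
class BlockInsertSketch {
    static class Elem {
        final String kind;
        Elem parent;
        final List<Elem> children = new ArrayList<>();
        Elem(String kind) { this.kind = kind; }
        Elem add(Elem child) { child.parent = this; children.add(child); return this; }
    }

    // Walks up from the selection to the smallest element that encloses it
    // and is a direct child of a block.
    static Elem enclosingStatement(Elem selected) {
        Elem e = selected;
        while (e.parent != null && !"block".equals(e.parent.kind)) e = e.parent;
        return e;
    }

    // "append while loop": adds a while loop after the enclosing statement and
    // returns its test expression, which becomes the new selection.
    static Elem appendWhile(Elem selected) {
        Elem anchor = enclosingStatement(selected);
        Elem block = anchor.parent;
        if (block == null) throw new IllegalStateException("selection is not inside a block");
        Elem test = new Elem("expression");
        Elem loop = new Elem("while").add(test).add(new Elem("block"));
        loop.parent = block;
        block.children.add(block.children.indexOf(anchor) + 1, loop);
        return test;
    }
}
```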

            Expressions are assembled by replacing the current selection with the sub expression given by the spoken phrase.  The default expression starts out as a single identifier called “no_value”.  The expression grows from here when a phrase indicating a binary operator is spoken.  For example, saying “addition” will replace “no_value” with “no_value + no_value”.  Speaking a phrase indicating an identifier name will replace one of these “no_value” placeholders with a valid variable name, field name or method invocation.  The current selection is moved automatically to unspecified expression elements, enabling the user to string expression generation phrases together.  The selection cursor dances down the expression as the expression is spoken.  This process is described in Fig 2.
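The sequence in Fig 2 can be imitated with the small Java sketch below.  It is a simplified model built on assumptions: the class and method names are hypothetical, only identifiers and binary operators are handled, and operator precedence and parentheses are ignored.  The selection is rendered with square brackets, as in the figures.

```java
// Hypothetical model of expression synthesis with "no_value" placeholders.
class ExpressionSketch {
    static final String PLACEHOLDER = "no_value";

    static class Expr {
        String name;        // identifier name; null for a binary operation
        String operator;    // operator symbol when name is null
        Expr left, right;
        Expr(String name) { this.name = name; }
        String render(Expr selected) {
            String s = (name != null)
                ? name
                : left.render(selected) + " " + operator + " " + right.render(selected);
            return this == selected ? "[" + s + "]" : s;
        }
    }

    Expr root = new Expr(PLACEHOLDER);
    Expr selected = root;

    // "variable apple": replaces the selection with an identifier, then moves
    // the selection to the next unspecified element, if there is one.
    void identifier(String name) {
        selected.name = name;
        selected.operator = null;
        selected.left = selected.right = null;
        Expr next = firstPlaceholder(root);
        if (next != null) selected = next;
    }

    // "addition", "assignment": the selection becomes the left operand of a new
    // binary operation whose right operand is a fresh placeholder.
    void binary(String operator) {
        Expr leftOperand = new Expr(selected.name);
        leftOperand.operator = selected.operator;
        leftOperand.left = selected.left;
        leftOperand.right = selected.right;
        selected.name = null;
        selected.operator = operator;
        selected.left = leftOperand;
        selected.right = new Expr(PLACEHOLDER);
        selected = firstPlaceholder(root);
    }

    // Leftmost placeholder in the expression, or null if none remains.
    static Expr firstPlaceholder(Expr e) {
        if (e.name != null) return PLACEHOLDER.equals(e.name) ? e : null;
        Expr hit = firstPlaceholder(e.left);
        return hit != null ? hit : firstPlaceholder(e.right);
    }

    String render() { return root.render(selected); }
}
```

Calling identifier("apple") followed by binary("=") on a fresh instance renders “apple = [no_value]”, which matches the first two steps described in Fig 2.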

 

 

 

 

Performance Evaluation

 

We have analyzed the speed and reliability of Happy Hands.  The program was tested by the program’s author himself.  The test assumes the speech recognizer has been trained, and that the user is well practiced in the use of Happy Hands.

            To test the reliability, we tracked the recognition accuracy during the dictation of a short pre-written program of 80 lines.  A data point was taken after every spoken phrase.  Recognition accuracy results are divided into four categories:

1) the recognizer understood the spoken phrase perfectly - the best case

2) the recognizer didn't understand with sufficient confidence to respond, so that the user needs to repeat the statement

3) the recognizer misunderstood the spoken phrase but responded, resulting in unintended code

4) the recognizer partially misunderstood the spoken phrase, which usually results in the matching of some imagined tokens at the end, leading to partially unintended code

            In the first test, 125 spoken phrases were made during the course of the dictation.  Ninety-nine (79%) were recognized perfectly, 24 (19%) were ignored, and 2 (2%) were misrecognized.  We think these results are more than good enough to establish user confidence in the system.  These results reflect both the quality of the speech recognition engine and the design of the spoken language.  Our spoken language is designed to force the user to say rather verbose phrases.  This increases the recognition rate by giving the recognizer a longer, more featured audio profile to analyze.  Another aspect of the design that affects recognition accuracy is the number of grammars that are available at any one time and the size of these grammars.  Happy Hands keeps several grammars available for use at any time, whereas others are turned on and off according to the current insert point in the code.  This system strikes a good balance between recognition accuracy and convenience for the user.

 

 

 

                      Perfect match    No result    All wrong    Trailing

New code, ViaVoice    99 events        24 events    2 events     0 events

Table 1: Recognition reliability

 

 

            To compare the relative speeds of coding by speech recognition versus coding by typing, we measured the overall time required to copy a Java source file.  The file may be downloaded from [7] for verification purposes.  Coding by speech recognition is about twice as slow as coding by typing (25 minutes vs. 12 minutes).  However, these results do not tell the whole story.  A programmer rarely copies files during coding.  Instead, much of a programmer’s time is spent planning the code, with short intermittent periods of inputting the text.  Our sample code actually took an hour to design and write the first time.  Realistically, since dictation adds only about 13 minutes of input time, the total time of coding via speech recognition vs. coding by typing would be roughly 70 minutes vs. 60 minutes.  And in actual practice, using the system is a fairly relaxing and satisfying task of speaking a few lines of code, testing them, speaking a few more lines, and editing some existing lines.  Coding via speech in fact does not seem to slow a programmer down appreciably.  At the end of the day, his code is still written, beautifully formatted, and tested, and his hands are none the worse for wear.

 

 

 

 

                          Time with speech    Time by typing

Copying file, 80 lines    25 minutes          12 minutes

Table 2: Coding speed

 

 

References

 

[1] The Happy Hands Java Speech Editor, http://www.h-dm.com/HappyHands

[2] IBM Speak Pad, IBM Corporation

[3] ViaVoice, IBM Corporation

[4] Microsoft Speech Recognizer Engine, Microsoft Corporation

[5] Alvin J. Surkan, Spoken-word direction of computer program synthesis.  Proceedings of the APL00 Conference

[6] Stephen C. Arnold, Programming by Voice, VocalProgramming.  Proceedings of the fourth international ACM conference on Assistive technologies

[7] Speed test code, http://www.h-dm.com/HappyHands/testing/acm_speed_test.java

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


 

Fig 1: Large scale data flow

Speech recognition input and keyboard input are resolved to a single data structure, a syntax tree, that holds the state of the user’s file.  Speech input results in a modification of the syntax tree, and then a modification of the text buffer.  Keyboard input results in a modification of the text buffer, and then a modification of the syntax tree.  Speech grammars are developed by analyzing the syntax tree and are then given to the recognizer.

 

 

 

 

 

Fig 2: Expression synthesis

Expressions are built with Happy Hands by speaking symbol names and operator names in any order.  Variable and field names are prefixed with “variable” and “field” to make recognition more reliable.  This sequence assumes the variables apple, orange, avocado and grape have already been declared, as Happy Hands understands only formally declared symbol names.  The expression begins as [no_value].  “Variable apple” causes the expression transcriber to replace [no_value] with [apple], then “assignment” causes [apple] to be replaced with “apple = [no_value]”.  Generally, the selected element is replaced with the code given by the spoken phrase.  The entire phrase is processed and rendered to the text buffer at once.

 

 

 

 

Fig 3: Block element synthesis

The while loop is created with “append while loop”, or “insert while loop”, or “enclose with while loop”.  The selection is changed automatically to the test expression.  The expression is spoken in the manner described in Fig 2.