1
Multimodal Framework Design,
Children’s Multimodal Usage Patterns and the
Efficacy of Mutual Disambiguation for Children
  • NYU Computer Science
  • Ph.D. Thesis Proposal
  • by
  • Christopher A. Robbins
2
Thesis Proposal Presentation: Part I
  • Multimodal Interface Fundamentals
  • Human Factors in Speech Based Interfaces
  • Speech and Gesture Based Multimodal Interface Design History
  • Related Speech and Gesture Based Multimodal Interface Studies



3
Thesis Proposal Presentation: Part II
  • Proposed Thesis Overview and Goals
  • Current Work in Multimodal Frameworks
  • Proposed Framework Implementation
  • Proposed Children’s Multimodal Usage Study
  • Schedule and Deliverables
4
Part I: Multimodal Interfaces
5
What are Multimodal Interfaces?
  • Multimodal Interfaces process two or more combined user input modes in a coordinated manner with multimedia system output.
  • Past multimodal interfaces have combined such input modes as speech, pen, touch, manual gesture, gaze, and head and body movements.
6
Advantages of Multimodal Interfaces
  • Choice of modality for user to
    • convey different types of input,
    • use combined modes, or
    • alternate between modes
  • Potential to accommodate a broader range of users
  • Prevention of overuse of, and physical damage from, any single modality
  • Ability to accommodate changing conditions
  • Efficiency gains
  • Improved error handling
7
Differences Between Conventional GUIs & Multimodal Interfaces
  • Conventional GUIs
  • Assume a single input event stream that controls the event loop, with sequential processing
  • Assume interface actions are atomic and unambiguous
  • Do not require temporal constraints. Architecture is not time sensitive.
  • Multimodal Interfaces
  • Typically process continuous and simultaneous input from parallel input streams
  • Process input modes using recognition based technology
  • Require time stamping of input and development of temporal constraints on mode fusion operations.
8
Multimodal Interfaces Terminology
  • Multimodal Interfaces - interfaces that process two or more input modes in a coordinated manner with multimedia output.
  • Active Input Modes – modes that are intentionally deployed by the user as an explicit command.
  • Passive Input Modes – modes that consist of naturally occurring user behaviors or actions that provide input to an application and do not require an explicit command.
  • Blended Multimodal Interfaces – multimodal interfaces that incorporate at least one passive and one active input mode
  • Temporally-cascaded Multimodal Interfaces – multimodal interfaces whose modalities tend to be sequenced in a particular temporal order such that partial information from one modality is available to constrain interpretation of a later mode.
9
Multimodal Interface Terminology
  • Mutual Disambiguation – disambiguation of the input content of a possibly error-prone modality using partial information supplied by another.
  • Deictic Gestures – a gesture that contributes to the identification of an object (or group of objects) merely by indicating their location.
  • Referent Object(s) – the object (or group of objects) to which a Deictic Gesture refers.
  • Feature Level Fusion – method for fusing low-level information from parallel input streams within a multimodal architecture.
10
Multimodal Interface Terminology
  • Semantic-level Fusion – method of integrating semantic information derived from parallel input modes in a multimodal architecture.
  • Frame-based Integration – pattern matching technique used for merging attribute value data structures in order to fuse semantic information from two input modes into a common meaning.
  • Unification-based integration – a logic based method used for integrating partial meaning fragments from two input modes into a common meaning.
  • Windows-Icon-Mouse-Pointer (WIMP) Interfaces – refers to conventional OS and application interfaces based upon the  window and icon metaphor GUI using the mouse as a pointing device.
11
Speech and Gesture Based Multimodal Interfaces
  • …are multimodal interfaces combining speech recognition and gesture recognition.
  • ...are the most pervasive and thoroughly studied multimodal input combination to date.
  • …represent the most accessible multimodal combination, as microphones and touchpad or pen input devices are widely available and often packaged with current personal computers and laptops. Gesture based input can also be provided via conventional mouse input devices.
  • ...design and implementation guidelines are far from mature, leaving much opportunity for research.


12
Human Factors Considerations in Speech Based Interfaces
  • Important to differentiate between command based and natural language speech interfaces:
    • Command speech involves the recognition of token words or phrases, whereas
    • Natural language involves the recognition of the full sentence structures and richness of a natural spoken language such as English.
  • Voice quality concerns in speech recognition
    • Interpersonal variations in voice caused by physical features, cultural speech patterns, gender, age, native language, and accents.
    • Intrapersonal variability with time of day, health, background noise


13
Short Term Memory for Speech Based Input
  • WIMP based GUIs avoid short term memory requirements by continuously presenting a discrete set of allowed input choices.
  • However, for speech based interfaces to benefit from their inherent expressiveness, the user must rely on memory to recall the set of allowed input commands.
  • Thus speech based interfaces often need to provide a guidance system similar to those presented by GUIs (i.e. “What can I say?”) as an initial training mode. The guidance system also helps to establish the actual vocabulary available to the user.
14
Conversational Technique
  • Speech based interfaces also need to address problems arising from clarification, back-channel utterances, dialog repair, turn taking, and topic introduction.
    • Clarification – repetition of the same command request using new syntax
    • Back-channel utterances – “uh huh”, “um”, and “ah”
    • Dialog repair – pause in speech to backtrack and rephrase input
    • Turn taking – human-computer speech interfaces require complete input commands versus interactive commands common in human-human interaction
    • Topic introduction – can change the semantic meaning of syntactically similar commands
  • Determining when an input command utterance has concluded is also more ambiguous in speech based input.
15
Speech and Gesture Based Multimodal Interface History
  • “Put-That-There” – Bolt 1980
  • CUBRICON – Neal 1990
  • XTRA – Wahlster 1991


16
“Put-That-There” Introduction
  • Developed in 1980 by the Architecture Machine Group at MIT
  • Led by Richard Bolt
  • Most highly referenced multimodal system.
  • One of the earliest multimodal interfaces
  • Combined voice and gesture based input.
17
“Put-That-There” Overview
  • Interaction takes place in a  media room about the size of a personal office.
  • Gesture based input is primarily deictic arm movements used to point at and drag objects
  • Speech input allows for simple English sentence structures using a limited vocabulary.


18
“Put-That-There” Example Scenario
  • The example scenario consists of input requests for performing the following commands on basic geometric objects:
    • creating
    • customizing
    • moving
    • copying
19
“Put-That-There” Example Scenario
  • Example speech input commands include:
    • “Create a blue square there”
    • “Make that smaller”
    • “Put that there”
  • Deictic gestures are used to resolve pronoun ambiguity.
  • Uses speech recognition vocabulary to limit input commands thus improving efficiency.
20
CUBRICON Introduction
  • Developed in 1990 as a military situation assessment tool.
  • Led by J. Neal
  • Combined spoken and typed natural language with deictic gestures for both input and output


21
CUBRICON System Overview
  • Similar to “Put-That-There” in that it utilizes pointing gestures to clarify references to entities based on simultaneous NL input
  • Introduced the concept of generating a multimodal language based upon a dynamic knowledge base, user model, and discourse model
22
CUBRICON Architecture
23
CUBRICON Knowledge Sources
  • The following knowledge sources are used by the parser to interpret the compound input stream:
    • Lexicon – dictionary of words/commands
    • Grammar – defines the multimodal language
    • Discourse Model – dynamically maintains knowledge pertinent to the current dialog
    • User Model – aids interpretation based on the user’s profile, goals, and plans
    • Knowledge Base – contains information related to the task space

24
CUBRICON Interface
  • Mouse input used to select on-screen content such as windows, tables, icons, and points.
  • Spoken and written input specify action(s) that refer back to selected objects.
  • Multiple object selections are allowed per spoken input, as in “What are these items?” while sequentially pointing to several objects.
  • Provided initial insight into the use of multiple input modes working cooperatively to achieve greater accuracy.
25
CUBRICON Multimodal Output
  • Introduced design guidelines for multimodal  gesture and speech output
    • System visually indicates referent objects in parallel with output speech
    • Indicates containing windows of referent objects
    • Weakly highlights multiple windows and icons if more than one instance of a referent object exists


26
XTRA Introduction
  • XTRA (eXpert TRAnslator)
  • Introduced in 1991 as an intelligent multimodal interface to expert systems
  • Researched and developed by Wolfgang Wahlster at the German Research Center for Artificial Intelligence
27
XTRA Overview
  • Builds upon User and Discourse Model Research
  • Combines natural language, graphics, and pointing for both input and output.
  • Uses a gesture analysis methodology based on focus granularity that…
    • constrains referents in speech to a gesture based region
    • aids the system in interpretations of subsequent definite noun phrases which refer to objects in the focused area
28
XTRA Illustrative Tax Form Application
  • Use of XTRA to facilitate filling in a tax form.
  • Gesture based input and output both occur in the tax form display section to the left.
  • Natural language input and response text are displayed in the panel to the right.


29
XTRA Input
  • Four levels of granularity are provided for indicating the selected area of dialog focus:
    • Exact pointing with a pencil
    • Standard pointing with the index finger
    • Vague pointing with the entire hand
    • Encircling a region with an ‘@’ sign
  • Also allows three types of gestures: point, underline and encircle.
30
XTRA Dialog Focus Support
  • Dialog focus used to indicate areas of attention.
  • Thus focusing gestures modify the discourse model.
  • Removes ambiguity of language based references to multiply occurring fields or objects
31
XTRA Intrinsic and Extrinsic Relations
  • Discourse model also modified based upon intrinsic and extrinsic relations.
  • For example the phrase, “the right circle”:
    • Intrinsically implies circle 5 if no pointing gesture is present
    • Extrinsically implies circle 3 if the pointing gesture is considered
32
XTRA Output
  • XTRA’s output combines language and gesture based upon the discourse model.
  • Gesture output capability includes the same granular area indicators, attention specifiers, and intrinsic versus extrinsic assumptions used for input.
33
Illustrative Speech and Gesture Based Multimodal Interface Systems
  • QuickSet – Cohen 1997
  • Human-Centric Word Processor – Vergo 1998
  • Portable Voice Assistant – Bers 1998
  • Field Medic Information System – Holzman 1999


34
QuickSet System
  • Java-based architecture for providing speech and pen multimodal interfaces for map-based applications
  • Collaborative multimodal system designed to run on multiple platforms and work with a collection of distributed applications
35
QuickSet Interface
  • 2D map oriented design takes advantage of research indicating that multimodal gesture and speech interfaces are best suited for spatial interaction.
  • Supports gesture symbols for creating and placing entities on the map
  • Allows map annotation using points, lines, and areas


36
LeatherNet Implementation Using Quickset
  • LeatherNet is a military set-up and control simulation using the QuickSet Multimodal Interface architecture
  • Symbols used by LeatherNet appear to the left



37
QuickSet Multimodal Contributions
  • Provides a general framework to facilitate the inclusion of multimodal interfaces with 2D map based applications
  • Introduces a framework supporting gesture input symbols



38
Human-Centric Word Processor (HCWP)
  • Word processing system that allows content to be input via continuous real-time dictation
  • Content can then be edited using a speech and pen based interface.
  • Addresses desire to correct and edit text obtained via speech dictation
  • Designed by IBM research
39
HCWP Multimodal Text Editing
  • Provides a conventional text editing interface using the keyboard
  • Spatial text editing allows for multimodal pen input to format, correct, and manipulate dictated text
    • Example input commands include: “Delete this word”, or “Underline from here to here”.
    • Usability tests showed the multimodal text editing capability reduced task completion time relative to unimodal speech editing
40
HCWP Speech Input Consideration
  • Addressed the human speech factor of uncertain command conclusion by using a microphone with a toggle switch.
  • Avoids problems arising from a modeless speech input interface.
  • Results showed users quickly adopted modal speech input.
41
Portable Voice Assistant
  • Pen and speech recognition architecture for developing multimodal interfaces for online web content.
  • Designed to run on mobile pen based computers with microphones and a wireless connection to the Internet (e.g., Tablet PCs).


42
VoiceLog Prototype Using PVA
  • VoiceLog is a Web applet using PVA
  • Allows users to browse an online catalog and order parts.
  • Also allows users to inspect technical diagrams and schematics of machines and parts to determine correct parts
43
VoiceLog Usage Scenario
  • Scenario (U: user, VL: VoiceLog)
    • U: “Show me the humvee.”
    • VL: displays image of HMMWV
    • U: “Expand engine.”
    • VL: flashes area of the engine and replaces the image with a part diagram for the engine
    • U: (while pointing at a region with the pen) “Expand this area.”
    • VL: Highlights area, the fuel pump, and displays expanded view.
    • U: (points to a specific screw in diagram) “I need 10.”
    • VL: brings up order form filled out with the part name, part number, and quantity fields for the item.
44
Portable Voice Assistant Contributions
  • Modular multimodal architecture design
  • Centralized speech recognition server
  • Simple single windowed interface
  • Web based architecture
45
Field Medic Information System
  • Portable system providing a speech and pen based multimodal interface allowing medical personnel to remotely document patient care and status in the field
  • Consists of two hardware devices:
    • Wearable computer with headset
    • Handheld tablet PC
46
Field Medic Contributions
  • Illustrates the benefits of multimodal interface adaptability by providing a hands-free speech input device with the ability to switch to a pen based input device for complex interactions.
  • Note that the Field Medic Information System is multimodal because it allows the user to choose from multiple input modes. However, it was not designed to process these input modes simultaneously.
47
Multimodal Interface Use Studies
  • Human Factor in Integration and Synchronization of Input Modes
  • Mutual Disambiguation of Recognition Errors
  • Multimodal Interfaces for Children
48
Human Factors in Integration and Synchronization of Input Modes
  • Study conducted by Oviatt to explore multimodal integration and synchronization patterns that occur during pen and speech based human-computer interaction.
  • Evaluate the linguistic features of spoken multimodal interfaces and how they differ from unimodal speech recognition interfaces.
  • Determine how spoken and written modes are naturally integrated and synchronized during multimodal input construction
49
Human Factors Study Interface
  • Users were asked to perform the real estate agent task of selecting suitable homes based on a given client profile
  • Homes were displayed on a map, and clients specified environment features they preferred
  • Users could interact with interface using speech, gesture, or both.
50
Human Factors Study Results: Multimodal Usage Preferences
  • 100% of users used both spoken and written commands
  • Interviews revealed that 95% of users preferred to interact multimodally


51
Human Factors Study Results:  Multimodal Input Usage by Command Type
52
Human Factors Study Results: Multimodal Input Command Frequencies
  • In 100% of multimodal inputs, pen was used to indicate location and spatial information



53
Human Factors Study:
Multimodal Usage Analysis Highlights
  • 82% of multimodal input constructions involved draw and speak, while point and speak represented only 14%.
  • Of draw and speak constructions, 42% were simultaneous, 32% sequential, and 12% were compound (i.e. draw graphic, then point to it while speaking).


54
Mutual Disambiguation of Recognition Errors:
Overview
  • Study conducted by Oviatt to research the benefits of mutual disambiguation for avoiding and recovering from speech recognition errors
  • Participants were asked to interact with a map-based multimodal system in which the user sets up simulations for community and fire control scenarios
  • During the course of performing requested simulation set up tasks each subject entered about 100 multimodal commands
55
Mutual Disambiguation Study: Goals
  • Research the benefits of using mutual disambiguation to reduce multimodal input recognition error rates.
  • Explore differences in mutual disambiguation usage frequency based on gender and on native versus non-native speakers
56
Mutual Disambiguation Measure
  • A mutual disambiguation dependent measure was developed:
  • Rate of mutual disambiguation per subject (MDj)
  • Percentage of all scorable integrated commands (Nj) in which the rank of the correct lexical choice on the multimodal n-best list (RiMM) was lower than the average rank of the correct lexical choice on the speech and gesture n-best lists (Ris and Rig), minus the percentage of commands in which the rank of the correct choice was higher on the multimodal n-best list than its average rank on the speech and gesture n-best lists:
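  • Expressed as a formula (reconstructed from the verbal definition above; sign(·) evaluates to +1, 0, or −1):

    MD_j = \frac{1}{N_j} \sum_{i=1}^{N_j} \operatorname{sign}\!\left( \frac{R_i^{s} + R_i^{g}}{2} - R_i^{MM} \right)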
57
Other Mutual Disambiguation Measures
  • Measure of Multimodal Pull-Ups –
    • Multimodal pull-ups occur during MD when the correct lexical choice for speech, for gesture, or for both was retrieved from a position worse than first choice on its respective n-best list
  • Percentage of correct…
    • Unimodal Speech Recognitions
    • Unimodal Gesture Recognitions
    • Multimodal Speech and Gesture Recognitions


58
Mutual Disambiguation Study Results
  • Mutual disambiguation produced 1 in 8 of the correctly recognized multimodal inputs.
  • MD was used more often to correct non-native speakers’ inputs
  • MD usage differences based on gender were insignificant
  • Results are summarized in the table below:


59
Mutual Disambiguation Study Results (cont…)
  • The rate of multimodal pull-ups for speech was higher for non-native speakers (65%) than for native speakers (35%).
  • Speech recognition alone was successful for 72.6% of native speakers’ input versus 63.1% for non-native speakers.
  • The overall recognition rate remained lower for non-native speakers (71.7%) than for native speakers (77%); however, this gap is smaller than for speech recognition alone.
60
Multimodal Interfaces for Children
  • Study by Xiao into the multimodal interface interactions of children
  • Performed a comprehensive analysis of children’s multimodal integration patterns in a speech and pen based interface, to compare with the integration and synchronization patterns of adults.
61
Multimodal Interfaces for Children Study Overview
  • Participants aged 7 to 10 were asked to interact with a science education application called Immersive Science Education for Elementary Kids (I SEE!).
  • I SEE! teaches about marine biology through animated marine animals
  • Input involved asking marine animals questions formulated using speech, pen, or both.
  • Speech input recognition was initiated by a pen tap or writing.
62
Multimodal Interfaces for Children Study Results
  • 6500 usable child utterances were obtained
  • Unimodal speech usage dominated child utterances:
    • 10.1% multimodal
    • 80.4% unimodal speech
    • 9.6% pen input alone
  • 93.4% of multimodal pen input involved abstract scribbling
    • Of interpretable pen input, 93.2% complemented the corresponding speech input.
63
Multimodal Interfaces for Children Study
Comparison Between Children and Adults
  • Similar to multimodal interaction patterns of adults:
    • Dominant multimodal integration type was often adopted early in the interaction session.
    • Pen preceded speech in most multimodal utterances.
    • Multimodal input was used to convey complementary semantic information.
  • Sequential multimodal lag averaged 1.1 sec and ranged from 0 to 2.1 seconds, consistently shorter than for adults.
  • Children showed a higher tendency than adults to engage in manual pen activity.
64
Part II: Proposed Thesis
65
Thesis Motivation
  • Research challenges involved in advancing the field of multimodal interfaces include:
    • Creation of tools which facilitate the development of multimodal software
    • Development of metrics and techniques for evaluating alternative multimodal systems
    • Study of how people integrate and synchronize multimodal input during human-computer interaction.
66
Thesis Goals
  • Design of an Application Programming Interface (API) Framework for developing speech and gesture based multimodal interfaces with a focus on interfaces involving interaction with Web accessible 3D environments.
  • Implement an illustrative application using the proposed framework to study multimodal integration patterns in children and the efficacy of mutual disambiguation usage with children.
67
Speech and Gesture Framework API: Goals
  • Provide unimodal gesture and speech recognition as well as multimodal combined speech and gesture recognition support for:
    • Navigating, creating, and populating 3D environments
    • Controlling and directing inhabitants or avatars in such environments
  • Web based support for applications using unimodal and multimodal speech and gesture recognition
  • Automated statistics gathering tools for both unimodal and multimodal speech and gesture usage as well as multimodal disambiguation statistics
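  • Purely as a hypothetical illustration of the API surface these goals imply (none of these class, interface, or method names are part of the proposal; feature structures are modeled here as simple attribute-value maps):

    import java.io.File;
    import java.io.Reader;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch only: names and signatures are illustrative.
    public interface MultimodalFramework {

        void setSpeechGrammar(Reader jsgfGrammar);        // JSGF command grammar
        void setGestureSymbols(Set<String> symbolNames);  // e.g. "CUBE", "RIVER"

        // Receive unimodal and fused multimodal interpretations.
        void addInterpretationListener(InterpretationListener listener);

        // Toggle automated statistics gathering (usage, timing, mutual disambiguation).
        void enableStatisticsLogging(File logFile);

        interface InterpretationListener {
            void onSpeech(Map<String, Object> speechFS);
            void onGesture(Map<String, Object> gestureFS);
            void onMultimodal(List<Map<String, Object>> nBestInterpretations);
        }
    }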
68
Secondary Framework API Motivation and Goal
  • Laptops and tablet PCs are well suited for multimodal speech and gesture based interfaces as they include built-in:
    • Microphones
    • Gesture supporting input devices such as touch pads, TrackPoints, pens, and styli.
  • A framework goal is to ensure support of laptops and tablet PCs.
69
Parallel Multimodal Framework Development
  • Rutgers University Center for Advanced Information Processing (CAIP): Framework for Rapid Development of Multimodal Interfaces.
  • AMBIENTE – Workspaces for the Future: Multimodal Interaction Framework for Pervasive Game Applications called STARS
  • Pennsylvania State University and Advanced Interface Technologies, Inc.: A real-time Framework for Natural Multimodal Interaction with Large Screen Displays.
  • W3C - Multimodal Framework Specification for the Web (unimplemented).


70
Rutgers CAIP Framework
  • Java based API for multimodal interface development.
  • Highly generic late fusion multimodal integration agent for managing diverse input modes
  • Differences from proposed framework:
    • Web accessibility not considered
    • No statistics collection
    • No focus on 3D interaction

71
AMBIENTE STARS Framework
  • .NET based framework supporting multiple input modes for interactivity with board games and tabletop strategy or role playing games.
  • Focus on support for social activity inherent in such games via networking.
  • Differences from proposed framework:
    • Highly generic mode support
    • No statistics collection
    • Focus on 2D discrete gaming environments


72
Penn State Framework
  • Java based framework to facilitate speech and gesture based multimodal interfaces involving a large screen display
  • Designed to facilitate multimodal interface design research for large screen displays
  • Differences from proposed framework:
    • Focuses on multimodal interface output
    • No statistics collection
    • Not focused on 3D environment interaction
73
Parallel Framework Development Summary
  • Validate research into multimodal interface framework design and the need for such frameworks
  • Provide methodologies and design approaches for developing multimodal frameworks.
  • Reveal unexplored aspects of  multimodal framework design.
74
Proposed Framework API Implementation: Overview
  • Java based API
  • Speech Recognition (JSAPI)
  • Gesture Recognition (Customized HHReco)
  • Multimodal Integration
  • 3D Rendering Environment (Perlin Java 3D Renderer)
75
Java Speech API (JSAPI)
  • Supports command and control recognizers, dictation systems, and speech synthesizers
  • Speech recognition for web accessible applets and applications
  • Requires a JSAPI compatible implementation and native speech recognition engine
  • The native speech recognition engine can be packaged with the applet or application and automatically downloaded via Java Web Start for convenient web accessible applets and applications
76
Java Speech Grammar Format (JSGF)
  • The same format will be made available to users of the framework for specifying valid speech and multimodal input commands.
  • JSGF “Put-That-There” Example:

    #JSGF V1.0;

    grammar PutThatThere;

    public <Command> = <Action> <Object> [<Where>];

    <Action>  = put {PUT_ACTION} | delete {DEL_ACTION};
    <Object>  = <Pronoun> | ball {BALL_OBJECT} |
                square {SQUARE_OBJECT};
    <Where>   = there {LOCATION_REFERENT} |
                (above | below | to the left of | to the right of)
                <Object>;
    <Pronoun> = (that | this | it) {OBJECT_REFERENT};


77
JSGF (cont…)
  • A JSGF file can be loaded as the foundation for the recognized command grammar.
  • However, the application can dynamically modify this grammar in response to changes in the user model, discourse model, or knowledge base, as sketched below.
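  • A minimal sketch of loading and dynamically updating a JSGF grammar through JSAPI 1.0 (assumes a JSAPI-compliant engine such as the TalkingJava SDK is installed; the grammar file name and the modified rule text are illustrative, and error handling is omitted):

    import java.io.FileReader;
    import javax.speech.Central;
    import javax.speech.recognition.FinalRuleResult;
    import javax.speech.recognition.Recognizer;
    import javax.speech.recognition.ResultAdapter;
    import javax.speech.recognition.ResultEvent;
    import javax.speech.recognition.ResultToken;
    import javax.speech.recognition.RuleGrammar;

    public class GrammarLoaderSketch {
        public static void main(String[] args) throws Exception {
            // Create and allocate the default recognizer supplied by the installed JSAPI engine.
            Recognizer recognizer = Central.createRecognizer(null);
            recognizer.allocate();

            // Load the command grammar from the JSGF file shown on the previous slide.
            RuleGrammar grammar = recognizer.loadJSGF(new FileReader("PutThatThere.gram"));
            grammar.setEnabled(true);

            // Print each accepted command.
            recognizer.addResultListener(new ResultAdapter() {
                public void resultAccepted(ResultEvent e) {
                    FinalRuleResult result = (FinalRuleResult) e.getSource();
                    StringBuilder spoken = new StringBuilder();
                    for (ResultToken token : result.getBestTokens()) {
                        spoken.append(token.getSpokenText()).append(' ');
                    }
                    System.out.println("Recognized: " + spoken.toString().trim());
                }
            });

            recognizer.commitChanges();
            recognizer.requestFocus();
            recognizer.resume();

            // Later, the application can modify the grammar dynamically, e.g. when a new
            // object type enters the knowledge base (the rule text here is illustrative).
            grammar.setRule("Object",
                    grammar.ruleForJSGF("<Pronoun> | ball | square | bridge"), false);
            recognizer.commitChanges();
        }
    }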



78
TalkingJava SDK and Microsoft Speech Application SDK
  • CloudGarden’s TalkingJava SDK
    • Full JSAPI specification support
    • Integration capability with Microsoft Speech Application SDK
  • Microsoft Speech Application SDK
    • Text-to-speech engine supporting multiple voices
    • User trainable speech recognition engine
  • Supported speech input devices will include built-in laptop or tablet PC microphones, desktop microphones, and headset microphones.



79
Gesture and Symbol Recognition
  • Input handler to be implemented for non-symbol gestures (sketched after this list), including:
    • Deictic gestures
    • Spatial gestures
    • Conventional mouse equivalent gestures
  • Symbolic gestures handled through a modified HHreco symbol recognition toolkit
    • Java API
    • Adaptive multi-stroke symbol recognition system
    • Designed for use off-the-shelf or customized
  • Supported gesture input devices will include conventional mouse, touchpad, TrackPoint, and stylus.
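  • Purely as an illustrative sketch of the non-symbol gesture handler referenced above (class name and tap threshold are assumptions, not part of the proposal): taps are treated as deictic gestures, drags as spatial gestures, and every gesture is time-stamped so the integration agent can apply temporal constraints.

    import java.awt.event.MouseAdapter;
    import java.awt.event.MouseEvent;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: classifies a press/release pair as a deictic tap or a
    // spatial drag and packages it as a flat attribute-value structure.
    public class NonSymbolGestureHandler extends MouseAdapter {
        private static final int TAP_TOLERANCE_PX = 5; // assumed threshold
        private int startX, startY;
        private long startTime;

        @Override
        public void mousePressed(MouseEvent e) {
            startX = e.getX();
            startY = e.getY();
            startTime = System.currentTimeMillis();
        }

        @Override
        public void mouseReleased(MouseEvent e) {
            boolean isTap = Math.abs(e.getX() - startX) <= TAP_TOLERANCE_PX
                    && Math.abs(e.getY() - startY) <= TAP_TOLERANCE_PX;
            Map<String, Object> gestureFS = new HashMap<>();
            gestureFS.put("Classification", isTap ? "DEICTIC" : "SPATIAL");
            gestureFS.put("Start", "(" + startX + ", " + startY + ")");
            gestureFS.put("End", "(" + e.getX() + ", " + e.getY() + ")");
            gestureFS.put("Time", startTime);
            // In the proposed framework this structure would be handed to the
            // multimodal integration agent; here it is simply printed.
            System.out.println(gestureFS);
        }
    }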
80
Unification-based Multimodal Integration Agent
  • Unification is an operation that takes multiple inputs and combines them into a single interpretation
  • Framework will utilize semantic unification to produce an N-best list consisting of plausible unimodal and multimodal interpretations
  • To be implemented partially based on QuickSet multimodal unification methodology:
    • Described by Cohen and Johnston
    • Based on a typed feature structure (FS) consistently designed across unimodal and multimodal input modes.

81
Multimodal Unification Example




82
Gesture Input Feature Structure
  • Feature structure resulting from the user drawing a cube symbol gesture at (x, y, z) world coordinates (100.0, 0.0, 50.0), of size (2.0, 2.0, 2.0), at time 10 min, 20 s, 200 ms.
  • Gesture_FS {
      Classification: OBJECT
      Type:           CUBE
      Command:        INSTANTIATE
      Location:       (100.0, 0.0, 50.0)
      Size:           (2.0, 2.0, 2.0)
      Time:           10:20:200
      . . .
    }
83
Speech Input Feature Structure
  • Feature structure resulting from user saying, “Create green cube”
  • Speech_FS {
      Classification: OBJECT
      Type:           CUBE
      Command:        INSTANTIATE
      Color:          green
      Time:           10:14:200
      . . .
    }
84
Unified Multimodal Input Feature Structure
  • Multimodal feature structure resulting from integrating prior speech and gesture feature structures:
  • Multimodal_FS {
      Classification: OBJECT
      Type:           CUBE
      Command:        INSTANTIATE
      Color:          green
      Location:       (100.0, 0.0, 50.0)
      Size:           (  2.0, 2.0,  2.0)
      Time:           10:14:200
      . . .
    }
85
Multimodal Integration Agent Summary
  • Consistent semantic level feature structures passed to multimodal integration agent from speech and gesture recognition components.
  • If multiple input interpretations are plausible an N-best list is produced.
  • Selection of final interpretation from N-best list is based on default priorities or customized priorities specified by the framework API user.
  • This customization ability allows for testing various prioritization settings.
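  • A minimal sketch of the core unification step over flat attribute-value maps (illustrative only: the proposed agent uses typed feature structures, applies temporal constraints to the Time fields rather than requiring equality, and produces a ranked N-best list):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    public class UnificationSketch {

        // Merge two feature structures; fail if any shared attribute has conflicting values.
        static Optional<Map<String, Object>> unify(Map<String, Object> a, Map<String, Object> b) {
            Map<String, Object> result = new HashMap<>(a);
            for (Map.Entry<String, Object> entry : b.entrySet()) {
                Object existing = result.get(entry.getKey());
                if (existing != null && !existing.equals(entry.getValue())) {
                    return Optional.empty(); // conflicting values: unification fails
                }
                result.put(entry.getKey(), entry.getValue());
            }
            return Optional.of(result);
        }

        public static void main(String[] args) {
            Map<String, Object> speechFS = new HashMap<>();
            speechFS.put("Classification", "OBJECT");
            speechFS.put("Type", "CUBE");
            speechFS.put("Command", "INSTANTIATE");
            speechFS.put("Color", "green");

            Map<String, Object> gestureFS = new HashMap<>();
            gestureFS.put("Classification", "OBJECT");
            gestureFS.put("Type", "CUBE");
            gestureFS.put("Command", "INSTANTIATE");
            gestureFS.put("Location", "(100.0, 0.0, 50.0)");

            // Succeeds, yielding the combined multimodal feature structure.
            System.out.println(unify(speechFS, gestureFS).orElse(null));
        }
    }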
86
Perlin Java 3D Renderer
  • Java compatible API for software 3D rendering.
  • Ability to create 3D content using primitives, such as cubes, cones, cylinders, and spheres, as well as more complex objects defined on a per-vertex and per-normal basis.
  • Supports multiple directional lights and full camera location, orientation, and aim functionality.
  • Add-on utilities support importing VRML2.0 and Half-Life models.
  • Support for native accelerated graphics cards through Java OpenGL (JOGL) library.


87
Statistics Collection Engine
  • Communicates with multimodal integration agent to collect user input statistics.
  • Statistics based on feature structures generated by speech, gesture, and multimodal integration components.
  • Statistics collected include:
    • Percentage of each input type usage
    • Elapsed time between modalities for multimodal input
    • Percentage of correct results in interpreting each input
    • Mutual disambiguation usage statistics
  • Gathered statistics are logged to a server database or file, or passed to the user via a form mailer.
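  • As a hypothetical illustration of what one logged record might contain (none of these field names are specified in the proposal):

    // Hypothetical per-command log record produced by the statistics collection engine.
    public class CommandStatistics {
        public enum InputType { SPEECH_ONLY, GESTURE_ONLY, MULTIMODAL }

        public InputType inputType;
        public long speechStartMillis;        // -1 if no speech component
        public long gestureStartMillis;       // -1 if no gesture component
        public long interModalLagMillis;      // elapsed time between modalities
        public boolean interpretationCorrect; // scored against the intended command
        public boolean mutualDisambiguation;  // correct choice pulled up on the multimodal n-best list
    }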




88
Illustrative Framework Usage
  • Implement multimodal interface for RAPUNSEL project using framework API
  • Analyze both unimodal and multimodal gesture and speech usage patterns in middle school aged children
  • Explore benefits of mutual disambiguation use with children.
89
Illustrative Framework: Goals
  • Provide a representative application built upon the proposed framework API
  • Illustrate the use of the framework’s statistics gathering capabilities to acquire data to:
    • perform an analysis of children’s interactions with multimodal interfaces, and
    • investigate the benefits of mutual disambiguation use with children
  • Publish results of this analysis such that they contribute to the design of multimodal interfaces for children involving speech and gesture recognition
90
RAPUNSEL Project Overview
  • Real-time Applied Programming for Underrepresented Children’s Early Literacy.
  • Ongoing three-year NSF funded research project.
  • Software environment to introduce programming to middle school aged children through a socially engaging networked game.
  • Children adopt interactive animated characters that they can engage in adventures within a 3D world.
91
RAPUNSEL Multimodal Interface Goals
  • Allow child users to interact multimodally with the 3D RAPUNSEL world to
    • navigate his or her character,
    • create and interact with other characters and objects, and
    • ask questions about objects and the relationship between objects.
92
RAPUNSEL Interface Use Case Scenario
  • C: “Make green ball.” Makes a circular symbol gesture at a location in the 3D world scene.
    (a new ball appears at the gestured-to location)
  • C: “No! Make it red.”
    (the same ball turns red)
  • C: “Make a river like this.” Draws the river’s path along the ground.
    (a river following that path appears between the child’s character and the ball)
  • C: “Walk to it.” No gesture.
    ‘it’ is assumed to be the river based on the discourse model.
    (character walks to the river)
  • C: “No! Walk to the [unintelligible word].” Points at the ball. Mutual disambiguation is used to establish that the unintelligible word was ‘ball’.
    (character shakes its head as it cannot cross the river)
  • C: “Can you cross this?” Points to the character and then gestures along the river’s path.
    (a character speech balloon appears stating a bridge is required to cross the river)
  • C: “Create a bridge in front of Wobble.” No gesture. Wobble is the character’s name.
    (a bridge appears in front of Wobble)
93
RAPUNSEL Unimodal Speech Support
  • navigating and directing child’s character, e.g., “Walk to the ball.” or “Pick up the ball.”;
  • creating objects, e.g., “Create a ball in front of the tree.”;
  • changing objects, e.g., “Make it green.”; and
  • asking questions, e.g., “Can my character cross the river?”
94
RAPUNSEL Unimodal Gesture Support
  • navigating and directing child’s character, e.g. dragging gesture from character’s location to another and pointing to an object to be picked up;
  • creating objects, e.g. a circular symbol gesture to create a ball at the location of the gesture, or a river symbol gesture drawn along a path to create a river; and
  • moving objects, e.g., pointing to and dragging an object from one location to another.
95
RAPUNSEL Multimodal Support
  • navigating and directing child’s character, e.g. “Walk to this.” while pointing to an object to be walked to, or “Pick the ball up.” while pointing to a particular ball;
  • creating objects, e.g. “Create a green ball here.” while pointing to a location;
  • changing objects, e.g. “Make this bridge wider.” while dragging along the width of the bridge;
  • moving objects, e.g. “Move this from here to here” as child first points to object then points to the new desired location; and
  • asking questions, e.g. “What is this?” said whilst pointing to an unknown object.
96
Mutual Disambiguation with Children
  • Objectives
    • Collect and analyze mutual disambiguation statistics obtained from RAPUNSEL interaction session.
    • Study differences in the efficacy of mutual disambiguation for children versus adults.
  • Motivations
    • Children tend to gesture more often than adults in multimodal interfaces.
    • Unimodal speech recognition is more difficult with children than with adults.
    • The hypothesis is that opportunities to take advantage of mutual disambiguation will occur more often as more speech recognition errors occur and more gesture input is available.



97
RAPUNSEL Use Study Procedure
  • Children will participate in sessions where they will be asked to interact with their character in the RAPUNSEL 3D world to accomplish a given set of tasks, including:
    • various character navigation goals
    • creation and modification of objects
    • acquisition of answers to questions about objects in the world
98
RAPUNSEL Use Study Subjects
  • Each interaction session will require ~30 middle school aged volunteers.
  • Multiple sessions will be conducted simultaneously in computer classes during regular school hours.
  • The set of students working on a particular session will be randomly assigned.
  • The RAPUNSEL project’s subject pool is already much larger.
99
RAPUNSEL Use Study Variables
  • Independent Variables
    • Tasks to be accomplished during each interaction session.
    • Available time for each interaction session
  • Dependent Variables
    • Unimodal gesture input usage percentage
    • Unimodal speech input usage percentage
    • Multimodal input usage percentage
    • Elapsed time between modalities in multimodal inputs.
    • Mutual disambiguation
      • Usage percentage
      • Pull-ups
100
RAPUNSEL Use Study Materials
  • Desktop, Laptop, or Tablet PC with
    • Multimodal RAPUNSEL interface software
    • Supporting libraries
    • Built-in or external microphone
    • Mouse, stylus, touchpad, or TrackPoint


101
Thesis Study Schedule: Phase I
102
Thesis Study Schedule: Phase II
103
Thesis Study Schedule: Phase III
104
Thesis Study Deliverables
  • Multimodal framework API with documentation.
  • NSF grant proposal.
  • Multimodal framework API report.
  • RAPUNSEL application version with multimodal interface.
  • Statistics gathered from RAPUNSEL multimodal interface usability test.
  • Report on multimodal usage patterns and mutual disambiguation for children.
  • Ph.D. Doctoral Dissertation covering this effort in detail.