|
1
|
- NYU Computer Science
- Ph.D. Thesis Proposal
- by
- Christopher A. Robbins
|
|
2
|
- Multimodal Interface Fundamentals
- Human Factors in Speech Based Interfaces
- Speech and Gesture Based Multimodal Interface Design History
- Related Speech and Gesture Based Multimodal Interface Studies
|
|
3
|
- Proposed Thesis Overview and Goals
- Current Work in Multimodal Frameworks
- Proposed Framework Implementation
- Proposed Children’s Multimodal Usage Study
- Schedule and Deliverables
|
|
4
|
|
|
5
|
- Multimodal Interfaces process two or more combined user input modes in a
coordinated manner with multimedia system output.
- Past multimodal interfaces have combined such input modes as speech,
pen, touch, manual gesture, gaze, and head and body movements.
|
|
6
|
- Choice of modality for user to
- convey different types of input,
- use combined modes, or
- alternate between modes
- Potential to accommodate a broader range of users
- Prevents overuse of and physical damage from relying on a single modality
- Ability to accommodate changing conditions
- Efficiency gains
- Improved error handling
|
|
7
|
- Conventional GUIs
- Assume there is a single event stream that controls event loop with
processing being sequential
- Assume interface actions are atomic and unambiguous
- Do not require temporal constraints. Architecture is not time sensitive.
- Multimodal Interfaces
- Typically process continuous and simultaneous input from parallel input
streams
- Process input modes using recognition based technology
- Require time stamping of input and development of temporal constraints
on mode fusion operations.
|
|
8
|
- Multimodal Interfaces – interfaces that process two or more input modes
in a coordinated manner with multimedia output.
- Active Input Modes – modes that are intentionally deployed by the user as
an explicit command.
- Passive Input Modes – modes that consist of naturally occurring user
behaviors or actions that provide input to an application and do not
require an explicit command.
- Blended Multimodal Interfaces – multimodal interfaces that incorporate
at least one passive and one active input mode
- Temporally-cascaded Multimodal Interfaces – multimodal interfaces whose
modalities tend to be sequenced in a particular temporal order such that
partial information from one modality is available to constrain
interpretation of a later mode.
|
|
9
|
- Mutual Disambiguation – disambiguation of the input content of a possibly
error-prone modality using partial information supplied by another.
- Deictic Gestures – a gesture that contributes to the identification of
an object (or group of objects) merely by indicating their location.
- Referent Object(s) – the object (or group of objects) to which a Deictic
Gesture refers.
- Feature Level Fusion – method for fusing low-level information from
parallel input streams within a multimodal architecture.
|
|
10
|
- Semantic-level Fusion – method of integrating semantic information
derived from parallel input modes in a multimodal architecture.
- Frame-based Integration – pattern matching technique used for merging
attribute value data structures in order to fuse semantic information
from two input modes into a common meaning.
- Unification-based integration – a logic based method used for
integrating partial meaning fragments from two input modes into a common
meaning.
- Windows-Icon-Mouse-Pointer (WIMP) Interfaces – refers to conventional OS
and application interfaces based upon the window and icon metaphor GUI using the
mouse as a pointing device.
|
|
11
|
- …are multimodal interfaces combining speech recognition and gesture
recognition.
- ...are the most pervasive and thoroughly studied multimodal input
combination to date.
- …represent the most accessible multimodal combination as microphones and
touchpad or pen input devices are widely available and often packaged
with current personal computers and laptops. Gesture based input can
also be provided via conventional mouse input devices.
- ...design and implementation guidelines are far from full maturity,
leaving much opportunity for research.
|
|
12
|
- Important to differentiate between command based and natural language
speech interfaces:
- Command speech involves the recognition of token words or phrases,
whereas
- Natural language involves the recognition of the full sentence
structures and richness of a natural spoken language such as English.
- Voice quality concerns in speech recognition
- Interpersonal variations in voice caused by physical features, cultural
speech patterns, gender, age, native language, and accents.
- Intrapersonal variability with time of day, health, background noise
|
|
13
|
- WIMP based GUIs avoid short term memory requirements by continuously
presenting a discrete set of allowed input choices.
- However for speech based interfaces to benefit from their inherent
expressiveness, memory is relied upon to recall the set of allowed input
commands.
- Thus speech based interfaces often need to provide a guidance system
similar to those presented by GUIs (i.e. “What can I say?”) as an
initial training mode. The guidance system also helps to establish the
actual vocabulary available to the user.
|
|
14
|
- Speech based interfaces also need to address problems arising from
clarification, back-channel utterances, dialog repair, turn taking, and
topic introduction.
- Clarification – repetition of the same command request using new syntax
- Back-channel utterances – “uh huh”, “um”, and “ah”
- Dialog repair – pause in speech to backtrack and rephrase input
- Turn taking – human-computer speech interfaces require complete input
commands versus interactive commands common in human-human interaction
- Topic introduction – can change the semantic meaning of syntactically
similar commands
- The conclusion of an input command utterance is also more ambiguous in
speech based input
|
|
15
|
- “Put-That-There” – Bolt 1980
- CUBRICON – Neal 1990
- XTRA – Wahlster 1991
|
|
16
|
- Developed in 1980 by the Architecture Machine Group at MIT
- Led by Richard Bolt
- Most highly referenced multimodal system.
- One of the earliest multimodal interfaces
- Combined voice and gesture based input.
|
|
17
|
- Interaction takes place in a
media room about the size of a personal office.
- Gesture based input is primarily deictic arm movements used to point at
and drag objects
- Speech input allows for simple English sentence structures using a
limited vocabulary.
|
|
18
|
- Example scenario consists of input requests for performing the following
commands on basic geometric objects:
- creating
- customizing
- moving
- copying
|
|
19
|
- Example speech input commands include:
- “Create a blue square there”
- “Make that smaller”
- “Put that there”
- Deictic gestures are used to resolve pronoun ambiguity.
- Uses a limited speech recognition vocabulary to constrain input commands,
thus improving efficiency.
|
|
20
|
- Developed in 1990 as a military situation assessment tool.
- Led by J. Neal
- Combined spoken and typed natural language with deictic gestures for
both input and output
|
|
21
|
- Similar to “Put-That-There” in that it utilizes pointing gestures to
clarify references to entities based on simultaneous NL input
- Introduced the concept of generating a multimodal language based upon a
dynamic knowledge base, user model, and discourse model
|
|
22
|
|
|
23
|
- The following knowledge sources are used by the parser to interpret the
compound input stream:
- Lexicon – dictionary of words/commands
- Grammar – defines the multimodal language
- Discourse Model – dynamically maintains knowledge pertinent to the
current dialog
- User Model – aids in interpretation based on user profile, goals, and
plans
- Knowledge Base – contains information related to the task space
|
|
24
|
- Mouse input used to select on-screen content such as windows, tables,
icons, and points.
- Spoken and written input to specify action(s) that refer back to
selected objects:
- Multiple object selections are allowed per spoken input, as in “What are
these items?” while sequentially pointing to several objects.
- Provided initial insight into the use of multiple input modes working
cooperatively to achieve greater accuracy.
|
|
25
|
- Introduced design guidelines for multimodal gesture and speech output
- System visually indicates referent objects in parallel with output
speech
- Indicates containing windows of referent objects
- Weakly highlights multiple windows and icons if more than one instance
of a referent object exists
|
|
26
|
- XTRA (eXpert TRAnslator)
- Introduced in 1991 as an intelligent multimodal interface to expert
systems
- Researched and developed by Wolfgang Wahlster at the German Research
Center for Artificial Intelligence
|
|
27
|
- Builds upon User and Discourse Model Research
- Combines natural language, graphics, and pointing for both input and
output.
- Uses focus granularity based gesture analysis methodology that…
- constrains referents in speech to a gesture based region
- aids the system in interpretations of subsequent definite noun phrases
which refer to objects in the focused area
|
|
28
|
- Use of XTRA to facilitate filling in a tax form.
- Gesture based input and output both occur in the tax form display
section to the left.
- Natural language input and response text are displayed in the panel to
the right.
|
|
29
|
- Four levels of granularity are provided for indicating the selected area
of dialog focus:
- Exact pointing with a pencil
- Standard pointing with the index finger
- Vague pointing with the entire hand
- Encircling a region with an ‘@’ sign
- Also allows three types of gestures: point, underline and encircle.
|
|
30
|
- Dialog focus used to indicate areas of attention.
- Thus focusing gestures modify the discourse model.
- Removes ambiguity of language based references to multiply occurring
fields or objects
|
|
31
|
- Discourse model also modified based upon intrinsic and extrinsic
relations.
- For example the phrase, “the right circle”:
- Intrinsically implies circle 5 if no pointing gesture is present
- Extrinsically implies circle 3 if the pointing gesture is considered
|
|
32
|
- XTRA’s output combines language and gesture based upon the discourse
model.
- Gesture output capability includes the same granular area indicators,
attention specifiers, and intrinsic versus extrinsic assumptions used
for input.
|
|
33
|
- QuickSet – Cohen 1997
- Human-Centric Word Processor – Vergo 1998
- Portable Voice Assistant – Bers 1998
- Field Medic Information System – Holzman 1999
|
|
34
|
- Java-based architecture for providing speech and pen multimodal
interfaces for map-based applications
- Collaborative multimodal system designed to run on multiple platforms
and work with a collection of distributed applications
|
|
35
|
- 2D map oriented design takes advantage of research indicating that
multimodal gesture and speech interfaces are best suited for spatial
interaction.
- Supports gesture symbols for creating and placing entities on the map
- Allows map annotation using points, lines, and areas
|
|
36
|
- LeatherNet is a military set-up and control simulation using the
QuickSet Multimodal Interface architecture
- Symbols used by LeatherNet appear to the left
|
|
37
|
- Provides a general framework to facilitate the inclusion of multimodal
interfaces with 2D map based applications
- Introduces a framework supporting gesture input symbols
|
|
38
|
- Word processing system that allows content to be input via continuous
real-time dictation
- Content can then be edited using a speech and pen based interface.
- Addresses desire to correct and edit text obtained via speech dictation
- Designed by IBM research
|
|
39
|
- Provides a conventional text editing interface using the keyboard
- Spatial text editing allows for multimodal pen input to format, correct,
and manipulate dictated text
- Example input commands include: “Delete this word”, or “Underline from
here to here”.
- Usability tests showed the multimodal text editing capability reduced task
completion time relative to unimodal speech editing
|
|
40
|
- Addressed the human speech factor of uncertain command conclusion by using
a microphone with a toggle switch.
- Avoids problems arising from a modeless speech input interface.
- Results showed users quickly adopted modal speech input.
|
|
41
|
- Pen and speech recognition architecture for developing multimodal
interfaces for online web content.
- Designed to run on mobile pen based computers with microphones and a
wireless connection to the Internet (i.e. Tablet PCs).
|
|
42
|
- VoiceLog is a Web applet using PVA
- Allows users to browse an online catalog and order parts.
- Also allows users to inspect technical diagrams and schematics of
machines and parts to determine correct parts
|
|
43
|
- Scenario (U: user, VL: VoiceLog)
- U: “Show me the humvee.”
- VL: displays image of HMMWV
- U: “Expand engine.”
- VL: flashes area of the engine and replaces the image with a part
diagram for the engine
- U: (while pointing at a region with the pen) “Expand this area.”
- VL: Highlights area, the fuel pump, and displays expanded view.
- U: (points to a specific screw in diagram) “I need 10.”
- VL: brings up order form filled out with the part name, part number,
and quantity fields for the item.
|
|
44
|
- Modular multimodal architecture design
- Centralized speech recognition server
- Simple single windowed interface
- Web based architecture
|
|
45
|
- Portable system providing a speech and pen based multimodal interface
allowing medical personnel to remotely document patient care and status
in the field
- Consists of two hardware devices:
- Wearable computer with headset
- Handheld tablet PC
|
|
46
|
- Illustrates the benefits of multimodal interface adaptability by
providing a hands-free speech input device with the ability to switch to
a pen based input device for complex interactions.
- Note that the Field Medic Information System is multimodal because it
allows the user to choose from multiple input modes. However, it was not
designed to process these input modes simultaneously.
|
|
47
|
- Human Factors in Integration and Synchronization of Input Modes
- Mutual Disambiguation of Recognition Errors
- Multimodal Interfaces for Children
|
|
48
|
- Study conducted by Oviatt to explore multimodal integration and
synchronization patterns that occur during pen and speech based
human-computer interaction.
- Evaluate the linguistic features of spoken multimodal interfaces and how
they differ from unimodal speech recognition interfaces.
- Determine how spoken and written modes are naturally integrated and
synchronized during multimodal input construction
|
|
49
|
- Users were asked to perform the real estate agent task of selecting
suitable homes based on a given client profile
- Homes were displayed on a map, and clients specified environment
features they preferred
- Users could interact with interface using speech, gesture, or both.
|
|
50
|
- 100% of users used both spoken and written commands
- Interviews revealed that 95% of users preferred to interact multimodally
|
|
51
|
|
|
52
|
- In 100% of multimodal inputs, pen was used to indicate location and
spatial information
|
|
53
|
- 82% of multimodal input constructions involved draw and speak, while
point and speak represented only 14%.
- Of draw and speak constructions, 42% were simultaneous, 32% sequential,
and 12% were compound (i.e. draw graphic, then point to it while
speaking).
|
|
54
|
- Study conducted by Oviatt to research the benefits of mutual
disambiguation for avoiding and recovering from speech recognition
errors
- Participants were asked to interact with a map-based multimodal system
in which the user sets up simulations for community and fire control
scenarios
- During the course of performing the requested simulation set-up tasks,
each subject entered about 100 multimodal commands
|
|
55
|
- Research the benefits of using mutual disambiguation to reduce
multimodal input recognition error rates.
- Explore differences in mutual disambiguation usage frequency based on
gender and on native versus non-native speakers
|
|
56
|
- A Mutual Disambiguation dependent measure was developed:
- Rate of mutual disambiguation per subject (MDj)
- Percentage of all scorable integrated commands (Nj) in which
the rank of the correct lexical choice on the multimodal n-best list (RiMM)
was lower than the average rank of the correct lexical choice on the
speech and gesture n-best lists (RiS and RiG),
minus the number of commands in which the rank of the correct choice was
higher on the multimodal n-best list than its average rank on the speech
and gesture n-best lists, as expressed in the formula below.
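- A reconstruction of the omitted formula, written directly from the verbal
definition above (Sign returns +1, 0, or -1):

MD_j = \frac{100}{N_j} \sum_{i=1}^{N_j} \mathrm{Sign}\!\left( \frac{R_i^{S} + R_i^{G}}{2} - R_i^{MM} \right)

- Each command contributes +1 when the correct choice ranks better (lower) on
the multimodal n-best list than its average unimodal rank, -1 when it ranks
worse, and 0 when the ranks are equal.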
|
|
57
|
- Measure of Multimodal Pull-Ups –
- Multimodal pull-ups occur during MD when the correct lexical choice for
speech, for gesture, or for both was retrieved from a worse-ranked
position than first choice on the respective n-best list(s)
- Percentage of correct…
- Unimodal Speech Recognitions
- Unimodal Gesture Recognitions
- Multimodal Speech and Gesture Recognitions
|
|
58
|
- Mutual Disambiguation produced 1 in 8 of the correctly recognized
multimodal inputs.
- MD was used more often to correct non-native speaker inputs
- MD usage differences based on gender were insignificant
- Results are summarized in the table below:
|
|
59
|
- The rate of multimodal pull-ups for speech was higher for non-native
speakers (65%) than for native speakers (35%).
- Successful speech recognition alone was 72.6% for native speakers and
63.1% for non-native speakers.
- The overall recognition rate remained lower for non-native speakers
(71.7%) than for native speakers (77%); however, this gap is smaller than
for speech recognition alone.
|
|
60
|
- Study by Xiao into the multimodal interface interactions of children
- Performed a comprehensive analysis of children’s multimodal integration
patterns in a speech and pen based interface to compare with the
integration and synchronization patterns of adults.
|
|
61
|
- Participants aged 7 to 10 were asked to interact with a science education
application called Immersive Science Education for Elementary Kids
(I SEE!).
- I SEE! teaches about marine biology through animated marine animals
- Input involved asking marine animals questions formulated using speech,
pen, or both.
- Speech input recognition was initiated by a pen tap or writing.
|
|
62
|
- 6500 usable child utterances were obtained
- Unimodal speech usage dominated child utterances:
- 10.1% multimodal
- 80.4% unimodal speech
- 9.6% pen input alone
- 93.4% of multimodal pen input involved abstract scribbling
- Of interpretable pen input, 93.2% complemented correlating speech input.
|
|
63
|
- Similar to multimodal interaction patterns of adults:
- Dominant multimodal integration type was often adopted early in the
interaction session.
- Pen preceded speech in most multimodal utterances.
- Multimodal input was used to convey complementary semantic information.
- Sequential multimodal lag averaged 1.1 sec and ranged from 0 to 2.1
seconds, which is consistently less than that of adults.
- Children showed a higher tendency to engage in manual pen activity than
adults.
|
|
64
|
|
|
65
|
- Research challenges involved in advancing the field of multimodal
interfaces include:
- Creation of tools which facilitate the development of multimodal
software
- Development of metrics and techniques for evaluating alternative
multimodal systems
- Study of how people integrate and synchronize multimodal input during
human-computer interaction.
|
|
66
|
- Design of an Application Programming Interface (API) Framework for
developing speech and gesture based multimodal interfaces with a focus
on interfaces involving interaction with Web accessible 3D environments.
- Implement an illustrative application using the proposed framework to
study multimodal integration patterns in children and the efficacy of
mutual disambiguation usage with children.
|
|
67
|
- Provide unimodal gesture and speech recognition as well as multimodal
combined speech and gesture recognition support for:
- Navigating, creating, and populating 3D environments
- Controlling and directing inhabitants or avatars in such environments
- Web based support for applications using unimodal and multimodal speech
and gesture recognition
- Automated statistics gathering tools for both unimodal and multimodal
speech and gesture usage as well as multimodal disambiguation statistics
|
|
68
|
- Laptops and tablet PCs are well suited for multimodal speech and gesture
based interfaces as they include built-in:
- Microphones
- Gesture supporting input devices such as touch pads, TrackPoints, pens,
and styli.
- A framework goal is to ensure support of laptops and tablet PCs.
|
|
69
|
- Rutgers University Center for Advanced Information Processing (CAIP): Framework
for Rapid Development of Multimodal Interfaces.
- AMBIENTE – Workspaces for the Future: Multimodal Interaction Framework
for Pervasive Game Applications called STARS
- Pennsylvania State University and Advanced Interface Technologies, Inc.:
A real-time Framework for Natural Multimodal Interaction with Large
Screen Displays.
- W3C - Multimodal Framework Specification for the Web (unimplemented).
|
|
70
|
- Java based API for multimodal interface development.
- Highly generic late fusion multimodal integration agent for managing
diverse input modes
- Differences from proposed framework:
- Web accessibility not considered
- No statistics collection
- No focus on 3D interaction
|
|
71
|
- .NET based framework supporting multiple input modes for interactivity
with board games and tabletop strategy or role playing games.
- Focus on support for social activity inherent in such games via
networking.
- Differences from proposed framework:
- Highly generic mode support
- No statistics collection
- Focus on 2D discrete gaming environments
|
|
72
|
- Java based framework to facilitate speech and gesture based multimodal
interfaces involving a large screen display
- Designed to facilitate multimodal interface design research for large
screen displays
- Differences from proposed framework:
- Focuses on multimodal interface output
- No statistics collection
- Not focused on 3D environment interaction
|
|
73
|
- Validate research into multimodal interface framework design and the
need for such interfaces
- Provide methodologies and design approaches for developing multimodal
frameworks.
- Reveal unexplored aspects of multimodal framework design.
|
|
74
|
- Java based API composed of the following components (see the sketch after
this list):
- Speech Recognition (JSAPI)
- Gesture Recognition (Customized HHReco)
- Multimodal Integration
- 3D Rendering Environment (Perlin Java 3D Renderer)
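- A minimal Java sketch of how these components could be exposed through a
single top-level API. All type and method names here (MultimodalFramework,
FeatureStructure, IntegrationAgent, etc.) are hypothetical placeholders for
the proposed design, not an existing implementation:

import java.util.List;
import java.util.Map;

// Hypothetical top-level API for the proposed framework. Each nested
// interface corresponds to one of the components listed above.
public interface MultimodalFramework {

    // Semantic-level result produced by any recognizer (see the feature
    // structure examples later in this proposal).
    interface FeatureStructure extends Map<String, Object> { }

    // Receives time-stamped feature structures from a single input mode.
    interface RecognitionListener {
        void onResult(FeatureStructure fs, long timestampMillis);
    }

    // Wraps a JSAPI-based command-and-control speech recognizer.
    interface SpeechRecognizer {
        void loadGrammar(String jsgfResource);           // JSGF grammar file
        void addListener(RecognitionListener listener);  // semantic results
    }

    // Wraps deictic/spatial gesture handling and HHreco symbol recognition.
    interface GestureRecognizer {
        void addListener(RecognitionListener listener);
    }

    // Fuses unimodal feature structures and exposes an n-best list.
    interface IntegrationAgent extends RecognitionListener {
        List<FeatureStructure> nBest();
    }

    SpeechRecognizer speech();
    GestureRecognizer gesture();
    IntegrationAgent integrator();
}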
|
|
75
|
- Supports command and control recognizers, dictation systems, and speech
synthesizers
- Speech recognition for web accessible applets and applications
- Requires a JSAPI compatible implementation and native speech recognition
engine
- The native speech recognition engine can be packaged with an applet or
application and automatically downloaded via Java Web Start for convenient
web accessibility
|
|
76
|
- The same format will be made available to users of the framework for
specifying valid speech and multimodal input commands.
- JSGF “Put-That-There” Example:
#JSGF V1.0;
grammar PutThatThere;
public <Command> = <Action> <Object> [<Where>];
<Action> = put {PUT_ACTION} | delete {DEL_ACTION};
<Object> = <Pronoun> | ball {BALL_OBJECT} | square {SQUARE_OBJECT};
<Where> = (above | below | to the left of | to the right of) <Object>;
<Pronoun> = (that | this | it) {OBJECT_REFERENT} | there {LOCATION_REFERENT};
|
|
77
|
- A JSGF file can be loaded as the foundation for a recognized command
grammar.
- However, this grammar can be dynamically modified by the application based
on changes to the user model, discourse model, or knowledge base, as
sketched below.
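- A minimal sketch of this load-then-modify cycle, assuming a JSAPI 1.0
compliant engine (e.g. via the TalkingJava SDK discussed on the next
slide). The grammar file name and the added cone token are illustrative
only:

import java.io.FileReader;
import javax.speech.Central;
import javax.speech.recognition.Recognizer;
import javax.speech.recognition.Rule;
import javax.speech.recognition.RuleGrammar;

// Load the JSGF grammar above, then widen one rule at run time, e.g.
// after a new object type enters the knowledge base.
public class DynamicGrammarDemo {
    public static void main(String[] args) throws Exception {
        Recognizer recognizer = Central.createRecognizer(null); // default engine
        recognizer.allocate();

        // Load the base "Put-That-There" style grammar from a JSGF file.
        RuleGrammar grammar =
            recognizer.loadJSGF(new FileReader("PutThatThere.gram"));
        grammar.setEnabled(true);
        recognizer.commitChanges();      // make the grammar active

        // Later: the application extends the <Object> rule dynamically.
        Rule widened = grammar.ruleForJSGF(
            "<Pronoun> | ball {BALL_OBJECT} | square {SQUARE_OBJECT}"
            + " | cone {CONE_OBJECT}");
        grammar.setRule("Object", widened, true);
        recognizer.commitChanges();      // changes take effect on commit

        recognizer.requestFocus();
        recognizer.resume();             // begin listening
    }
}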
|
|
78
|
- CloudGarden’s TalkingJava SDK
- Full JSAPI specification support
- Integration capability with Microsoft Speech Application SDK
- Microsoft Speech Application SDK
- Text-to-speech engine supporting multiple voices
- User trainable speech recognition engine
- Supported speech input devices will include built-in laptop or tablet PC
microphones, desktop microphones, and headset microphones.
|
|
79
|
- Input handler to be implemented for non-symbol gestures including:
- Deictic gestures
- Spatial gestures
- Conventional mouse equivalent gestures
- Symbolic gestures handled through a modified HHreco symbol recognition
toolkit
- Java API
- Adaptive multi-stroke symbol recognition system
- Designed for use off-the-shelf or customized
- Supported gesture input devices will include conventional mouse,
touchpad, TrackPoint, and stylus.
|
|
80
|
- Unification is an operation that takes multiple inputs and combines them
into a single interpretation
- Framework will utilize semantic unification to produce an N-best list
consisting of plausible unimodal and multimodal interpretations
- To be implemented partially based on QuickSet multimodal unification
methodology:
- Described by Cohen and Johnston
- Based on a typed feature structure (FS) consistently designed across
unimodal and multimodal input modes.
|
|
81
|
|
|
82
|
- Feature structure resulting from the user drawing a cube symbol gesture at
(x, y, z) world coordinates (100.0, 0.0, 50.0) of size (2.0, 2.0, 2.0) at
time 10min, 20s, 200ms:
- Gesture_FS {
Classification: OBJECT
Type: CUBE
Command: INSTANTIATE
Location: (100.0, 0.0, 50.0)
Size: (2.0, 2.0, 2.0)
Time: 10:20:200
. . .
}
|
|
83
|
- Feature structure resulting from the user saying, “Create green cube”:
- Speech_FS {
Classification: OBJECT
Type: CUBE
Command: INSTANTIATE
Color: green
Time: 10:14:200
. . .
}
|
|
84
|
- Multimodal feature structure resulting from integrating the prior speech
and gesture feature structures:
- Multimodal_FS {
Classification: OBJECT
Type: CUBE
Command: INSTANTIATE
Color: green
Location: (100.0, 0.0, 50.0)
Size: (2.0, 2.0, 2.0)
Time: 10:14:200
. . .
}
|
|
85
|
- Consistent semantic-level feature structures are passed to the multimodal
integration agent from the speech and gesture recognition components.
- If multiple input interpretations are plausible, an N-best list is
produced.
- Selection of the final interpretation from the N-best list is based on
default priorities or customized priorities specified by the framework API
user (see the sketch below).
- This customization ability allows for testing various prioritization
settings.
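- A simplified Java sketch of semantic unification over the feature
structures shown on the previous slides. This is an illustration only, not
the QuickSet typed-feature-structure machinery: attribute/value maps stand
in for typed FSs, and the fused structure keeps the speech timestamp as in
the Multimodal_FS example above:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class FeatureStructureUnifier {

    // Returns the unified structure, or null if the inputs clash on a
    // shared attribute (other than the timestamp) and do not unify.
    static Map<String, Object> unify(Map<String, Object> speechFS,
                                     Map<String, Object> gestureFS) {
        Map<String, Object> result = new LinkedHashMap<>(speechFS);
        for (Map.Entry<String, Object> e : gestureFS.entrySet()) {
            Object existing = result.get(e.getKey());
            if (existing != null && !existing.equals(e.getValue())
                    && !e.getKey().equals("Time")) {
                return null;                           // conflicting values
            }
            result.putIfAbsent(e.getKey(), e.getValue());
        }
        return result;                                 // fused Multimodal_FS
    }

    public static void main(String[] args) {
        Map<String, Object> speech = new HashMap<>();   // "Create green cube"
        speech.put("Classification", "OBJECT");
        speech.put("Type", "CUBE");
        speech.put("Command", "INSTANTIATE");
        speech.put("Color", "green");
        speech.put("Time", "10:14:200");

        Map<String, Object> gesture = new HashMap<>();  // drawn cube symbol
        gesture.put("Classification", "OBJECT");
        gesture.put("Type", "CUBE");
        gesture.put("Command", "INSTANTIATE");
        gesture.put("Location", "(100.0, 0.0, 50.0)");
        gesture.put("Size", "(2.0, 2.0, 2.0)");
        gesture.put("Time", "10:20:200");

        System.out.println(unify(speech, gesture));
    }
}

- In the full framework, each plausible pairing of speech and gesture
interpretations would be unified this way and the successful results ranked
into the N-best list described above.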
|
|
86
|
- Java compatible API for software 3D rendering.
- Ability to create 3D content using primitives, such as cubes, cones,
cylinders, and spheres, as well as more complex objects defined on a
per-vertex and per-normal basis.
- Supports multiple directional lights and full camera location,
orientation, and aim functionality.
- Add-on utilities support importing VRML2.0 and Half-Life models.
- Support for native accelerated graphics cards through Java OpenGL (JOGL)
library.
|
|
87
|
- Communicates with multimodal integration agent to collect user input
statistics.
- Statistics based on feature structures generated by speech, gesture, and
multimodal integration components.
- Statistics collected include:
- Percentage of each input type usage
- Elapsed time between modalities for multimodal input
- Percentage of correct results in interpreting each input
- Mutual disambiguation usage statistics
- Gathered statistics are logged to a server database or file, or passed to
the user via a form mailer. (A sketch of such a statistics collector
follows below.)
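- A hypothetical sketch of such a statistics collector. It tallies input
types as the integration agent reports them and derives the usage
percentages and the per-subject mutual disambiguation rate defined earlier;
all names are placeholders:

import java.util.EnumMap;
import java.util.Map;

public class UsageStatistics {
    enum InputType { UNIMODAL_SPEECH, UNIMODAL_GESTURE, MULTIMODAL }

    private final Map<InputType, Integer> counts = new EnumMap<>(InputType.class);
    private int scorableMultimodal = 0;   // N_j in the MD measure
    private int mdSignSum = 0;            // running sum of Sign terms

    // Called once per interpreted user input.
    void recordInput(InputType type) {
        counts.merge(type, 1, Integer::sum);
    }

    // Called per scorable multimodal command with the n-best ranks of the
    // correct lexical choice on each list.
    void recordRanks(int speechRank, int gestureRank, int multimodalRank) {
        scorableMultimodal++;
        double avgUnimodal = (speechRank + gestureRank) / 2.0;
        mdSignSum += (int) Math.signum(avgUnimodal - multimodalRank);
    }

    double usagePercent(InputType type) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0 : 100.0 * counts.getOrDefault(type, 0) / total;
    }

    double mutualDisambiguationRate() {   // MD_j as a percentage
        return scorableMultimodal == 0 ? 0 : 100.0 * mdSignSum / scorableMultimodal;
    }
}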
|
|
88
|
- Implement multimodal interface for RAPUNSEL project using framework API
- Analyze both unimodal and multimodal gesture and speech usage patterns
in middle school aged children
- Explore benefits of mutual disambiguation use with children.
|
|
89
|
- Provide a representative application built upon the proposed framework
API
- Illustrate the use of the framework’s statistics gathering capabilities
to acquire data to:
- perform an analysis of children’s interactions with multimodal
interfaces, and
- investigate the benefits of mutual disambiguation use with children
- Publish results of this analysis such that they contribute to the design
of multimodal interfaces for children involving speech and gesture
recognition
|
|
90
|
- Real-time Applied Programming for Underrepresented Children’s Early
Literacy.
- Ongoing three-year NSF funded research project.
- Software environment to introduce programming to middle school aged
children through a socially engaging networked game.
- Children adopt interactive animated characters that they can engage in
adventures within a 3D world.
|
|
91
|
- Allow child users to interact multimodally with the 3D RAPUNSEL world to
- navigate his or her character,
- create and interact with other characters and objects, and
- ask questions about objects and the relationship between objects.
|
|
92
|
- C: “Make green ball.” Makes a circular symbol gesture at a location in the
3D world scene.
(new ball appears at the gestured-to location)
- C: “No! Make it red.”
(same ball turns red)
- C: “Make a river like this.” Draws the river’s path along the ground.
(a river following that path appears between the child’s character and the
ball)
- C: “Walk to it.” No gesture. ‘it’ is assumed to be the river based on the
discourse model.
(character walks to the river)
- C: “No! Walk to the [unintelligible word].” Points at the ball. Mutual
disambiguation used to establish the unintelligible word was ‘ball’.
(character shakes head as it cannot cross the river)
- C: “Can you cross this?” Points to the character and then gestures along
the river’s path.
(character speech balloon appears stating a bridge is required to cross
the river)
- C: “Create a bridge in front of Wobble.” No gesture. Wobble is the
character’s name.
(a bridge appears in front of Wobble)
|
|
93
|
- navigating and directing child’s character, e.g., “Walk to the ball.” or
“Pick up the ball.”;
- creating objects, e.g., “Create a ball in front of the tree.”;
- changing objects, e.g., “Make it green.”; and
- asking questions, e.g., “Can my character cross the river?”
|
|
94
|
- navigating and directing child’s character, e.g. dragging gesture from
character’s location to another and pointing to an object to be picked
up;
- creating objects, e.g. a circular symbol gesture to create a ball at the
location of the gesture, or a river symbol gesture drawn along a path to
create a river; and
- moving objects, e.g., pointing to and dragging an object from one
location to another.
|
|
95
|
- navigating and directing child’s character, e.g. “Walk to this.” while
pointing to an object to be walked to, or “Pick the ball up.” while
pointing to a particular ball;
- creating objects, e.g. “Create a green ball here.” while pointing to a
location;
- changing objects, e.g. “Make this bridge wider.”, while dragging along
the width of the bridge;
- moving objects, e.g. “Move this from here to here” as child first points
to object then points to the new desired location; and
- asking questions, e.g. “What is this?” said whilst pointing to an
unknown object.
|
|
96
|
- Objectives
- Collect and analyze mutual disambiguation statistics obtained from
RAPUNSEL interaction session.
- Study differences in the efficiency of mutual disambiguation for
children versus adults.
- Motivations
- Children tend to gesture more often than adults in multimodal
interfaces.
- Unimodal speech recognition is more difficult with children than adults
- The hypothesis is that opportunities to take advantage of mutual
disambiguation will occur more often as more speech recognition errors
occur and more gesture input is available
|
|
97
|
- Children will participate in sessions where they will be asked to interact
with their character in the RAPUNSEL 3D world to accomplish a given set
of tasks including:
- various character navigation goals
- creation and modification of objects
- acquisition of answers to questions about objects in the world
|
|
98
|
- Each interaction session will require ~30 middle school aged volunteers.
- Multiple sessions will be conducted simultaneously in computer classes
during regular school hours.
- The set of students working on a particular session will be randomly
assigned.
- RAPUNSEL project subject pool is already much larger.
|
|
99
|
- Independent Variables
- Tasks to be accomplished during each interaction session.
- Available time for each interaction session
- Dependent Variables
- Unimodal gesture input usage percentage
- Unimodal speech input usage percentage
- Multimodal input usage percentage
- Elapsed time between modalities in multimodal inputs.
- Mutual disambiguation
- Usage percentage
- Pull-ups
|
|
100
|
- Desktop, laptop, or tablet PC with
- Multimodal RAPUNSEL interface software
- Supporting libraries
- Built-in or external microphone
- Mouse, stylus, touchpad, or TrackPoint
|
|
101
|
|
|
102
|
|
|
103
|
|
|
104
|
- Multimodal framework API with documentation.
- NSF grant proposal.
- Multimodal framework API report.
- RAPUNSEL application version with multimodal interface.
- Statistics gathered from RAPUNSEL multimodal interface usability test.
- Report on multimodal usage patterns and mutual disambiguation for
children.
- Ph.D. dissertation covering this effort in detail.
|