MultiModal Applications: Here Today, Gone Tomorrow
From MobileDesign
While multi-modal technology greatly enhances the user experience of mobile applications and web sites, it is rarely actually implemented outside of voice dialing. We believe this is due to the development costs associated with enhancing a web site or application with speech I/O. [V-Enable] now provides the technology to make this easy, but we are afraid that the company won't survive.
MultiModal Browser
Mobile phones have several features that distinguish them from desktops: they have small screens, limited text input, intermittent network connections, and users who are variously highly distracted and highly engrossed. One capability that phones have and desktops do not is ubiquitous voice input. On a phone, speech is usually significantly easier than keypad or stylus input.
An alternate browser design might have the following features:
- Standard visual display and button or stylus input for every page
- Optional audio reading of any given screen
- Audio components of a web page as determined by the page designer (e.g., SMIL)
- Speech input for every page, accessed via a "Talk to Me" button. We don't want to leave the microphone open all the time, but we can't assume that the user is going to be able to respond exactly when the phone is expecting it. The increasingly popular Push-to-Talk buttons would be one possible candidate for this function.
This mix of inputs and outputs allows the browser to be used while walking down the street or while in a meeting. Designing sites for such a multimodal browser would not be challenging, but it could not be done in the same fashion as designing sites for desktop browsers.
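The "Talk to Me" button in the feature list above can be sketched as a small recorder that buffers audio only while the button is held, so the microphone is never left open. This is a minimal illustration, not any vendor's API: the class name `PushToTalkRecorder` and its methods `press`, `on_audio_frame`, and `release` are hypothetical, and a real handset would feed frames from its own audio callback.

```python
from dataclasses import dataclass, field


@dataclass
class PushToTalkRecorder:
    """Buffers audio frames only while the talk button is held down (hypothetical sketch)."""

    recording: bool = False
    frames: list = field(default_factory=list)

    def press(self) -> None:
        # Start a fresh utterance; anything left from an earlier press is dropped.
        self.frames = []
        self.recording = True

    def on_audio_frame(self, frame: bytes) -> None:
        # The microphone callback may fire at any time; frames are ignored
        # unless the button is currently down.
        if self.recording:
            self.frames.append(frame)

    def release(self) -> bytes:
        # Stop recording and hand back the utterance for upload to a speech server.
        self.recording = False
        utterance = b"".join(self.frames)
        self.frames = []
        return utterance
```

Because the user decides when capture starts and stops, the page never has to guess when the user is ready to respond.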
MultiModal Technology
The technology side is not difficult. VoiceXML and servers from companies such as [Nuance] provide speech processing in a structured environment. Alternatively, [X+V (xHTML + VoiceXML)] could be used. The device would have to capture the speech waveform and transmit it across the network, but phones already do this routinely.
A key problem with integrating speech recognition into browsing is the question of where to process the speech itself. While there are large-vocabulary embedded speech recognition technologies, such as that provided by [VoiceSignal], they tend to focus on a fixed large vocabulary. Browsing and downloaded applications need different vocabularies for each screen, preventing the optimization necessary to make embedded recognition work well. We therefore need to rely on a server solution for speech processing. Fortunately, the experience companies like [Nuance] have with VoiceXML will serve well.
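To make the per-screen vocabulary point concrete, here is a rough sketch of how a client might derive a small vocabulary from the links on the current screen and send it to the server along with the captured audio. Everything here is an assumption for illustration: the function names `screen_vocabulary` and `build_request`, the request fields, and the idea of using link labels as spoken phrases are not taken from any product mentioned in this article.

```python
import re


def screen_vocabulary(link_labels):
    """Map each spoken phrase to the link label it should activate."""
    vocabulary = {}
    for label in link_labels:
        # Normalise to lowercase words so "Top Stories!" is spoken as "top stories".
        phrase = " ".join(re.findall(r"[a-z0-9']+", label.lower()))
        if phrase:
            # If two labels normalise to the same phrase, the first one wins.
            vocabulary.setdefault(phrase, label)
    return vocabulary


def build_request(screen_id, link_labels, audio):
    """Bundle the utterance with the phrases that are valid on this screen."""
    vocabulary = screen_vocabulary(link_labels)
    return {
        "screen": screen_id,              # hypothetical field names
        "phrases": sorted(vocabulary),    # the only phrases the server must consider
        "audio": audio,
    }
```

The server then only has to distinguish among a handful of phrases per request, which is the structured situation that VoiceXML-style grammars are designed for.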
One solution is proposed by [kirusa], which has a server-side X+V solution that will theoretically work with existing browsers. Their current solution relies on devices handling voice calls and data connections simultaneously and switching between them gracefully. Our experience working with various devices suggests that this is a bit optimistic. Voice calls will take over the screen, so whenever a visual component needs to be displayed the voice call has to drop. However, it is an adequate solution that works on existing networks and devices; it is not vaporware. They are undoubtedly working on convincing carriers to install a true X+V browser, with their gateway, on future phones.
[IBM] has worked with [Access] and [Opera] to create multimodal browsers on large PDA platforms, but has not yet built these browsers for mainstream devices (such as Palm and Symbian). The voice processing is done at the server and not on the local device. Expect these established phone browser vendors to make better inroads into multimodal browsers than kirusa will.
MultiModal Applications
[V-Enable] is addressing the application side of the multimodal space. They have an API, including the server functionality needed, written for BREW, Symbian, and J2ME MIDP2 (announced in time for the Oct. 2004 CTIA Wireless & IT show). Again, they are processing the speech at the server. The API has a lot of flexibility, including the handling of multiple possible recognitions for a single utterance, something not built directly into VoiceXML.
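Handling multiple possible recognitions for one utterance is where a multimodal application has an advantage over a voice-only one: ambiguous results can be shown on screen for the user to pick. The sketch below illustrates that idea only; it is not the V-Enable API. The function name `choose_action`, the `(phrase, confidence)` tuple format, and the `accept_at` and `margin` thresholds are all assumptions.

```python
def choose_action(hypotheses, valid_phrases, accept_at=0.8, margin=0.15):
    """Decide what to do with an n-best list of (phrase, confidence) pairs."""
    # Keep only hypotheses that make sense on the current screen, best first.
    candidates = sorted(
        (h for h in hypotheses if h[0] in valid_phrases),
        key=lambda h: h[1],
        reverse=True,
    )
    if not candidates:
        # Nothing usable was recognised; ask the user to speak again.
        return ("retry", [])
    best_phrase, best_score = candidates[0]
    runner_up = candidates[1][1] if len(candidates) > 1 else 0.0
    # Act immediately only when the top result is both confident and unambiguous.
    if best_score >= accept_at and best_score - runner_up >= margin:
        return ("accept", [best_phrase])
    # Otherwise list the top few on screen and let the user pick with a key press.
    return ("confirm", [phrase for phrase, _ in candidates[:3]])
```

For example, a confident and unambiguous "weather" would be accepted outright, while two near-tied city names would be listed on screen for a one-key confirmation.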
It would be straightforward to create a multimodal browser, or any other multimodal application, using V-Enable.
Unfortunately, we doubt that they will survive. While they are actively pursuing patents, the discussion above suggests a lot of prior art in the general space. Large application providers can replicate the technology; smaller application providers may not be able to afford the licensing fees in a space where people have difficulty monetizing their applications. A technology platform company may attempt to buy them, or may simply replicate the technology and bring a larger brand and customer base to it. In fact, IBM has a large J2ME presence; they may just be the people to buy the technology.
Conclusion
So why has no phone browser used both text and voice input and output? The technology is available.
We believe it is because few people have figured out how to monetize a wireless site, with or without this extra feature. Given the market's antipathy towards WAP as well, few people are interested in making the additional investment.
Applications -- including games -- are a little easier to monetize. It may be difficult to convince users to pay the extra money for a voice-enabled application without significant education. What we need is a major player, such as AOL or Yahoo, to lead the way.

