W3C User Interface Domain “Voice Browser” Activity

Mission Statement | FAQ | Implementations | Drafts | Email Archive | Activity Proposal | Charter | Working Group

  NEWS
      Speech Synthesis and Speech Grammars specs enter W3C Last Call review!
      New draft for Stochastic Language Models based upon N-Gram formalism.

Introduction

W3C is working to expand access to the Web so that people can interact with Web sites via spoken commands and by listening to prerecorded speech, music, and synthetic speech. This will allow any telephone to be used to access Web-based services, and will be a boon to people with visual impairments and to anyone who needs Web access while keeping their hands and eyes free for other things. It will also allow effective interaction with display-based Web content in cases where the mouse and keyboard may be missing or inconvenient.

We have set up a public mailing list for discussion of voice browsers and our work in this area. To subscribe, send an email to www-voice-request@w3.org with the word subscribe in the subject line (use the word unsubscribe if you want to unsubscribe). The archive for the list is accessible online.

Mission Statement

Far more people today have access to a telephone than have access to a computer with an Internet connection. In addition, sales of cellphones are booming, so that many of us already have, or soon will have, a phone within reach wherever we go. Voice Browsers offer the promise of allowing everyone to access Web-based services from any phone, making it practical to access the Web anytime and anywhere, whether at home, on the move, or at work.

It is common for companies to offer services over the phone via menus traversed using the phone's keypad. Voice Browsers offer a great fit for the next generation of call centers, which will become Voice Web portals to a company's services and related websites, whether accessed via the telephone network or via the Internet. Users will be able to choose whether to respond by a key press or a spoken command. Voice interaction holds the promise of naturalistic dialogs with Web-based services.

Voice browsers allow people to access the Web using speech synthesis, pre-recorded audio, and speech recognition. This can be supplemented by keypads and small displays. Voice may also be offered as an adjunct to conventional desktop browsers with high-resolution graphical displays, providing an accessible alternative to using the keyboard or screen, for instance in automobiles where hands- and eyes-free operation is essential. Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller.

Hitherto, speech recognition and spoken language technologies have, for the most part, had to be handcrafted into applications. The Web offers the potential to vastly expand the opportunities for voice-based applications. The Web page provides the means to scope the dialog with the user, limiting interaction to navigating the page, traversing links, and filling in forms. In some cases, this may involve the transformation of Web content into formats better suited to the needs of voice browsing. In others, it may prove effective to author content directly for voice browsers.
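As a hedged illustration (not taken from the working drafts themselves), the following sketch uses VoiceXML 1.0-style markup to show how a single page can scope the dialog to filling in one form field; the prompt wording, grammar file, and submission URL are invented for the example.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <!-- One form scopes the whole dialog: the browser speaks the prompt,
         constrains recognition to the referenced grammar, and submits the
         collected value back to the server. -->
    <form id="weather">
      <field name="city">
        <prompt>Which city would you like the weather for?</prompt>
        <grammar src="city.gram"/>
        <filled>
          <submit next="http://www.example.com/weather" namelist="city"/>
        </filled>
      </field>
    </form>
  </vxml>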

Information supplied by authors can increase the robustness of speech recognition and the quality of speech synthesis. Text to speech can be combined with pre-recorded audio material in an analogous manner to the use of images in visual media, drawing upon experience with radio broadcasting. The lessons learned in designing for accessibility can be applied to the broader voice browsing marketplace, making it practical to author content that is accessible on a wide range of platforms, covering voice, visual displays and Braille.
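As a rough sketch of how synthesized text can be combined with pre-recorded audio, the following uses elements in the style of the Speech Synthesis Markup Language drafts; the audio URL and prompt text are invented, and element names may vary between draft versions.

  <?xml version="1.0"?>
  <speak>
    <!-- Pre-recorded station identification, with synthesized fallback
         text if the audio file cannot be fetched. -->
    <audio src="http://www.example.com/audio/welcome.wav">
      Welcome to the example news service.
    </audio>
    <!-- The remainder of the prompt is rendered by the speech synthesizer. -->
    <paragraph>
      Here are today's <emphasis>top stories</emphasis>.
    </paragraph>
  </speak>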

W3C held a workshop on “Voice Browsers” in October 1998. The workshop brought together people involved in developing voice browsers for accessing Web based services. The workshop concluded that the time was ripe for W3C to bring together interested parties to collaborate on the development of joint specifications for voice browsers, particularly since these efforts concern subsetting or extending some of the core W3C technologies, for example HTML and CSS. As a response, an activity proposal was written to establish a W3C “Voice Browser” Activity and Working Group.

Following review by W3C members, this activity was established on 26 March 1999. The W3C staff contact and activity lead is Dave Raggett (W3C/Openwave). The chair of the Voice Browser Working Group is Jim Larson (Intel).

To get a feeling for future work, W3C and the WAP Forum held a joint workshop on the Multimodal Web in Hong Kong on 5-6 September 2000. The workshop addressed the convergence of W3C and WAP standards, and the emerging importance of speech recognition and synthesis for the Mobile Web. As an outcome, W3C is now drafting a proposed charter for a new working group dedicated to developing specifications for multimodal dialogs.

For a presentation on the Voice Browser Activity, see the Developer's Day talk given on 19th May 2000 in the Mobile track at the WWW9 conference held in Amsterdam. See also Tomorrow's Web, presented at WWW9 on May 16, which covers the challenges of dealing with an ever-increasing range of ways of accessing the Web.

See also the talk given to the WAP Forum in London on 15th September 1999.

Requirements and Working Draft Language Specifications

The W3C development process is described in the W3C process document. It defines a series of working drafts, followed by a last call working draft, a candidate recommendation, a proposed recommendation, and finally a recommendation.

The Voice Browser Working Group has specified the following working draft requirements documents and working draft language specifications, which are available at the Voice Browser Working Group page (W3C members only). For each requirements area, the corresponding markup language specification, the date and status of the current specification, and the date and status of the next specification are listed below.

  Speech grammars
      Speech Recognition Grammar: Last Call Working Draft, January 3rd, 2001. Next: Recommendation, estimated November 2001.
      Stochastic Language Models (N-Gram): Working Draft, January 3rd, 2001. Next: Recommendation, estimated November 2001.
      Semantic Interpretation Markup Language: no current draft. Next: Working Draft expected by March 2001.
  Voice dialogs
      Dialog Markup Language: no current draft. Next: Working Draft (corrections and updates to VoiceXML) expected January 2001.
  Speech synthesis
      Speech Synthesis Markup Language: Last Call Working Draft, January 3rd, 2001. Next: Recommendation, estimated November 2001.
  Natural language representation
      Natural Language Semantics Markup Language: Working Draft, November 20, 2000. Next: Last Call, estimated June 2001.
  Multimodal systems
      Multimodal Dialog Markup Language: no current draft; we anticipate that a new Working Group will be formed to take over the specification of the Multimodal Dialog Markup Language.
  Reusable dialog components
      Reusable Dialog Components: initially part of the Dialog Markup Language Working Draft; eventually individual reusable components will be defined in a separate document.
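To give a feel for the speech grammar work listed above, here is a small sketch in the XML form of a speech recognition grammar; the rule name and phrases are invented, and exact element and attribute names may differ between working drafts.

  <?xml version="1.0"?>
  <grammar version="1.0" root="drink" xml:lang="en-US">
    <!-- Constrains recognition to simple drink orders such as
         "a large coffee please". -->
    <rule id="drink" scope="public">
      <item repeat="0-1">a</item>
      <item repeat="0-1">
        <one-of>
          <item>small</item>
          <item>large</item>
        </one-of>
      </item>
      <one-of>
        <item>coffee</item>
        <item>tea</item>
        <item>orange juice</item>
      </one-of>
      <item repeat="0-1">please</item>
    </rule>
  </grammar>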

The various requirements documents use the following nomenclature to describe whether particular features in a requirements draft must be addressed in the final, official version of the specification. Note that these modifiers do not pertain to the desired features in any implementation, but only to the specification document itself.

  Must address: the first official specification must define the feature.
  Should address: the first official specification should define the feature if feasible, but may defer it until a future release.
  Nice to address: the first official specification may define the feature if time permits; however, its priority is low.
  Future revision: it is not intended that the first official specification include the feature.

Implementations

Several vendors have implemented VoiceXML 1.0 and are extending their implementations to conform to the markup languages in the W3C Speech Interface Framework. To be listed here, an implementation must be working and available for use by developers.

Tellme Studio allows anyone to develop their own voice applications and access them over the phone simply by providing a URL to their content. Visit http://studio.tellme.com to begin. The Tellme Networks voice service is built entirely with VoiceXML. Call 1-800-555-TELL to try this service.

Motorola has the Mobile Application Development Toolkit (MADK), a freely downloadable software development kit that supports VoiceXML 1.0 (as well as WML and VoxML). See http://www.motorola.com/MIMS/ISG/spin/mix/.

The IBM Voice Server SDK Beta Program is based on VoiceXML 1.0 and is available at http://www.alphaworks.ibm.com/tech/voiceserversdk.

Nuance offers graphical VoiceXML development tools, a Voice Site Staging Center for rapid prototyping and testing, and a VoiceXML-based voice browser to developers at no cost. See the Nuance Developer Network at http://extranet.nuance.com/developer/ to get started.

General Magic, http://www.generalmagic.com, has also implemented a version of VoiceXML 1.0.

VoiceGenie is sponsoring a developer challenge in association with VoiceXMLCentral, a VoiceXML virtual community and search engine. For details on the challenge, see: http://developer.voicegenie.com.

Frequently asked questions

Q1. Why not just use HTML instead of inventing a new language for voice-enabled web applications?

A1. HTML was designed as a visual language, with emphasis on visual layout and appearance. Voice interfaces are much more dialog oriented, with emphasis on verbal presentation and response. Rather than bloating HTML with additional features and elements, new markup languages were designed specifically for speech dialogs.
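As a hedged illustration of this dialog-oriented style, here is a short spoken menu in VoiceXML 1.0-style markup; the prompt wording and target URLs are invented.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <!-- The browser plays the prompt, listens for one of the choices,
         and transitions to the selected dialog. -->
    <menu>
      <prompt>Say news, weather, or sports.</prompt>
      <choice next="news.vxml">news</choice>
      <choice next="weather.vxml">weather</choice>
      <choice next="sports.vxml">sports</choice>
    </menu>
  </vxml>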

Q2. How does the W3C Voice Browser Working Group relate to the VoiceXML Forum?

A2. The VoiceXML Forum developed the dialog language VoiceXML 1.0, which it submitted to the W3C Voice Browser Working Group. The working group used that specification as a model for the Dialog Markup Language. In addition, the Voice Browser Working Group has augmented the Dialog Markup Language with the Speech Recognition Grammar Markup Language and the Speech Synthesis Markup Language. The VoiceXML Forum provides educational, marketing, and conformance-testing services. The two groups have a good working relationship and work closely together to enhance the ability of developers to create web-based voice applications.

Q3. What is the difference between VoiceXML, VXML, VoxML, and all the other voice markup languages?

A3. Historically, different speech companies created their own voice markup languages with different names. As companies integrated languages, new names were given to the integrated languages. IBM's original language was SpeechML. AT&T and Lucent both had a language called PML (Phone Markup Language), but each had a different syntax. Motorola's original language was VoxML. IBM, AT&T, Lucent, and Motorola formed the VoiceXML Forum and created VoiceXML (briefly known as VXML). HP Research Labs created TalkML. The World Wide Web Consortium Voice Browser Working Group has specified Dialog ML, using VoiceXML as a model.

Q4. Will WAP and Dialog Markup Language ever be integrated into a single language for specifying a combined verbal/visual interface?

A4. The Wireless Markup Language (WML) and the Dialog Markup Language were defined by different standards bodies. A joint W3C/WAP workshop was recently held to address this question. Some difficult obstacles to integration were identified, including differences in architecture (WAP uses a client-based browser, VoiceXML a server-based browser), as well as differences in language philosophy and style. The workshop adopted the “Hong Kong Manifesto,” which basically states that a new W3C working group should be created to address this problem and to coordinate activities to specify a multimodal dialog markup language supporting both visual and verbal user interfaces. The W3C Voice Browser Working Group has also approved the “Hong Kong Manifesto.” We anticipate that a new working group will be organized in the next few months.

Q5. What is the difference between Dialog Markup Language and SMIL?

A5. Synchronized Multimedia Integration Language (SMIL, pronounced “smile”) is a presentation language that coordinates the presentation of multiple visual and audio outputs to the user. The Dialog Markup Language coordinates both input from the user and output to the user. Eventually the presentation capabilities of SMIL should be integrated with the output capabilities of the Dialog Markup Language.
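For contrast, here is a minimal SMIL 1.0-style sketch that presents an audio clip and an image in parallel; the file names are invented.

  <smil>
    <body>
      <!-- Play the narration and show the matching slide at the same time. -->
      <par>
        <audio src="narration.wav"/>
        <img src="slide1.png" dur="10s"/>
      </par>
    </body>
  </smil>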

Q6. Where can I find specifications of the W3C Speech Interface Framework markup languages and how do I provide feedback to the W3C Voice Browser Working Group?

A6. The W3C Voice Browser Working Group's web page is www.w3.org/voice/. Current drafts of the W3C Speech Interface Framework markup languages can be found there. Comments and feedback may be e-mailed to www-voice@w3.org.

Q7. What speech applications cannot currently be supported by the W3C Speech Interface Framework?

A7. While the W3C Speech Interface Framework and its associated languages support a wide range of speech applications in which the user and computer speak with each other, there are several specialized classes of applications requiring greater control of the speech synthesizer and speech recognizer than the current languages support. The Speech Grammar Markup Language does not currently support the fine granularity necessary for detecting the speech disfluencies of disabled or foreign-language speakers, which may be required for “learn to speak” applications. There are currently no mechanisms to synchronize a talking head with synthesized speech. The Speech Synthesis Markup Language cannot specify melodies for applications in which the computer sings. We consider the Natural Language Semantics Markup Language a first step towards specifying the semantics of dialogs; because no context or dialog-history databases are defined, extra mechanisms must be supplied to perform advanced natural language processing. Speaker identification and verification and advanced telephony commands are not yet supported in the W3C Speech Interface Framework. Developers are encouraged to define objects that support these features.

Q8. When developing an application, what functions and features belong in the application, and what functions and features belong in the browser?

A8. A typical browser implements a specific set of features. We discourage developers from reimplementing these features within the application. New features should be implemented in the application. If and when several applications implement a new feature, the Working Group will consider placing the feature in a markup language specification and encouraging browsers to incorporate it. We discourage developers from creating downloadable browser enhancements because some browsers may not be able to accept downloads, especially browsers embedded in small devices and appliances.

Q9. What is the relationship between the Dialog Markup Language and programming languages such as Java and C++?

A9. Objects may be implemented using any programming language.
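As a hedged sketch of how the two fit together, a dialog can reference a platform-specific object (here a hypothetical speaker-verification component, written in VoiceXML 1.0-style object markup with an invented classid and parameter) while the component itself is implemented in Java, C++, or any other language.

  <?xml version="1.0"?>
  <vxml version="1.0">
    <form>
      <!-- The dialog only names and parameterizes the object; its
           implementation language is up to the platform. -->
      <object name="verify" classid="builtin://speaker-verification">
        <param name="minconfidence" value="0.75"/>
      </object>
    </form>
  </vxml>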

Q10. How has the voice browser group addressed accessibility?

A10. The voice browser group's work on speech synthesis markup language brings the same level of richness to synthesized aural presentations that users have come to expect with visual presentations driven by HTML. In this respect, our work picks up from the prior W3C work on Aural CSS. Next, our work on making speech interfaces pervasive on the WWW has an enormous accessibility benefit; speech interaction enables information access to a significant percentage of the population that is currently disenfranchised.

As the voice browser group, our focus has naturally been on auditory interfaces, and hence all of our work has a positive impact on the user group facing the most access challenges on the visual WWW today, namely blind and low-vision users. At the same time, we are keenly aware that the move to information access via the auditory channel raises access challenges for users with hearing or speaking impairments. For a hearing-impaired user, synthesized text should be displayed visually. For a speaking-impaired user, verbal responses may instead be entered via a keyboard.

Finally, we realize that every individual is unique in terms of his or her abilities; this is likely to become key as we move towards multimodal interfaces, which will need to adjust themselves to the user's current environment and functional abilities. Work on multimodal browsing will address this in the context of user and device profiles.

Dave Raggett dsr [at] w3 [dot] org, W3C $Date: 2001/01/04 12:51:42 $
