Published on Nov 23, 2015


Voice-based web access is a rapidly developing technology. PhoNET is a solution to many of the access problems faced by netizens. The basic idea is to use an ordinary phone to browse the web. The primary motivations are to provide a widely available means of creating new interactive voice applications, to address the need for mobility, and to address issues of inaccessibility.

The idea builds on the long-established interactive voice response (IVR) systems, which serve information to callers through a pre-programmed process. PhoNET goes far beyond IVR: it involves some of the most complex technologies of the century, such as speech recognition (SR), text-to-speech (TTS) conversion, and artificial intelligence (AI). This enables a user to stay connected to the Internet as long as they have access to a phone. PhoNET uses traditional HTML content, so websites need not be rewritten or redesigned.

We present a detailed yet simple analysis of how technologies such as SR, TTS, and AI are integrated to develop an intelligent platform (PhoNET) that achieves voice-based web access, which involves Document Processing and Document Rendering. In Document Processing we describe two approaches, telephone browsing and transcoding, focusing mostly on the former since that work is more mature. In Document Rendering we present the major problem, namely the relevance of cognitive thought to text rendering, along with its most suitable solution. Finally, we examine the challenges and further developments involved in the practical application of the proposed technology, PhoNET.


Today’s telecom business has seen recent growth, especially in bandwidth infrastructure for long distance (LD) and data. The industry is currently experiencing strong growth in the wireless segment as mobile devices prove to be very popular with both consumers and businesses. An evolving market segment is “Internet anywhere,” and many companies are trying different approaches to offer viable products for this market.

One approach is Internet access over wireless devices such as cell phones with a screen. However, this method has inherent limitations such as small screen size, lack of a keyboard, the need for a special device (web-enabled phone), the need to rewrite and maintain a special website, and severe bandwidth constraints using wireless data transfer protocols.

Another approach that is becoming popular is limited voice-based Internet access, which overcomes all but one of the limitations of wireless data devices: it still restricts access to the few sites that have been re-engineered for voice. Such services typically deliver content such as news, weather, horoscopes, and stock quotes over the phone. The companies offering them are called “Voice Portals.” Voice portals were the first web applications that tried to integrate websites with voice, which gave birth to the enterprise-based PBX systems.

Our solution, which presents a third option, gives users all of the benefits of voice portals while providing complete access to the entire Internet without limitation. With our Voice Internet technology, PhoNET, anyone can surf, search, send and receive email, and conduct e-commerce transactions by voice, from anywhere and from any phone, with more freedom of movement than a standard Internet browser, which requires a PC and an Internet connection.

PhoNET technology is faster and cheaper than existing alternatives. Today, only the largest companies are making their websites telephone-accessible, because existing technology requires a manual, costly, and time-consuming rewrite of each page. With the Voice Internet technology, PhoNET, existing web pages are used, allowing businesses to leverage their web investment. The software dynamically converts existing pages into audio format, significantly lowering the up-front investment a business must make to allow users to hear and interact with its website by phone.


The primary method of access today continues to be the computer, which has certain advantages as well as some limitations. Computers offer a visual Internet experience that is usually rich in content, but some basic computer skills and knowledge are needed to access the Internet. Computer-based access is also proving insufficient for the professional on the move. When in the car or away from the office or computer, accessing the Web is difficult, if not impossible. In addition, an increasing number of people prefer an interface that allows them to hear and speak rather than see and click or type.

Some existing Internet users have also identified problems with the visual Internet experience. Pages are increasingly full of graphics, advertisement banners, etc., which move, flash, and blink as they vie for attention. Some find this “information overload” annoying, and lament the delays it creates by severely taxing the available bandwidth.

The "Digital Divide"

While computers and their use are on the rise, they are not yet ubiquitous. A large segment of the population in the United States and other parts of the world still does not have access. Thus, the Internet is limited to only a small fraction of the world population; the majority is left out. This gap between those who can effectively use new information from the Internet and those who cannot is known as the digital divide. Bridging this divide is the key to ensuring that most people in the world are able to access the Internet. Making computers ubiquitous is not an attractive or feasible solution, at least in the near future, because of various barriers. One key barrier is cost, although the price of a computer has come down significantly in recent years. Voice-based access through an ordinary phone can bring these people onto the Internet as well, thus bridging the digital divide.

The "Language Divide"

Today more than eighty percent of website content is written in English. The exclusion of non-English speakers from the Internet because of this language barrier is called the "Language Divide".

As the need for alternative access to the Internet becomes more evident, several technology companies are pursuing solutions. Their products include “smart” cell phones with visual displays, intelligence built into the handset, and voice-activated websites. These products address different aspects of the problems outlined above.

Technology Overview

The idea of listening to the Internet may at first sound a bit like watching the radio. How does a visual medium rich in icons, text, and images translate itself into an audible format that is meaningful and pleasing to the ear? The answer lies in an innovative integration of three distinct technologies that render visual content into short, precise, easily navigable, and meaningful text that can be converted to audio.

The technologies employed to accomplish this feat are:

1. Speech recognition
2. Text-to-speech translation
3. Artificial Intelligence

They are applied in two steps: Document Processing and Document Rendering.

The PhoNET platform acts as an “Intelligent Agent” (IA) located between the user and the Internet (Figure 1). The IA automates the process of rendering information from the Internet to the user in an audio format that is meaningful, precise, easily navigable, and pleasant to listen to. Rendering is achieved by using Page Highlights (a method to find and speak the key contents of a page), finding only the relevant contents on a linked page, assembling those contents, and providing easy navigation.

These key steps are carried out using the information available in the visual web page itself, together with algorithms that use features such as text content, color, font size, links, paragraphs, and amount of text. Artificial Intelligence techniques are used in this automated rendering process. This is similar to how the human brain renders a visual page, selecting the information of interest. The IA also includes a language translation engine that dynamically translates web content from one language into another in real time.
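The Page Highlights idea can be illustrated with a minimal sketch. The source does not give PhoNET's actual algorithm, so the `Block` fields, the weights, and the 40-word cutoff below are all assumptions chosen for illustration; the sketch only shows how visual cues such as font size, emphasis, links, and amount of text might be combined into a prominence score.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One visual block of a page; all fields are illustrative assumptions."""
    text: str
    font_size: int = 12       # points
    is_link: bool = False
    emphasized: bool = False  # e.g. bold text or a contrasting color


def highlight_score(block: Block) -> float:
    """Heuristic prominence score: large, emphasized, short text scores highest."""
    words = len(block.text.split())
    if words == 0:
        return 0.0
    score = block.font_size / 12.0  # relative to an assumed 12pt body font
    if block.emphasized:
        score += 0.5
    if block.is_link:
        score += 0.25
    if words > 40:                  # long body text is not treated as a highlight
        score -= 1.0
    return score


def page_highlights(blocks, limit=3):
    """Return the text of the `limit` highest-scoring blocks, kept in page order."""
    ranked = sorted(range(len(blocks)),
                    key=lambda i: highlight_score(blocks[i]),
                    reverse=True)
    return [blocks[i].text for i in sorted(ranked[:limit])]
```

The selected highlights would then be handed to the text-to-speech engine, so the caller hears a short summary of the page before choosing where to navigate.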

The platform incorporates the highest quality speech recognition and text to speech engines from third party suppliers.


Document Processing

Document analysis is performed in the HTML parser, grammar generator, and Hyper Voice processor modules. The typical HTML web page is first parsed into a list of elements based mostly on the HTML tag structure. Some elements are aggregations (tables, for instance), but the element list is not a full parse tree, which we found was not needed and in some cases actually complicates processing. Images, tables, forms, and most text structure elements such as paragraphs are recognized and processed according to their type. Much of the effort in building a robust HTML processor goes into dealing with malformed HTML, such as unclosed tag scopes and overlapping tag scopes. Unfortunately, space does not allow for fully addressing this issue here. The location of each image is announced along with any associated caption.
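The flat element list described above can be sketched with Python's standard `html.parser` module, which tolerates unclosed and overlapping tags. This is not the Phone Browser parser; the element kinds (`image`, `link`, `text`, `table`, `form`) and the use of the `alt` attribute as the image caption are assumptions made for illustration.

```python
from html.parser import HTMLParser


class FlatElementParser(HTMLParser):
    """Collect a flat list of (kind, payload) elements instead of a parse tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img":
            # Assumption: the alt text stands in for the image caption.
            self.elements.append(("image", attrs.get("alt") or ""))
        elif tag == "a":
            self.elements.append(("link", attrs.get("href") or ""))
        elif tag in ("table", "form"):
            self.elements.append((tag, ""))

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.elements.append(("text", text))


def parse_flat(html):
    """Parse possibly malformed HTML into a flat element list."""
    parser = FlatElementParser()
    parser.feed(html)
    parser.close()
    return parser.elements
```

Because end tags are ignored entirely, an input such as `<p>Hello <b>world<p><a href="/news">News</b></a>` is processed without error, which is one simple way to sidestep unclosed and overlapping tag scopes.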

This feature can be disabled on a site-by-site basis when the user does not want to hear about images. Tables are first classified according to purpose, either layout or content. Most tables are actually used for page layout, which can be recognized by the variety and types of data contained in the table cells. Data tables are processed by a parser according to one of a set of table model formats that Phone Browser recognizes. This primarily provides a simple way of reading the table contents row by row, which is often not very satisfying. Alternatively, a transcoder can be used to reconstruct the table in sentential format. While large-vocabulary dictation speech systems are available, most require speaker training to achieve sufficiently high accuracy for most applications.
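A minimal sketch of the two table steps follows. The classification rule (uniform row width, more than one column, short cells) and the `max_cell_words` threshold are assumptions, not the table models Phone Browser actually uses; the second function shows one possible sentential reading that pairs each cell with its column header.

```python
def is_data_table(rows, max_cell_words=8):
    """Guess whether a table holds data (True) or is used for layout (False)."""
    if len(rows) < 2:
        return False
    widths = {len(row) for row in rows}
    if len(widths) != 1 or widths == {1}:
        return False  # ragged or single-column tables are treated as layout
    return all(len(cell.split()) <= max_cell_words
               for row in rows for cell in row)


def table_to_sentences(rows):
    """Read a data table row by row, pairing each value with its column header."""
    header, *body = rows
    sentences = []
    for row in body:
        parts = [f"{name} {value}" for name, value in zip(header[1:], row[1:])]
        sentences.append(f"{row[0]}: " + ", ".join(parts) + ".")
    return sentences
```

For a weather table with the header `City, High, Low`, the row `Boston, 71, 55` would be spoken as "Boston: High 71, Low 55." rather than as three unrelated cell values.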

Phone Browser is intended to be immediately usable without training, so dictation is not yet supported. This implies that creating arbitrary text for messaging is also not yet supported. One additional type of form input is an extension to HTML. A GSL (Grammar Specification Language) or JSGF (Java Speech Grammar Format) specification can be inserted into an HTML anchor using an attribute tag (currently LSPSGSL). Using this method, an application can specify an elaborate input grammar that allows many possible sentences to address the associated hyperlink, and can construct a GET-type form response in which the QUERY_STRING element is built by inserting the speech recognition text results. Grammar specifications written this way may represent many thousands of possible sentence inputs, giving the end user great speaking flexibility.
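The construction of a GET request from recognized speech can be sketched as follows. The dictionary below is a hypothetical stand-in for a grammar and does not use real GSL or JSGF syntax; the slot names, phrases, and the `/weather` link are likewise invented for illustration. Only `urllib.parse.urlencode` is a real library call.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for a GSL/JSGF grammar: each slot lists accepted phrases.
GRAMMAR = {
    "city": ["boston", "new york", "denver"],
    "day": ["today", "tomorrow"],
}


def match_slots(utterance, grammar):
    """Return {slot: phrase} for each slot whose phrase occurs in the utterance."""
    text = " " + " ".join(utterance.lower().split()) + " "
    found = {}
    for slot, phrases in grammar.items():
        # Try longer phrases first so multi-word phrases are not shadowed.
        for phrase in sorted(phrases, key=len, reverse=True):
            if f" {phrase} " in text:
                found[slot] = phrase
                break
    return found


def build_get_url(href, utterance, grammar=GRAMMAR):
    """Build the GET URL for an anchor from recognized text, or None if no match."""
    slots = match_slots(utterance, grammar)
    if not slots:
        return None
    return f"{href}?{urlencode(slots)}"
```

With this toy grammar, the utterance "What is the weather in New York tomorrow" addressed to the `/weather` link yields `/weather?city=new+york&day=tomorrow`, so many different spoken sentences map onto the same small set of query parameters.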

Conclusion and Future Scope

We considered the possibility of accessing the web through an ordinary phone. We presented a new technology that provides a true audio Internet experience. Using an ordinary telephone and simple voice commands, users will be able to surf and hear the entire Internet for the information they desire. A computer is not needed. Any web page will be accessible, not only sites written for the Wireless Application Protocol (WAP) or pages specially written in Voice Extensible Markup Language (VXML). We presented a detailed analysis of how technologies such as SR, TTS, and AI are integrated to develop an intelligent platform (PhoNET) that achieves voice-based web access. We also presented the major problems involved in document processing and document rendering, along with their solutions.







