Sunday, February 1, 2009

TTS (Text-To-Speech)

Ram is deaf and unable to speak. This has not, however, stopped him from holding meaningful conversations with his colleagues. We are not talking about the wonders of sign language here. Instead, we are referring to an application of Text-to-Speech technology: Ram types in text and his colleagues hear it read out. This is not a stray example; the need for TTS is felt in everyday life too. Some interesting applications are:

Telecommunication services :
It is possible to access textual information over the telephone. This information could be your e-mail being read out to you, which is called Integrated Messaging. The texts might range from simple messages to huge databases that cannot practically be recorded and stored as digitized speech. Other services include Telephone Relay Service, which allows you to have a conversation with hearing-impaired persons, and Automated Caller Name and Address, which is a computerized version of a reverse directory.

Education :
TTS combined with a computer-aided learning system can be used to learn a new language, with correct pronunciation and sentence formation. Students with a speech disability can, with the help of specially designed keyboards and quick sentence-assembling programs, largely overcome their handicap. Stephen Hawking, a leading astrophysicist, uses such a system to give his lectures. The visually impaired benefit too, from TTS systems coupled with OCR (Optical Character Recognition) that give them access to written text. TTS is also used in talking books, which are available online.

Vocal Monitoring :
Oral information is often more effective than written messages: it is more appealing and leaves the eyes free for other visual information at the same time. It is also preferred on factory floors and in industrial settings where workers cannot read information while working. TTS is highly successful in measurement and control systems and is widely used in security systems too.

Man-Machine communications :
TTS is obviously the link for the ultimate communication between man and machine.

Fundamental and Applied Research :
TTS systems are considered the best guinea pigs for linguistic research since repeated experiments produce the same result which is hardly the case with human subjects. This characteristic makes TTS systems very popular with phoneticians.

ORIGIN

Text-to-Speech systems have their roots in speech synthesis, research into which dates as far back as 1939 and so, ironically, predates the electronic computer. The device called the 'Voder', created by Dudley at Bell Laboratories, was an analog speech synthesis system. In the early 1970s, Text-to-Speech systems were created as a natural progression of concatenative synthesis by Joseph Olive. Today, however, there exist many more methods for creating Text-to-Speech systems.

INSIDE TTS

What is TTS?

Very simply, a TTS synthesizer is a computer-based system which can read aloud any text dynamically, whether this text is entered by the user or by another system. Putting it differently, it is a system which can read aloud any sentence entered, even one the system is seeing for the first time. This makes TTS systems fundamentally different from the talking machines on the market, which basically act as record players. These machines are capable of concatenating isolated recorded words and playing them out loud. They are also called Voice Response Systems and have only a limited vocabulary; examples are announcements at railway stations, reservation enquiry systems, etc. When we talk about TTS, we are talking about just about anything being read out. It is impossible to record every word of a language, hence Text to Speech is often defined as 'automatic production of speech, through a grapheme-to-phoneme transcription of the sentences to utter'.
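To make the contrast concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the recorded-word table, the file names and the letter-to-sound rules are made up, and real grapheme-to-phoneme modules are far more elaborate. It only shows why a fixed vocabulary fails on a new word while a rule-based transcription does not.

```python
# Hypothetical recordings of a voice response system (limited vocabulary).
RECORDED_WORDS = {"train": "train.wav", "is": "is.wav", "late": "late.wav"}


def voice_response(sentence):
    """Return the recordings to play; fail on any word that was never recorded."""
    clips = []
    for word in sentence.lower().split():
        if word not in RECORDED_WORDS:
            raise KeyError(f"no recording for {word!r}")
        clips.append(RECORDED_WORDS[word])
    return clips


# Toy letter-to-sound rules (assumed, not a real phoneme set).
DIGRAPHS = {"sh": "SH", "ch": "CH", "th": "TH"}
VOWELS = {"a": "AE", "e": "EH", "i": "IH", "o": "AA", "u": "AH"}


def grapheme_to_phoneme(word):
    """Transcribe any word into phoneme symbols, trying two-letter rules first."""
    word = word.lower()
    phonemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.append(DIGRAPHS[pair])
            i += 2
        else:
            ch = word[i]
            if ch.isalpha():
                # Consonants without a rule simply map to their upper-case letter.
                phonemes.append(VOWELS.get(ch, ch.upper()))
            i += 1
    return phonemes
```

With these toy tables, voice_response("ship is late") raises an error because "ship" was never recorded, whereas grapheme_to_phoneme("ship") still produces a transcription for the synthesizer to work with.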

On the face of it, speech synthesis appears to be an achievable task, since humans learn to speak at a very early age and the mechanics of learning seem very simple, such as learning how to pronounce vowels and the sounds of all the letters of the alphabet. This is far from the truth, though. The vocal sounds we produce while speaking are a result of lung pressure, glottis tension, and the configuration of the nasal and vocal tracts, which never remain the same and keep evolving. All these factors are in turn controlled by the cortex, which uses them to transmit the meaning of the sentence.

Today, reproducing all of this scientifically is possibly conceivable using neural networks, speech synthesis and semantic analysis, but such a machine would be highly complex and economically out of reach of the common man. So current systems do not exhibit any feelings or, should we say, they sacrifice naturalness for simplicity and affordability.

A block diagram of a TTS synthesizer is displayed below:


When we talk about TTS today, we are referring to Model based TTS systems, which try to imitate the human sound production system. Such a system has two main components:
Natural Language Processing or NLP : It is responsible for producing the phonetic form of the text read, or simply how the text would sound, coupled with the intonation and rhythm that are together called 'prosody'.

Digital Signal Processing or DSP : DSP has the onus of converting the symbolic information, comprising the phonemes and the prosody, into speech using algorithms and computations. The entire procedure is extremely memory hungry, and to reduce the memory expense some procedures are simplified or skipped. This reduces the naturalness of the voice created while making the system more efficient, with lower memory requirements.
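The split between the two components can be sketched in a few lines of Python. This is only a toy under loud assumptions: the function names nlp_stage and dsp_stage, the sample rate, the durations and the falling-pitch rule are all invented for illustration, letters stand in for phonemes, and plain sine tones stand in for speech. The point is the interface: the first stage emits symbols plus prosody, and the second stage turns that symbolic description into samples.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value)


def nlp_stage(text):
    """Return (symbol, duration_ms, pitch_hz) triples: stand-ins for phonemes plus prosody."""
    words = text.split()
    units = []
    for w_index, word in enumerate(words):
        # Crude prosody assumption: pitch falls from 200 Hz to 150 Hz across the sentence.
        pitch = 200.0 - 50.0 * w_index / max(len(words) - 1, 1)
        for ch in word.lower():
            if ch.isalpha():
                units.append((ch, 80, pitch))
        units.append(("_", 50, 0.0))  # short pause after each word
    return units


def dsp_stage(units):
    """Turn the symbolic units into a list of samples in the range [-1, 1]."""
    samples = []
    for _symbol, duration_ms, pitch in units:
        n = duration_ms * SAMPLE_RATE // 1000
        for k in range(n):
            if pitch == 0.0:
                samples.append(0.0)  # silence for pauses
            else:
                samples.append(math.sin(2 * math.pi * pitch * k / SAMPLE_RATE))
    return samples


waveform = dsp_stage(nlp_stage("hi there"))
```

Note that the DSP stage never sees the original text, only the symbols and prosody values handed over by the NLP stage.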

Text-To-Speech Synthesis Approaches

Text-to-speech (TTS) technology has traditionally been classified into two main categories and a third that is a hybrid of the first two:

Concatenated TTS: This approach uses concatenated recordings of the human voice from a library or database. The text to be read is analyzed, the recordings are pieced together and the sentence is created. Earlier systems used complete words and phrases for concatenation; recent systems use smaller basic units like syllables, diphones etc.

Advantages:

Since this system uses human voices, it sounds more natural and less synthetic or mechanical.

Disadvantages:

Concatenated systems often suffer in quality because concatenation sacrifices rhythm. To improve quality, modern systems use smaller units of sound for concatenation, which means each diphone (or whatever smallest unit is used) needs to be recorded separately for each intonation to provide naturalness. High-quality concatenated systems are very resource hungry and hence poorly suited to desktop applications; they are usually used for server-side applications only. These systems are also not very flexible: in order to create a new voice, the entire database for the voice needs to be created again.
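A rough Python sketch of diphone concatenation shows where the resource hunger comes from. The unit naming scheme, the silence marker "_" and the database layout are assumptions made for illustration, not the format of any real system, and inventory_size is only an upper bound under the assumption that every ordered pair of sounds is recorded once per intonation.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor, padding both ends with silence ('_')."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def concatenate(diphones, database):
    """Join the recorded units in order; an unrecorded unit is an error."""
    missing = [d for d in diphones if d not in database]
    if missing:
        raise KeyError(f"units not recorded: {missing}")
    samples = []
    for d in diphones:
        samples.extend(database[d])
    return samples


def inventory_size(num_phonemes, num_intonations):
    """Upper bound on units to record: every ordered pair (plus silence) per intonation."""
    return (num_phonemes + 1) ** 2 * num_intonations
```

For example, to_diphones(["K", "AE", "T"]) gives the units "_-K", "K-AE", "AE-T" and "T-_". Because the inventory grows with the square of the number of sounds and again with every intonation, the whole set has to be recorded afresh for each new voice.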

Model based TTS : This system mimics the human speech production model. The text is read, and each word is analyzed for its phonetic pronunciation and passed on to algorithms which are responsible for producing the sound. When one refers to TTS, it u
