Elearning Tech

Wednesday, 30 June 2010

Digital Signal Processor and Text-to-Speech

Posted on 06:34 by Unknown

This is the second post in a series on Text-to-Speech for eLearning written by Dr. Joel Harband and edited by me (which turns out to be a great way to learn). The first post, Text-to-Speech Overview and NLP Quality, introduced the text-to-speech voice and discussed quality issues related to its first component: the natural language processor (NLP). In this post we’ll look at the second component of a text-to-speech voice, the digital signal processor (DSP), and its measures of quality.

Digital Signal Processor (DSP)

The digital signal processor converts the phonetic specification of the text produced by the NLP into audible speech. Its main challenge is to produce a voice that is both intelligible and natural. Two methods are used:

  • Formant Synthesis. Formant synthesis models the human voice with computer-generated sounds based on an acoustic model. Typically, this method produces intelligible, but not very natural, speech. These are the robotic voices, like MS Mike, that people often associate with text-to-speech. Although not acceptable for eLearning, these voices are small, fast programs, so they are used in embedded systems and in applications where naturalness is not required, such as toys and assistive technology.
  • Concatenative Synthesis. Concatenative synthesis is what gives voices like Paul and Heather their remarkable naturalness. A recording of a real human voice is broken down into acoustic units (phonemes, syllables, words, phrases and sentences), which are stored in a database. The processor retrieves acoustic units from the database in real time and connects (concatenates) them to best match the input text.
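To make the retrieve-and-concatenate idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `UNIT_DB` dictionary, the `synthesize` function and the tiny lists of numbers standing in for audio samples are hypothetical, and real TTS engines store actual waveforms and use far more sophisticated selection.

```python
# Toy unit database: each key is a recorded unit (a word or a phrase) and each
# value stands in for its audio samples. Hypothetical data, not real audio.
UNIT_DB = {
    "bright": [0.1, 0.3, 0.2],
    "eyes": [0.4, 0.1],
    "bright eyes": [0.1, 0.3, 0.25, 0.4, 0.1],
}


def synthesize(units, db=UNIT_DB):
    """Look up each unit in the database and join the samples end to end."""
    samples = []
    for unit in units:
        samples.extend(db[unit])
    return samples


# Two separate word units: there is one join, between "bright" and "eyes".
two_units = synthesize(["bright", "eyes"])

# One phrase unit recorded as a whole: there is no join at all.
one_unit = synthesize(["bright eyes"])
```

Both calls produce audio for the same text, but only the first contains a boundary where two separately recorded units meet.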

Concatenative Synthesis and Quality

When you think about how concatenative synthesis works, joining together a lot of smaller sounds to form the voice, it suggests where glitches can occur. A glitch occurs either because the database has no recording of exactly the sound that is needed, or because two segments don’t come together quite right where they are joined. The main strategy is to choose database segments that are as long as possible (phrases and even sentences) to minimize the number of joins, and therefore the number of connection glitches.
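The "longest segments, fewest joins" strategy can be sketched as a small search problem. The example below is only an illustration under stated assumptions: the function name `fewest_units` and the `inventory` of recorded phrases are hypothetical, and it counts joins only, whereas real engines also score how well neighboring units match acoustically.

```python
def fewest_units(words, inventory):
    """Split `words` into the fewest phrases that all appear in `inventory`.

    Returns the list of phrases, or None if the words cannot be covered.
    """
    n = len(words)
    # best[i] holds the shortest segmentation found so far for words[:i].
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            phrase = " ".join(words[start:end])
            if best[start] is not None and phrase in inventory:
                candidate = best[start] + [phrase]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical inventory of recorded units.
inventory = {"her", "bright", "eyes", "her bright", "bright eyes"}

plan = fewest_units("her bright eyes".split(), inventory)
joins = len(plan) - 1  # each boundary between units is a potential glitch
```

With this inventory the text is covered by two units instead of three single words, so there is one join instead of two.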

Here is an example of a glitch in Paul when joining the two words “bright” and “eyes”. (It wasn’t easy to find a glitch in Paul; I finally found one in a Shakespeare sonnet!)

  • Mike - bright eyes
  • Heather - bright eyes
  • Paul - bright eyes

The output from the best concatenative systems is often indistinguishable from real human voices. Maximum naturalness typically requires a very large speech database: in general, the larger the database, the higher the quality. TTS voice databases that are acceptable for eLearning are typically on the order of 100-200 MB. For lower-fidelity applications like telephony, the acoustic unit files can be made smaller by using a lower sampling rate without sacrificing intelligibility and naturalness, which yields a smaller database (a smaller footprint).
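As a rough illustration of why a lower sampling rate shrinks the footprint, the arithmetic below assumes uncompressed 16-bit mono audio and one hour of recorded speech. Both are assumptions for the example, not vendor figures, and actual voice databases may use a different encoding. The function name `pcm_size_mb` is hypothetical.

```python
def pcm_size_mb(seconds, sample_rate_hz, bytes_per_sample=2):
    """Size of uncompressed mono PCM audio in megabytes (1 MB = 1,000,000 bytes)."""
    return seconds * sample_rate_hz * bytes_per_sample / 1_000_000


one_hour = 3600  # seconds of recorded speech, an assumed amount

size_16k = pcm_size_mb(one_hour, 16_000)  # multimedia-style sampling rate
size_8k = pcm_size_mb(one_hour, 8_000)    # telephony-style sampling rate

print(f"16 kHz: {size_16k:.1f} MB, 8 kHz: {size_8k:.1f} MB")
```

Under these assumptions, halving the sampling rate halves the storage needed for the same recorded material.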

By the way, the database is only used to generate the sounds, which are then stored as .wav, .mp3, etc. It is not shipped with the eLearning piece itself. So a large database is generally a good thing.

Here is a list of the TTS voices offered by NeoSpeech, Acapela and Nuance with their file sizes and sampling rates.

| Voice | Vendor | Sampling rate (kHz) | File size (MB) | Applications |
|---|---|---|---|---|
| Paul | NeoSpeech | 8 | 270 (Max DB) | Telephone |
| Paul | NeoSpeech | 16 | 64 | Multi-media |
| Paul | NeoSpeech | 16 | 490 (Max DB) | Multi-media |
| Kate | NeoSpeech | 8 | 340 (Max DB) | Telephone |
| Kate | NeoSpeech | 16 | 64 | Multi-media |
| Kate | NeoSpeech | 16 | 610 (Max DB) | Multi-media |
| Heather | Acapela | 22 | 110 | Multi-media |
| Ryan | Acapela | 22 | 132 | Multi-media |
| Samantha | Nuance | 22 | 48 | Multi-media |
| Jill | Nuance | 22 | 39 | Multi-media |

The file size depends on both the sampling rate and the database size, where the database size is related to the number of acoustic units stored. For example, the second and third voices in the list (both Paul at 16 kHz) have the same sampling rate, but the third has a much bigger file size because of its larger database. In general, the higher sampling rates are used for multimedia applications and the lower sampling rates for telecommunications. Larger sizes often also indicate a higher price point.
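One rough way to separate the two factors is to divide file size by sampling rate, which gives a crude proxy for how much recorded material a voice holds. This is only a sketch: it assumes all vendors use comparable encodings, which the figures above do not confirm, and the helper name `mb_per_khz` is hypothetical. The numbers are copied from a few entries in the list above.

```python
# (voice, vendor, sampling rate in kHz, file size in MB), taken from the list above.
VOICES = [
    ("Paul", "NeoSpeech", 8, 270),
    ("Paul", "NeoSpeech", 16, 64),
    ("Paul", "NeoSpeech", 16, 490),
    ("Heather", "Acapela", 22, 110),
    ("Samantha", "Nuance", 22, 48),
]


def mb_per_khz(rate_khz, size_mb):
    """Crude proxy for stored material: file size normalized by sampling rate."""
    return size_mb / rate_khz


for voice, vendor, rate, size in VOICES:
    ratio = mb_per_khz(rate, size)
    print(f"{voice:<9}{vendor:<10}{rate:>3} kHz {size:>4} MB {ratio:6.1f} MB/kHz")
```

For the two 16 kHz Paul voices the ratio differs by a large factor, which is consistent with the difference coming from database size rather than sampling rate.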

DSP voice quality is then a combination of two factors: the sampling rate, which determines the voice fidelity, and the database size, which determines the quality of concatenation and the frequency of glitches. The more acoustic units stored in the database, the better the chances of achieving a perfect concatenation without glitches.

And don’t forget to factor in Text-to-Speech NLP Quality. Together with DSP quality, it determines the overall quality of different Text-to-Speech solutions.
