Everything You Need to Know About Text-to-Speech

Modern localization is like an intricate mosaic – there are thousands of different pieces that need to be positioned just right in order to create the final picture. Today, we’re going to look at a topic that’s becoming increasingly popular among our customers: text-to-speech technology.

What is text-to-speech?

Text-to-speech (TTS) is the name given to the process of artificially generating spoken language based on text input. In their early form, these “computer voices” were mainly used to make written language more accessible to people with visual impairments and learning difficulties.

How does the technology work?

  • First, the selected text is analyzed and phonetically interpreted – a process known as Natural Language Processing (NLP). To do this, the system breaks the entire chain of characters down into individual “units” or sounds, before processing them according to a set of underlying rules. Contextual analysis is then used to resolve ambiguities in how words and sentences should be pronounced and stressed: Does the abbreviation “m” stand for “meters” or “million”? Is the verb “read” being used in the present tense (to rhyme with “need”) or the simple past (to rhyme with “bed”)? This is important information that spelling alone cannot provide.
  • Once the NLP phase is complete, sophisticated algorithms join the individual phonetic units together to create a fluent stream of speech. The system precisely calculates the correct pronunciation, stress and intonation (prosody) required to produce a natural-sounding sequence of sounds, and its understanding of the syntactical structure of a sentence enables it to recognize when a constituent needs to be highlighted as important or new.
  • Finally, the processed data are sent to a digital signal processing module, which then generates the acoustic language signal.
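
To make the first step more concrete, here is a minimal sketch of the kind of contextual disambiguation described above. It is a toy illustration only: the function names, the cue lists and the currency table are assumptions invented for this example, and real TTS front ends rely on far richer linguistic models than a handful of rules.

```python
import re

# Toy lookup table (an assumption for this sketch, not a real engine's data).
CURRENCY_WORDS = {"$": "dollars", "€": "euros", "£": "pounds"}

# A number followed by "m", optionally preceded by a currency symbol.
M_PATTERN = re.compile(r"(?P<currency>[$€£])?(?P<number>\d+(?:\.\d+)?)\s?m\b")

# Words that hint at the past tense of "read" (a deliberately tiny list).
PAST_CUES = {"yesterday", "already", "had", "has", "have"}


def expand_m(text: str) -> str:
    """Expand '<number> m' to 'meters', or to 'million' after a currency symbol."""

    def _replace(match: re.Match) -> str:
        number = match.group("number")
        currency = match.group("currency")
        if currency:
            return f"{number} million {CURRENCY_WORDS[currency]}"
        return f"{number} meters"

    return M_PATTERN.sub(_replace, text)


def pronounce_read(sentence: str) -> str:
    """Return 'reed' (present tense) or 'red' (past tense) based on context cues."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return "red" if words & PAST_CUES else "reed"


print(expand_m("The tower is 300 m tall."))
print(expand_m("Revenue rose to $5m last year."))
print(pronounce_read("I read the report yesterday."))
```

The point of the sketch is simply that the same spelling leads to different spoken output depending on its surroundings, which is why the analysis has to happen before any audio is generated.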

Text-to-speech … you mean those monotonous robot voices?

Not anymore! The days when synthetic voices still sounded reassuringly “robotic” and were generally of quite low quality have long since passed. While you can often still tell when a voice is artificial in origin, the use of neural networks means that the phonemes, syllables and words that make up the language generated by these models are now much more human in tone, with a natural cadence. Users find these voices much more friendly and expressive than their predecessors.

Fine! But what does all this have to do with translation?

Multimedia communication plays an increasingly influential role in modern society. This represents a huge challenge for many businesses when it comes to communicating with their customers and consumers – not least because of the high cost and complexity of professional audio production.

Text-to-speech offers businesses a new way of reaching potential customers in a wide range of global markets, allowing them to add high-quality language output to a greater variety of multimedia content and make additional content accessible to all.

So where exactly can I use text-to-speech? And where is it best avoided?

Generally speaking, this technology can be used for most audiovisual content of an informative nature. Of course, if you’re making an expensive marketing video or publishing sensitive internal communications, we would still recommend using professional voice artists. But text-to-speech can be a valuable aid in many other scenarios – from training videos and software demos to in-house safety briefings. TTS-based voice user interfaces have been shown to improve learning outcomes in e-learning, for example.

What commercial benefits can text-to-speech offer my business?

  • Free up your media budget

    Traditional voiceovers come with high personnel and resource costs – from the casting of professional voice artists to booking sound engineers, mastering, and all the corrections that need to be made once the initial recording is complete. The huge reduction in production costs is one of the key reasons why many businesses are switching to text-to-speech.

  • Simplify your planning process

    Voice artists can’t usually be booked at short notice, and often won’t be available for the entire run of a long project. TTS technology enables you to create multimedia content in a fraction of the time, with low personnel costs and technical workloads. The voices in a text-to-speech engine are available whenever you need them, so you can use one consistent voice for the entirety of your project.

  • Perfect for content updates

    Making corrections to scripts after the initial recording can be costly and complex, especially when you need to book your voice artist and studio time again. In the long run, many businesses find it impossible to justify the resulting expense and workload. With text-to-speech, you can keep tweaking your spoken texts as often as you like, even if it’s been months since you first produced them.

So basically, I can produce perfect spoken texts at the push of a button?

Well of course, it isn’t quite that simple. It’s not unusual for common TTS engines to put completely the wrong stress on words, especially if they are neologisms or foreign words. This can cause problems when the word in question is an important part of your corporate language that needs to be pronounced correctly – be it something basic like your company name or an important piece of specialist terminology.

Is there a silver bullet for dealing with quality issues?

If you want your multimedia content to have an all-round professional sound, you will need specialists to run quality checks on your audio output and correct the stress and flow of your spoken text. However, one of the benefits of TTS voices is that they can be adapted entirely to suit your needs. Many services offer you hundreds of voice profiles and dozens of languages to choose from, and allow you to create flexible presets for parameters such as volume, pitch, speed, and the pronunciation of abbreviations, dates and times using Speech Synthesis Markup Language (SSML).
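
As a rough illustration of what such SSML presets can look like, the sketch below assembles a small SSML snippet in Python. The helper functions (`sub`, `say_as`, `speak`) are hypothetical names made up for this example. The elements themselves (`speak`, `prosody`, `sub`, `say-as`) come from the W3C SSML specification, but the supported `interpret-as` values and the attributes required on the root element vary from one TTS service to another, so treat this as a starting point rather than a finished template.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(text: str, alias: str) -> str:
    """Read `text` aloud as `alias`, e.g. to spell out an abbreviation."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def say_as(text: str, interpret_as: str) -> str:
    """Hint at how to read `text` (accepted values depend on the engine)."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def speak(parts, rate="medium", pitch="medium", volume="medium") -> str:
    """Wrap already-escaped fragments in a prosody preset and a speak root."""
    body = "".join(parts)
    return (
        "<speak>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} "
        f"volume={quoteattr(volume)}>"
        f"{body}"
        "</prosody></speak>"
    )


ssml = speak(
    [
        escape("Welcome to the "),
        sub("TTS", "text to speech"),
        escape(" demo, recorded on "),
        say_as("2024-03-01", "date"),
        escape("."),
    ],
    rate="slow",
)
print(ssml)
```

Keeping the preset (speed, pitch, volume) separate from the text fragments means the same script can be re-rendered with a different voice profile without touching the wording.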

Have we piqued your curiosity?

Great! Follow the links below to find out more about the most popular text-to-speech systems:

Milengo also has its own TTS solution, which has been developed with innovative quality assurance.

Johannes Rahm

A seasoned translator, copywriter and multilingual SEO expert with over a decade of experience. Johannes specializes in high-value B2B marketing content for the DACH market, serving leading companies in the software, IT, and e-learning industries. As an avid reader of science-fiction literature, he still regards human language as our most mind-blowing technology and loves to explore its power to engage, inspire and connect people and organizations.