AI Text-to-Speech Tools Advance at Breakneck Speed, Enhancing Training Development

It’s remarkable how rapidly AI-powered text-to-speech technology has advanced over the past six months.

The technology has reached the point where it can render a written script as speech that sounds almost natural, making it difficult for some listeners to discern the difference. Some high-end models can accurately generate speech in numerous languages while also allowing control over attributes such as intonation, inflection, intensity, and pauses. My Chinese colleagues have even confirmed the quality of a Chinese-language training video I produced with a text-to-speech tool that doesn’t support attribute control. Moreover, multiple languages can be combined seamlessly within a single script, transitioning from spoken Chinese to spoken English and back again.
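For a sense of what that attribute control looks like in practice, most of these tools accept SSML (Speech Synthesis Markup Language) in place of plain text; Amazon Polly and Google Cloud Text-to-Speech both support it. Below is a minimal sketch using Polly via the boto3 SDK. The voice ID, prosody values, and output file name are illustrative assumptions, and how naturally a given voice renders an embedded second language varies by engine and voice.

```python
import boto3  # AWS SDK for Python; assumes AWS credentials are already configured

# SSML script: a pause, a slowed and softened phrase, and an embedded
# Mandarin passage inside an otherwise English script.
ssml = """
<speak>
    Welcome to the training module.
    <break time="500ms"/>
    <prosody rate="90%" volume="soft">Please listen carefully.</prosody>
    <lang xml:lang="cmn-CN">欢迎参加培训。</lang>
    Now, back to English.
</speak>
"""

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",     # interpret the markup rather than reading it aloud
    OutputFormat="mp3",
    VoiceId="Joanna",    # illustrative choice; any SSML-capable voice works
)

with open("narration.mp3", "wb") as f:  # hypothetical output file
    f.write(response["AudioStream"].read())
```

The `<break>`, `<prosody>`, and `<lang>` tags map directly to the pauses, intensity, and language switching described above.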

These tools will be game-changing for training developers who produce content in multiple languages: a single text-based script can be written in one language, translated into others, and then rendered by the AI text-to-speech tool. Below are two samples I produced from the same script, one in English and the other in Chinese. (Note: these videos were produced as tests and have not been cleaned up for actual rollout to learners.)

English Language Version: [video: English Language Test]

Chinese Language Version: [video: Chinese Version Test]

In addition to the tool used to generate these examples, I have also explored several others, including Amazon Polly, Speechify, Descript, and Google Cloud Text-to-Speech.

In my evaluation, text-to-speech technology is suitable for internal training when audio-based content is required and resources for hiring a professional voiceover artist are limited. Customer-facing training calls for more caution. Text-to-speech may be appropriate for specific customer training use cases, such as detailed training on products, APIs, integrations, and services, but for initial customer training modules, where relationship building is a key focus of the customer journey, it may be better to invest in professional voice-over talent.

Despite the impressive advances of the last six months, there is still room for improvement. Some listeners can still distinguish AI-generated speech from natural speech, particularly in more complex sentences. At the current pace of development, however, it is only a matter of time before AI text-to-speech tools become virtually indistinguishable from human speech. Learners are also being exposed to AI-rendered speech more and more often through content posted to social media platforms, which will make the technology feel increasingly familiar.

The recent advancements in AI text-to-speech technology are nothing short of remarkable. They have the potential to transform language training, enhance accessibility for individuals with disabilities, and facilitate cross-cultural communication. As technology continues to evolve, we can only anticipate further breakthroughs in the field of AI-generated speech.

A Little More on the Technology

Here’s a little more background on how AI-powered text-to-speech models work. These cutting-edge tools are revolutionizing the way we interact with technology, making it easier to communicate and learn in multiple languages.

At its core, an AI text-to-speech model is a machine learning algorithm trained on large datasets of recorded audio and corresponding text transcripts. These datasets, known as “corpora,” supply the paired examples that serve as the model’s training data.
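As a concrete illustration, such a corpus is often distributed as a folder of audio clips plus a manifest pairing each clip with its transcript. The sketch below shows a hypothetical JSON-lines manifest; the file names and field names are made up for illustration.

```python
import json

# Hypothetical manifest: one JSON object per line, pairing an audio
# clip with its transcript. Real corpora contain thousands of these.
manifest_lines = [
    '{"audio": "clips/0001.wav", "text": "Welcome to the training module."}',
    '{"audio": "clips/0002.wav", "text": "Please listen carefully."}',
]

corpus = [json.loads(line) for line in manifest_lines]
for example in corpus:
    print(example["audio"], "->", example["text"])
```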

The model learns the patterns and relationships between the text and the corresponding speech through a process called “supervised learning”: it is trained on thousands of paired examples of text and speech, gradually learning to associate each word or phrase with the appropriate speech sounds.
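To make that training loop concrete, here is a deliberately tiny sketch in PyTorch that maps character sequences to acoustic-feature frames (a stand-in for a mel spectrogram). It is a toy under strong assumptions: real systems must learn the alignment between text length and audio length, which this example sidesteps by predicting exactly one frame per character, and the target here is random noise rather than real audio features.

```python
import torch
import torch.nn as nn

# Toy "TTS" model: embed characters, run a recurrent layer, and predict
# one 80-dimensional acoustic frame per character.
VOCAB = "abcdefghijklmnopqrstuvwxyz ."
char_to_id = {c: i for i, c in enumerate(VOCAB)}

class ToyTTS(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), 64)
        self.rnn = nn.GRU(64, 128, batch_first=True)
        self.out = nn.Linear(128, n_mels)

    def forward(self, char_ids):
        x = self.embed(char_ids)   # (batch, time, 64)
        h, _ = self.rnn(x)         # (batch, time, 128)
        return self.out(h)         # (batch, time, n_mels)

model = ToyTTS()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One fake supervised pair: a sentence and random "spectrogram" targets.
text = "hello world."
ids = torch.tensor([[char_to_id[c] for c in text]])
target = torch.randn(1, ids.shape[1], 80)  # stand-in for real audio features

for step in range(100):
    pred = model(ids)
    loss = loss_fn(pred, target)  # distance between predicted and true frames
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```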

To achieve this, the model is divided into two main components: the text processing unit and the speech synthesis unit. The text processing unit is responsible for processing the input text and extracting the relevant linguistic features such as phonemes, intonation, and rhythm. The speech synthesis unit takes these linguistic features and generates a corresponding speech signal.
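In code, that division of labor often looks like the structural sketch below: a front end that turns text into phonemes and a back end that turns phonemes into an audio signal. The grapheme-to-phoneme table and the synthesis step are placeholders of my own, not a real engine.

```python
# Hypothetical grapheme-to-phoneme table; a real front end uses a full
# pronunciation lexicon plus rules for unseen words.
G2P = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def text_processing_unit(text: str) -> list[str]:
    """Front end: normalize the input text and extract linguistic features."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(G2P.get(word.strip(".,!?"), ["<unk>"]))
    return phonemes

def speech_synthesis_unit(phonemes: list[str]) -> bytes:
    """Back end: map linguistic features to a speech signal.
    Here we return silence of a plausible length as a placeholder."""
    ms_per_phoneme = 120                      # rough, made-up duration
    samples_per_phoneme = int(16000 * ms_per_phoneme / 1000)
    return b"\x00\x00" * (samples_per_phoneme * len(phonemes))  # 16-bit, 16 kHz

phonemes = text_processing_unit("Hello world.")
audio = speech_synthesis_unit(phonemes)
print(phonemes, f"-> {len(audio)} bytes of placeholder audio")
```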

The model’s ability to accurately generate speech is heavily dependent on the quality and diversity of the training data. The more varied the training data, the more accurate and diverse the speech output will be. For this reason, researchers often employ domain-specific corpora, such as medical or legal transcripts, to train the model for specific applications.

If you like this, please share.

More: An Intro to SaaS Architecture