Google claims its machine can now ‘speak like a human’

  • The Tacotron 2 system can generate natural-sounding speech from text
  • It was trained on human speech and text transcripts, the Google researchers say
  • The team says human listeners rated it comparable to professional recordings

By Cheyenne Macdonald For Dailymail.com

Published: 18:16 EST, 28 December 2017 | Updated: 18:17 EST, 28 December 2017

Google has revealed a new text-to-speech system that could soon allow AI voice assistants to sound far more natural.

The tool, called Tacotron 2, was trained on examples of human dialogue and text transcripts to generate more realistic speech.

A demonstration of the system shows how it can smoothly read different texts aloud without skipping a beat, including the ‘Peter Piper’ tongue-twister, though it is still tripped up by some difficult words.

Google has revealed a new text-to-speech system that could soon allow AI voice assistants to sound far more natural. The tool, called Tacotron 2, was trained on examples of human dialogue and text transcripts to generate more realistic speech. Stock image

HOW IT WORKS

Tacotron 2 uses what’s known as a sequence-to-sequence model, which maps letters to features that encode the audio.

The process incorporates pronunciation, volume, speed, and intonation, the researchers explain.

Then, the features are converted to a 24 kHz waveform.

The tool was trained on examples of human dialogue and text transcripts to generate more realistic speech.
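For readers curious about the mechanics, the pipeline described above can be sketched in miniature. The code below is purely illustrative: random numbers stand in for the trained sequence-to-sequence model and vocoder, and the frame size and feature count are assumed values, not Google's.

```python
# Toy sketch of the Tacotron 2-style pipeline: characters -> learned
# features -> a 24 kHz waveform. Random numbers stand in for the
# trained neural networks; all constants are illustrative assumptions.
import numpy as np

SAMPLE_RATE = 24_000     # Tacotron 2 outputs a 24 kHz waveform
FRAME_SAMPLES = 300      # hypothetical 12.5 ms hop at 24 kHz
N_FEATURES = 80          # hypothetical feature bands per frame

rng = np.random.default_rng(0)

def text_to_features(text: str) -> np.ndarray:
    """Stand-in for the seq2seq model: one feature frame per character.

    (A trained model would predict many frames per character and encode
    pronunciation, volume, speed, and intonation in them.)
    """
    return rng.standard_normal((len(text), N_FEATURES))

def features_to_waveform(features: np.ndarray) -> np.ndarray:
    """Stand-in for the vocoder: expand each frame into audio samples."""
    t = np.arange(FRAME_SAMPLES) / SAMPLE_RATE
    # Use each frame's mean as the amplitude of a short sine burst.
    bursts = [np.sin(2 * np.pi * 220 * t) * f.mean() for f in features]
    return np.concatenate(bursts)

wave = features_to_waveform(text_to_features("Peter Piper picked a peck"))
print(len(wave) / SAMPLE_RATE)  # duration in seconds -> 0.3125
```

The real system replaces both stand-in functions with trained networks, but the shape of the computation is the same: text in, intermediate audio features, waveform out.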

In a new blog post, the Google researchers explain that the latest text-to-speech (TTS) system is able to generate its own speech based on what it learned from its training.

This contrasts with other TTS systems, which rely on hand-engineered inputs such as complex linguistic and acoustic features.

Tacotron 2 improves upon ideas from previous efforts, including Tacotron and WaveNet.

The team says listeners have rated it comparable to professional recordings.

Audio samples from the research demonstrate how Tacotron can generate speech to read specific texts.

It’s not quite perfect yet, but the team says it scored well in trials with human listeners.
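Listener trials of this kind are typically scored as a mean opinion score (MOS): each listener rates naturalness on a 1-to-5 scale and the ratings are averaged with a confidence interval. (The Tacotron 2 paper reportedly scored around 4.53, against roughly 4.58 for professional recordings.) The ratings below are made up purely to illustrate the arithmetic.

```python
# Mean opinion score (MOS) aggregation: average a set of 1-5 naturalness
# ratings and attach a 95% confidence interval. Ratings are hypothetical.
import math
import statistics

ratings = [5, 4, 5, 4, 5, 5, 4, 3, 5, 4]  # made-up listener scores

mos = statistics.mean(ratings)
# 95% interval assuming approximately normal rating noise.
ci95 = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))

print(f"MOS = {mos:.2f} +/- {ci95:.2f}")  # -> MOS = 4.40 +/- 0.43
```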

Still, there are a number of issues left to address.

‘While our samples sound great, there are still some difficult problems to be tackled,’ the researchers explain in the blog post.

‘For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises.

Tacotron 2 uses what’s known as a sequence-to-sequence model, which maps letters to features that encode the audio. The process incorporates pronunciation, volume, speed, and intonation, the researchers explain. Then, the features are converted to a 24 kHz waveform

‘Also, our system cannot yet generate audio in realtime. Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad.

‘Each of these is an interesting research problem on its own.’

The project builds on some of the ideas from WaveNet, which was said to be capable of creating natural-sounding synthesized speech by analyzing sound waves from the human voice, rather than focusing on human language.

Last year, the DeepMind researchers claimed the groundbreaking project had already halved the quality gap between computer systems and human speech.

The latest system aims to take text-to-speech even further, for more natural sounding computer-generated speech.
