There’s WALL-E. And then there’s VALL-E

Did you see the Pixar movie WALL-E? It was named one of the best movies of 2008, and it’s still fun to watch after all these years. A movie for the whole family. One of the most thrilling parts of the movie is when a rogue AI, AUTO, fights the humans and the human-friendly robots (also powered by AI, of course). And no surprise here: AUTO was acting on a directive from humans.

How things have changed since 2008, when AI was still science fiction.

I am not sure if the movie, or at least the name, was an inspiration for the researchers at Microsoft, but here we are. Fresh from Microsoft’s research labs, I give you: VALL-E, a neural codec language model for speech synthesis. It sounds very exciting!

What does it do?

With just a three-second voice recording, it can turn any text into speech in that voice. Truly a technological achievement. To quote Microsoft: 'VALL-E could preserve the speaker’s emotion.'

How did the researchers achieve that? How did they train a system like this?

Like this: 'During the pre-training stage, we scale up the [text to speech] training data to 60,000 hours of English speech, which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.'
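To make the recipe concrete, here is a toy sketch of the data flow the quote describes: a neural codec turns the 3-second enrolled recording into discrete acoustic tokens, and a language model then emits new acoustic tokens for the target text, conditioned on that prompt. None of this is Microsoft's code; every name, number, and function here is an illustrative stand-in.

```python
# Toy illustration of the VALL-E-style pipeline described above.
# All components are stand-ins with made-up values, not the real model.
import random

CODEBOOK_SIZE = 1024       # discrete codes per codec quantizer (toy value)
FRAMES_PER_SECOND = 75     # codec frame rate (toy value)

def encode_audio(seconds: float) -> list[int]:
    """Stand-in for a neural audio codec: audio -> discrete token ids."""
    n_frames = int(seconds * FRAMES_PER_SECOND)
    return [random.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]

def synthesize(text: str, acoustic_prompt: list[int]) -> list[int]:
    """Stand-in for the language model: given text and the prompt's acoustic
    tokens, autoregressively emit acoustic tokens that a codec decoder would
    turn back into speech in the prompt speaker's voice."""
    out: list[int] = []
    for ch in text:
        # A real model predicts from learned context; here we just mix the
        # text token with the prompt deterministically to show the data flow.
        nxt = (ord(ch) + acoustic_prompt[len(out) % len(acoustic_prompt)])
        out.append(nxt % CODEBOOK_SIZE)
    return out

prompt_tokens = encode_audio(3.0)              # the 3-second enrolled recording
speech_tokens = synthesize("Hello there", prompt_tokens)
print(len(prompt_tokens), len(speech_tokens))  # → 225 11
```

The point of the sketch is the shape of the system, not its quality: the "voice" is just a short token prompt, which is exactly why three seconds of anyone's speech is enough.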

Somehow they omit a detail. Where do you find 60,000 hours of speech!?! Where is the library of non-copyrighted sound that allows you to do that? Did they ask 60,000 employees of Microsoft to talk for one hour? Or are they capturing all the conversations in Teams?

Naturally, there are the disclaimers: 'VALL-E could synthesize speech that maintains speaker identity and could be used for educational learning, entertainment, journalistic, self-authored content, accessibility features, interactive voice response systems, translation, chatbot, and so on.'

I like the all-encompassing '...and so on.' And the researchers are even aware of what can go wrong.

'It may carry potential risks in the misuse of the model, such as spoofing voice identification or impersonating a specific speaker.'

You think so?? Before reading this, it would never have occurred to me that it could be used for something like that.

What could go wrong?

Do you remember back in May when Scarlett Johansson said OpenAI asked to use her voice twice?

You would think that somewhere in the research lab, somebody would say, 'I know it sounds really cool to work on something like this, but is this really the next breakthrough we are looking for?'

Just type 'AI copyright' into the Google search bar and you get pages and pages of articles. While the likes of OpenAI and Anthropic are fighting copyright lawsuits, claiming that scraping content is fair use, Microsoft takes almost 7 years' worth of voice recordings and trains its voice system on them.
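And yes, the 7-year figure checks out: 60,000 hours of audio, played back to back, runs for nearly seven years.

```python
# 60,000 hours of training speech, expressed as continuous playback time.
hours = 60_000
years = hours / 24 / 365
print(round(years, 2))  # → 6.85
```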

But the outcome is not: 'Write me an essay in the tone of Hemingway.'

The outcome is: 'Here is new text. Say it exactly like this person.'

This VALL-E research project aligns nicely with Microsoft's misguided effort to resurrect Clippy as Copilot, and, even worse, with Recall.

The recurring pattern? Sadly, humanity will have to learn to live with yet another great innovation from Microsoft.
