VALL-E is a zero-shot text-to-speech (TTS) synthesizer built on a neural codec language model. It uses an off-the-shelf neural audio codec model and leverages existing TTS training data to generate high-quality personalized speech.

The main benefits of VALL-E include:

  • Zero-shot synthesis capabilities
  • In-context learning capabilities
  • High-quality personalized speech generation
  • TTS training data scaled up to 60K hours of English speech

Possible use cases for VALL-E include pairing it with GPT text generation to produce spoken output in applications such as automated customer service and voice search.
