Breder.org Software and Computer Engineering

Cloud-Assisted Command-Line Text-To-Speech (TTS)

I have previously written about running natural-sounding Text-to-Speech (TTS) models locally on a somewhat powerful -- and power-hungry -- desktop computer.

For my current hardware, aside from that desktop, running these models is not practical. Still, it would be nice to have natural-sounding TTS within a few keystrokes.

Cloud services such as Azure offer a TTS REST API. I found it easy to get started with, although I'm not sure the pricing is reasonable if you intend to serve more users than just yourself.

The script is aptly named tts, and it takes a single argument: the text to speak. For example, you can call it from the command line:

./tts "Hello world!"

The more useful application is copying the text of an article, then calling the tts script to have the article read aloud on your computer.
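As a rough sketch of that workflow, the clipboard contents can be passed straight to the script. The clipboard_text function below is a hypothetical helper, not part of the tts script, and it assumes that one of wl-paste (Wayland), xclip (X11) or pbpaste (macOS) is installed:

```shell
#!/bin/sh
# Hypothetical helper: print the clipboard contents using whichever
# clipboard tool happens to be installed on this machine.
clipboard_text() {
	if command -v wl-paste >/dev/null 2>&1; then
		wl-paste
	elif command -v xclip >/dev/null 2>&1; then
		xclip -selection clipboard -o
	elif command -v pbpaste >/dev/null 2>&1; then
		pbpaste
	else
		echo "no clipboard helper found" >&2
		return 1
	fi
}

# Usage, after copying some text:
#   ./tts "$(clipboard_text)"
```

Binding a line like that to a keyboard shortcut is one way to get the "few keystrokes" experience.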

Here's the source code:

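#!/bin/sh
# Usage: ./tts "Text to read aloud"
PLAYER=mpv
AZ_KEY="<generate it from the Azure console>"
VOICE=en-US-AvaNeural
ENDPOINT=https://brazilsouth.tts.speech.microsoft.com
FORMAT=ogg-48khz-16bit-mono-opus

# Escape XML special characters so that arbitrary text still yields valid SSML.
TEXT=$(printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')

# Request the synthesized audio and pipe it straight into the player.
curl -X POST "$ENDPOINT/cognitiveservices/v1" \
	-H "X-Microsoft-OutputFormat: $FORMAT" \
	-H "Ocp-Apim-Subscription-Key: $AZ_KEY" \
	-H "Content-Type: application/ssml+xml" \
	-d "<speak version='1.0' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='$VOICE'>$TEXT</voice></speak>" 2>/dev/null | "$PLAYER" - >/dev/null 2>&1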

The script streams the audio through a Unix pipe: as soon as the cloud API starts sending audio, the mpv player starts playing it through the speakers, which minimizes latency.
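Because the audio flows through a pipe, other tools can be spliced into it without giving up the streaming behavior. Here is a minimal sketch, assuming you want to keep a copy of the audio while it plays. The play_and_save function is a hypothetical helper, not part of the script above:

```shell
#!/bin/sh
# Hypothetical helper: read audio on stdin, write a copy to the file
# named by "$1", and feed the same bytes to the player as they arrive.
# PLAYER is left unquoted on purpose so it may carry extra arguments.
play_and_save() {
	tee "$1" | ${PLAYER:-mpv} - >/dev/null 2>&1
}

# Usage: replace the final "| $PLAYER -" stage of the script with
#   ... | play_and_save speech.ogg
```

tee forwards each chunk as soon as it is read, so playback should still start before the download finishes.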

Fine adjustments to the speech can be made by tuning the Speech Synthesis Markup Language (SSML) document sent in the request body, but I haven't felt the need to do that yet.
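For illustration only, here is a sketch of what such tuning might look like, using the standard SSML prosody element to change the speaking rate. The xml_escape and build_ssml functions are hypothetical names introduced here, and the accepted rate values (for example "+15%" or "slow") should be checked against the Azure SSML documentation:

```shell
#!/bin/sh
# Hypothetical helper: escape the characters that are special in XML.
xml_escape() {
	printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Hypothetical helper: wrap text in an SSML document.
#   $1 = text, $2 = voice name (optional), $3 = prosody rate (optional)
build_ssml() {
	text=$(xml_escape "$1")
	voice=${2:-en-US-AvaNeural}
	rate=${3:-default}
	printf "<speak version='1.0' xml:lang='en-US'><voice name='%s'><prosody rate='%s'>%s</prosody></voice></speak>" \
		"$voice" "$rate" "$text"
}

# Usage: pass the result as the request body, e.g.
#   curl ... -d "$(build_ssml "$1" en-US-AvaNeural +15%)"
```

Keeping the markup in one function means further experiments only touch the printf template.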

Cooler yet, since this is a REST API, it could also be integrated with a web page. One could use it to, say, let website visitors have an article such as this one read aloud straight from their browsers. If you do that, check the docs on how to issue a short-lived bearer token instead of leaking your AZ_KEY to a browser client.
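As a hedged sketch of the server-side half of that idea: the Speech service has a token endpoint that exchanges the subscription key for a temporary access token, which the client then sends as an Authorization: Bearer header in place of Ocp-Apim-Subscription-Key. The token_url and fetch_token functions are hypothetical names, and the endpoint path and token lifetime should be confirmed in the Azure docs for your resource:

```shell
#!/bin/sh
# Hypothetical helper: build the token endpoint URL for an Azure region.
token_url() {
	printf 'https://%s.api.cognitive.microsoft.com/sts/v1.0/issueToken' "$1"
}

# Hypothetical helper: exchange the subscription key for an access token.
#   $1 = region (e.g. brazilsouth), $2 = subscription key
# Meant to run on the server, so the key itself never reaches the browser.
fetch_token() {
	curl -s -X POST "$(token_url "$1")" \
		-H "Ocp-Apim-Subscription-Key: $2" \
		-H "Content-Length: 0"
}

# Usage (server side):
#   TOKEN=$(fetch_token brazilsouth "$AZ_KEY")
# The browser then calls the TTS endpoint with
#   Authorization: Bearer $TOKEN
```

Only the token is handed to the page, so a leaked value stops working once it expires.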