Breder.org

A Workflow To Turn e-Books Into Audiobooks with Natural Text-to-Speech Voices

It seems the AI judgment day has come and no software product will be left standing without AI-powered smart features. Let me ride this wave for my own gain: hopefully it will encourage me to finish more books.

My thesis is that, with modern machine learning approaches, there must be open-source software that produces natural-sounding computer-generated voices, replacing any need to listen to robotic ones (such as the trademark Stephen Hawking voice).

I set a few requirements for myself, which the steps below reflect.

1. From E-Books Into Text

Let me wave my hands over how the e-books are sourced and leave that as an exercise for the reader. Assume that we're downloading public domain books from Project Gutenberg.

e-Books usually come in one of a few formats: PDF, EPUB, or MOBI. If the e-Book is a series of scanned images from a physical book, tough luck: we need to extract the plain text, and Optical Character Recognition (OCR) is out of scope here.

I found that Calibre is a great tool to convert between e-Book formats. For the few books that I converted from epub to txt, I used the following command:

ebook-convert source.epub output.txt

It's as easy as that: the output will be a plain-text representation of the contents of the e-Book. You can also convert to or from PDF and other formats by adjusting the extensions of the input and output filenames, since Calibre picks the formats from them.
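
For example, converting a PDF source (assuming the PDF contains selectable text rather than scanned pages) looks like this:

ebook-convert source.pdf output.txt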

2. From Text Into Spoken Sentences

Now I'd suggest going into the /tmp/ directory and creating a folder there, so as not to wear out any storage medium with these intermediate files (on many systems /tmp/ is a RAM-backed tmpfs).
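
Something along these lines works; the directory name is arbitrary, and I'm assuming the plain-text file from step 1 gets copied in as source.txt, which is the name the commands below use:

mkdir /tmp/audiobook/
cd /tmp/audiobook/
cp /path/to/output.txt source.txt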

For the TTS software, I'm using piper, which is available at github.com/rhasspy/piper. I'm using the en_US-lessac-high model downloaded from Hugging Face.
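
If you don't have the model files yet, the voices are published in the rhasspy/piper-voices repository on Hugging Face. A download could look roughly like the following, but the exact paths are my assumption, so check the repository for the current layout (the model needs both the .onnx file and its .onnx.json config next to it):

wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/high/en_US-lessac-high.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/high/en_US-lessac-high.onnx.json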

The command which generates a WAV file for each sentence in the source text goes as follows:

mkdir output_dir/
cat source.txt | piper -m en_US-lessac-high.onnx -d output_dir/

On my computer, the ratio between the audio duration and the time it takes to generate it is 5.44 (meaning that for each minute I leave my computer running, more than five minutes of audio are generated).
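
If you want to measure that ratio on your own machine, one way is to time the generation step and then sum the durations of the generated WAV files (soxi ships with sox and prints one duration, in seconds, per file); dividing the total audio seconds by the wall-clock time gives the ratio:

time sh -c 'cat source.txt | piper -m en_US-lessac-high.onnx -d output_dir/'
soxi -D output_dir/*.wav | paste -sd+ - | bc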

I estimate a typical audiobook for a 200-page book clocks in at somewhere between 6 and 8 hours of spoken audio, so at that ratio generating the audio for all the sentences should take roughly an hour to an hour and a half. Sweet.

3. From Spoken Sentences Into a Single File

Now we have a bunch of WAV files, one per sentence. The easiest way I found to stitch them together was the sox program, which concatenates the audio files given on its command line into the final output file.

sox output_dir/*.wav concatenated.wav

There may be a way to adjust the pauses between the sentences at this step (one option is sketched below), but I found the unmodified configuration to work well enough for me.
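
If you do want longer pauses, one approach is to append a bit of trailing silence to each sentence file with sox's pad effect before concatenating; the 0.5 seconds here is just an illustrative value:

mkdir padded_dir/
for f in output_dir/*.wav; do sox "$f" "padded_dir/$(basename "$f")" pad 0 0.5; done
sox padded_dir/*.wav concatenated.wav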

4. Compressing the Audio Into MP3

WAV files are uncompressed audio and needlessly huge. Let's use ffmpeg to convert our concatenated audio into an MP3 file. You may also use FLAC as the output format instead, which is lossless, but I honestly cannot hear the difference.

ffmpeg -i concatenated.wav final_output.mp3
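
Without extra flags, ffmpeg picks a default bitrate for the MP3 encoder; if you want explicit control over the size/quality trade-off, you can set one yourself, for example:

ffmpeg -i concatenated.wav -b:a 64k final_output.mp3

Swapping the output extension to .flac gives you the lossless variant instead.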

Now move final_output.mp3 out of the /tmp/ directory and enjoy your computer-generated audiobook!

Benchmarks

As a benchmark, the bitrate of the WAV file was 352 kb/s, and the bitrate of the corresponding MP3 file was 32 kb/s, so compressing the audio reduces the disk usage more than ten-fold.

Also as a benchmark, a 200-page book led to a 92 MB MP3 file with 6 hours and 41 minutes of spoken audio. The official audiobook has a listening length of 7 hours and 55 minutes, so the generated voice reads somewhat faster than a professional audiobook narrator.