Where are the Best Practices?

There seems to be a lack of concrete best practices out there around cutting audio for Twilio. As far as I can tell, the reason for this is twofold: 

  • Computers are stupid fast these days. Services like Twilio automatically transcode audio on the fly with decent results in many cases, then cache the result. Most people seem to be okay with this, but it does cause problems sometimes.
  • There's money to be made driving you towards professional voice talent agencies, and because they have specialized skills and equipment, the results are almost always higher quality than what the average developer could create independently.

If you want to cut your own audio anyway, great news! After a decade of working with platforms like Twilio, Voxeo, and Tropo, I've pieced together the knowledge you're looking for. This article is a practical guide to producing high-quality prompts for modern cloud telephony platforms.

Optimal Audio Format

Here's the key: the last mile between the exchange and a landline telephone (AKA the local loop) is an analog circuit, and all audio headed for it is converted to a specific low-fidelity format: 8000 Hz, 8-bit, mono μ-law (uLaw) PCM.
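
To put numbers on that format, here is a minimal Python sketch. It computes the raw data rate of one telephony channel and evaluates the textbook continuous μ-law curve with μ = 255. The function names are made up for illustration, and this is not a bit-exact encoder: real G.711 codecs use a piecewise-linear approximation of this curve.

```python
import math

SAMPLE_RATE_HZ = 8000   # samples per second on a telephony channel
BITS_PER_SAMPLE = 8     # each companded sample fits in one byte
MU = 255                # companding constant for mu-law


def bitrate_bps(sample_rate_hz=SAMPLE_RATE_HZ, bits=BITS_PER_SAMPLE, channels=1):
    """Raw data rate of the stream, in bits per second."""
    return sample_rate_hz * bits * channels


def mulaw_compress(x, mu=MU):
    """Continuous mu-law curve: maps x in [-1, 1] to y in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


print(bitrate_bps())                   # 64000 bits per second
print(round(mulaw_compress(0.01), 3))  # a quiet sample is boosted well above 0.01
```

The curve spends most of its 256 output levels on quiet samples, which is how 8 bits manages to carry intelligible speech.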

This constraint caps telephony audio quality. It's the telephony standard, so even when you publish glorious high-fidelity WAV files, they will be converted on the fly. The short story is that the conversion costs you a little in quality and speed, but because this is for developers who are stressing out, I'll dive into the details:

  • Conversion Quality: Your audio will be transcoded by the platform, and in the olden days the results were often hideous. Today the algorithms have improved tremendously, and many formats seem to convert very well. Lossy formats like 44.1 kHz MP3 are downsampling disasters and should be avoided.

    See the sections below for tips about how to handle recording and downsampling.
  • Conversion Time: Platform-side transcoding can incur a slight delay on the first play. After that, the platform (whether it's Twilio, Voxeo, Tropo, etc.) usually serves the converted audio from its cache. I would only expect to notice a delay with large files.
  • Transmission: Higher quality sound files are larger, and will take more time and bandwidth to transmit between your web server and Twilio. Again, platform-side caching takes care of this after the first fetch.
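
For a rough sense of the size difference, here is a small Python sketch that estimates the audio payload of an uncompressed file, ignoring the WAV header. The 10-second prompt length is just an assumed example.

```python
def pcm_bytes(seconds, sample_rate_hz, bits_per_sample, channels=1):
    """Approximate size of the audio data in bytes (header not included)."""
    return int(seconds * sample_rate_hz * (bits_per_sample // 8) * channels)


studio = pcm_bytes(10, 48000, 16)  # 10 s of 48000 Hz 16-bit mono
phone = pcm_bytes(10, 8000, 8)     # 10 s of 8000 Hz 8-bit mono

print(studio, phone, studio // phone)  # 960000 80000 12
```

So the studio-quality version of the same prompt is about 12 times larger than the telephony version.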

In a nutshell, the best practice is to publish your prompts as 8000 Hz, 8-bit, mono μ-law (uLaw) PCM.
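
If you want to confirm that a file really is in this format before publishing it, one option is to read the WAV header yourself. The sketch below parses the 'fmt ' chunk with Python's struct module. The helper names are hypothetical, and it assumes a plain RIFF/WAVE file where μ-law is marked with format tag 7.

```python
import struct

WAVE_FORMAT_MULAW = 0x0007  # format tag used for mu-law audio in WAV files


def read_wav_format(data: bytes) -> dict:
    """Return the main fields of the 'fmt ' chunk from the bytes of a WAV file."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, pos + 8
            )
            return {"format_tag": tag, "channels": channels,
                    "sample_rate": rate, "bits": bits}
        pos += 8 + size + (size % 2)  # chunks are padded to an even length
    raise ValueError("no 'fmt ' chunk found")


def is_telephony_ready(fmt: dict) -> bool:
    """True when the header describes 8000 Hz, 8-bit, mono mu-law audio."""
    return (fmt["format_tag"] == WAVE_FORMAT_MULAW and fmt["channels"] == 1
            and fmt["sample_rate"] == 8000 and fmt["bits"] == 8)
```

To use it, read a file in binary mode (say, a hypothetical prompt.wav) and pass the bytes to read_wav_format, then hand the result to is_telephony_ready.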

How do I get my prompts into this format?

The recording tools I use don't allow me to record at such a low fidelity, and even if they did, the best practice is to archive high-fidelity audio so you have as much control as possible later. I get the best results with this workflow:

  • Record and edit at 48000 Hz. I use Ableton Live and record each prompt as a separate track.
  • Export all tracks as separate 48000 Hz, 16- or 24-bit, mono uncompressed WAV files. Ableton has a way to do this in bulk.
  • Use Audacity to automatically trim dead space from the start and end of all files.
  • Batch convert the WAV files to 8000 Hz, 8-bit, mono μ-law (uLaw) PCM, also using Audacity.
  • Archive the Ableton project forever.
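
If you'd rather script the batch conversion step, here is a sketch that swaps Audacity for the ffmpeg command-line tool, driven from Python. This is an alternative to the workflow above, not part of it. It assumes ffmpeg is installed and on your PATH, and the function and folder names are placeholders.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst):
    """Command that converts one file to 8000 Hz mono mu-law WAV."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", str(src),        # input file
        "-ar", "8000",         # output sample rate in Hz
        "-ac", "1",            # one channel (mono)
        "-c:a", "pcm_mulaw",   # mu-law audio codec
        str(dst),
    ]


def convert_folder(src_dir, dst_dir):
    """Convert every .wav file in src_dir, writing the results to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(src_dir).glob("*.wav")):
        subprocess.run(build_ffmpeg_cmd(wav, dst / wav.name), check=True)
```

Calling convert_folder("exports", "telephony") would process a hypothetical exports folder. As with any converter, listen to the results over the phone before trusting them.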

A word on sample rates...

Often the default sample rate will be 44100 Hz, but you're going to be downsampling to 8000 Hz, and the lore of my people clearly states that downsampling algorithms work best with whole-number ratios. 48000 is an exact multiple of 8000 (6 times), while 44100 is not.
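
The arithmetic behind that lore is easy to check. This short Python sketch reduces each source rate against the 8000 Hz target using exact fractions; the function name is made up for illustration.

```python
from fractions import Fraction


def resample_ratio(source_hz, target_hz=8000):
    """Exact ratio between the source and target sample rates."""
    return Fraction(source_hz, target_hz)


for rate in (48000, 44100):
    ratio = resample_ratio(rate)
    kind = "integer" if ratio.denominator == 1 else "fractional"
    print(rate, ratio, kind)
# 48000 6 integer
# 44100 441/80 fractional
```

With a ratio of 6, a converter can filter the audio and keep every sixth sample. With 441/80 it has to interpolate between samples, which is more work, although a good modern resampler should cope with either.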

Tips

  • Before you spend a lot of time learning to be a sound engineer, why not just upload your high-quality audio to Twilio and check out the result? The quality might surprise you.
  • If you have dozens of audio files, you can spend a lot of time manually removing dead space from the beginning and end of each clip. Audacity, which is free, has a way to do this automatically, and in bulk over a folder full of files. Look into Chains (renamed Macros in newer versions of Audacity), with Truncate Silence and Export WAV. If anyone has questions about this I'm happy to share.
  • Audacity can also bulk convert your WAV files.
  • When you transcode down to the telephony standard, your audio will sound quite degraded on your computer, because it is degraded. But you have to listen over the phone to really know if you have a good result.
  • Using a compressor and equalizer can get the very best results. Remember that you won't necessarily know how you're doing until you hear the audio over the phone.
  • Do not try to downsample MP3s! 
  • Please don't use your webcam microphone for any kind of serious recording. One surprisingly good starter USB mic is the ATR 2100, which you can get on Amazon for about $60. Considering that you're going to downsample everything to 8 kHz, I think it's very appropriate for the job.
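
To make the dead-space tip concrete, here is a deliberately simplified Python sketch of trimming silence from both ends of a clip. It is not how Audacity's Truncate Silence works internally. It assumes the audio is already loaded as a list of signed 16-bit sample values, and the threshold of 500 is an arbitrary example.

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples quieter than the threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


clip = [0, 3, -2, 900, -1200, 40, 700, 5, 0]
print(trim_silence(clip))  # [900, -1200, 40, 700]
```

Quiet samples in the middle of the clip (the 40 here) are kept, so pauses inside a prompt survive.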

UPDATE: Twilio has also posted a guide on this subject, maybe in response to this, and it has some nice tips on applying EQ: What Are Some Best Practices for Audio Recording
