For the last couple of days I have been messing around with formant synthesis. I started looking into it because I want to make a virtual pet application with a voice. The voice needs to be able to sing, and to speak with a great deal of expression, in order to be fun as a character in the app. I cannot use prerecorded speech because I want to have a lot of dynamic content, and I cannot use a text-to-speech system that is based on concatenation because these systems don't have nearly enough flexibility in terms of inflection and the like. The only other option I am aware of is formant synthesis, which attempts to model the human voice mathematically. There are a few old formant-based synthesis solutions around, but the ones I managed to find that were still available were too expensive. So, somewhat pretentiously, I decided to try my hand at making a little formant-based voice myself. I have no idea whether I will be able to make it say a single word, let alone make something even remotely usable, but I have some free time over the Easter holidays so what the hell. Here goes.
First, the information that I was able to find about formant synthesis on Google was very hard to read. There were several papers on the subject, but they went several feet above my head so I gave up in frustration. Then there were lots of historical documents, but while they were interesting to read in themselves, they did not provide any information on how to actually build a formant synthesizer. Then I came across a page that mentioned in passing that a basic formant voice is made up of an oscillator (usually a sawtooth waveform), which is then pushed through a series of two-pole low-pass filters. Each filter is configured to model a single formant, that is, one of the resonant peaks in the sound made by a particular speaker (everyone's formants are different, of course).
I then started looking around for more information on how to actually configure these filters. As far as I knew, a low-pass filter just takes away frequencies above a certain threshold, and I had absolutely no idea how low-pass filters would be of any use to me when trying to turn a sawtooth into something resembling a voice. Then I came across an interesting page: a listing of several vowels and their formants, ranging from bass to soprano. Hurray! While I don't know exactly what the bandwidth and amplitude do, I am used to seeing these parameters in various low-pass filters I've played around with. So the next task was to find an implementation of a low-pass filter, which was not too hard.
biquad.c from musicdsp.org did just what I needed, plus a lot more. Apparently a biquad filter can act as a base for lots of filters (including low-pass with x number of poles). Bingo. Now all I had to do was try out the whole chain: generate the sawtooth, configure the five filters as listed in the formant table, and play the resulting signal.
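To make the idea of a biquad a bit more concrete, here is a minimal sketch in Python. It is not biquad.c itself. It assumes the widely circulated "Audio EQ Cookbook" low-pass formulas by Robert Bristow-Johnson, with the bandwidth given in octaves, and the names `lowpass_coeffs` and `Biquad` are made up for this example.

```python
import math


def lowpass_coeffs(cutoff_hz, bandwidth_octaves, sample_rate=44100.0):
    """Return normalized (b0, b1, b2, a1, a2) for a cookbook-style low-pass.

    Assumption: coefficients follow the Audio EQ Cookbook formulas, with
    alpha derived from a bandwidth expressed in octaves.
    """
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate
    sn, cs = math.sin(w0), math.cos(w0)
    alpha = sn * math.sinh(math.log(2.0) / 2.0 * bandwidth_octaves * w0 / sn)
    a0 = 1.0 + alpha
    b0 = (1.0 - cs) / 2.0 / a0
    b1 = (1.0 - cs) / a0
    b2 = b0
    a1 = -2.0 * cs / a0
    a2 = (1.0 - alpha) / a0
    return b0, b1, b2, a1, a2


class Biquad:
    """Two-pole, two-zero filter in direct form I, one sample at a time."""

    def __init__(self, coeffs):
        self.b0, self.b1, self.b2, self.a1, self.a2 = coeffs
        # Previous two inputs and previous two outputs.
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def process(self, x):
        y = (self.b0 * x + self.b1 * self.x1 + self.b2 * self.x2
             - self.a1 * self.y1 - self.a2 * self.y2)
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y
```

A filter built this way is used as `f = Biquad(lowpass_coeffs(600.0, 1.0))` followed by `f.process(sample)` for each input sample. Other filter types (band-pass, notch and so on) only differ in how the five coefficients are computed, which is why one biquad routine can serve as the base for so many filters.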
I found some oscillator code in the Tonic project on GitHub. The alias-free sawtooth was part of a larger framework, so I ripped it out and used it to generate a two-second sawtooth with a quick pitch bend in the middle. It starts on A2 and then goes up a fifth after one second.
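For reference, a test tone like this can be sketched in a few lines of Python. This is a naive phase-accumulator sawtooth, so unlike the Tonic oscillator it is not alias-free. The 50 ms bend time, the 0.3 gain and the function name are assumptions made for the example; the only values taken from the description above are the two-second length, the A2 start (110 Hz) and the jump of a fifth (seven semitones) at the one-second mark.

```python
import math


def sawtooth_with_bend(duration=2.0, f_start=110.0, semitones=7.0,
                       bend_at=1.0, bend_time=0.05, gain=0.3,
                       sample_rate=44100):
    """Naive (aliasing) sawtooth that glides up by `semitones` at `bend_at`."""
    f_end = f_start * 2.0 ** (semitones / 12.0)
    samples = []
    phase = 0.0  # position within the current cycle, in [0, 1)
    for n in range(int(duration * sample_rate)):
        t = n / sample_rate
        # Progress through the bend, clamped to the range 0..1.
        x = min(max((t - bend_at) / bend_time, 0.0), 1.0)
        # Interpolate in log-frequency so the glide is even in pitch.
        freq = f_start * (f_end / f_start) ** x
        samples.append(gain * (2.0 * phase - 1.0))
        phase += freq / sample_rate
        phase -= math.floor(phase)  # wrap back into [0, 1)
    return samples
```

The `gain` factor is there for the same reason as the volume reduction mentioned next: a full-scale sawtooth is very loud.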
I lowered the volume a bit, since it nearly blasted my ears off initially.
Then it was time to configure the filters. One problem that I hit, and which I struggled with for at least three hours, was how to convert a filter bandwidth specified in Hz to octaves. The filter implementation that I found takes the bandwidth in octaves, while the formant tables all use Hz. Grr! I found a bunch of different formulae, but I could not really understand them. In the end I went for a cheap approximation from a forum post.
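For anyone stuck on the same conversion, here is one way to do it. This is not the forum approximation mentioned above (which is not reproduced here); it is a sketch that assumes the band edges sit symmetrically around the center frequency on a logarithmic scale, so that the edges are f0 * 2^(-N/2) and f0 * 2^(N/2) for a bandwidth of N octaves. Their difference in Hz is then 2 * f0 * sinh(N * ln(2) / 2), which can be solved for N. The function names are made up for the example.

```python
import math


def bandwidth_hz_to_octaves(center_hz, bandwidth_hz):
    """Bandwidth in octaves, assuming edges geometrically centered on center_hz."""
    return 2.0 / math.log(2.0) * math.asinh(bandwidth_hz / (2.0 * center_hz))


def bandwidth_hz_to_octaves_approx(center_hz, bandwidth_hz):
    """First-order shortcut, reasonable only when the band is narrow."""
    return bandwidth_hz / (center_hz * math.log(2.0))
```

For a narrow formant, say 60 Hz wide around 600 Hz, both functions give roughly 0.14 octaves. The shortcut drifts away from the exact value as the bandwidth grows relative to the center frequency.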
The next problem I struggled with was how to actually get the output I was expecting. I grabbed the numbers representing an A as sung by a bass, and configured the filters accordingly. But then what? Should I be pushing the signal through the filters one after the other (in series), or push the original signal through each filter separately and then combine the results (in parallel)? I went for the latter approach. The result was actually quite cool! The output was completely flat-sounding, of course, but you can definitely tell that it is trying to say the vowel A. I wasn't entirely happy with the settings from the original table, though, so I tweaked them ever so slightly until I got something I liked better.
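The parallel arrangement can be sketched as follows. A few assumptions to flag: this example uses cookbook-style band-pass sections (constant 0 dB peak gain) rather than the low-pass filters described above, because a band-pass makes it easy to see each section picking out its own formant; the amplitude column of the table is treated as a gain in dB; and the numbers in `EXAMPLE_FORMANTS` are placeholders in the general range that vowel tables give for a bass "a", not the tweaked settings used here.

```python
import math

SAMPLE_RATE = 44100.0

# (center Hz, gain dB, bandwidth Hz): placeholder values for illustration only.
EXAMPLE_FORMANTS = [
    (600.0, 0.0, 60.0),
    (1040.0, -7.0, 70.0),
    (2250.0, -9.0, 110.0),
    (2450.0, -9.0, 120.0),
    (2750.0, -20.0, 130.0),
]


def bandpass_coeffs(center_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Normalized (b0, b1, b2, a1, a2) for a band-pass with unity peak gain."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    sn, cs = math.sin(w0), math.cos(w0)
    # Hz -> octaves, assuming band edges geometrically centered on center_hz.
    octaves = 2.0 / math.log(2.0) * math.asinh(bandwidth_hz / (2.0 * center_hz))
    alpha = sn * math.sinh(math.log(2.0) / 2.0 * octaves * w0 / sn)
    a0 = 1.0 + alpha
    return alpha / a0, 0.0, -alpha / a0, -2.0 * cs / a0, (1.0 - alpha) / a0


def run_biquad(coeffs, signal):
    """Filter a whole list of samples with one biquad (direct form I)."""
    b0, b1, b2, a1, a2 = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def parallel_formants(signal, formants=EXAMPLE_FORMANTS):
    """Feed the same input to every filter and sum the gain-weighted outputs."""
    mix = [0.0] * len(signal)
    for center_hz, gain_db, bandwidth_hz in formants:
        gain = 10.0 ** (gain_db / 20.0)  # dB -> linear amplitude
        band = run_biquad(bandpass_coeffs(center_hz, bandwidth_hz), signal)
        for i, y in enumerate(band):
            mix[i] += gain * y
    return mix
```

In a series arrangement the output of one filter would instead become the input of the next, so the per-formant gains could not be set independently in the same way. With the parallel version each formant's level is a single multiplier, which makes tweaking by ear fairly painless.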
Super cool! The code is an absolute mess, but I don't care for the moment. I'm hacking. My next task is to see if I can transition between two vowels. Presumably you just ramp all the values from the source to the destination over x number of samples, but I have no idea. I'm not even looking into consonants yet; they scare me. Now, time to make the first commit into the repository and then get some much-needed sleep.
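As a rough sketch of that ramping idea (untested against real audio, so treat it as a starting point rather than a known-good method): interpolate every formant parameter linearly between the two vowels and hand out a fresh parameter set once per block of samples, so the filter coefficients only need recomputing at block boundaries. The block size of 64, the function names and the two vowel tables are assumptions for illustration; the numbers are placeholders, not values from the table used above.

```python
import math

# (center Hz, gain dB, bandwidth Hz) per formant: placeholder values only.
VOWEL_A = [(600.0, 0.0, 60.0), (1040.0, -7.0, 70.0), (2250.0, -9.0, 110.0)]
VOWEL_I = [(250.0, 0.0, 60.0), (1750.0, -30.0, 90.0), (2600.0, -16.0, 100.0)]


def lerp(a, b, t):
    """Linear interpolation: a at t=0, b at t=1."""
    return a + (b - a) * t


def interpolate_formants(source, target, t):
    """Blend two formant tables of equal length, parameter by parameter."""
    return [tuple(lerp(s, d, t) for s, d in zip(src, dst))
            for src, dst in zip(source, target)]


def formant_ramp(source, target, total_samples, block_size=64):
    """Yield (start_sample, formants) once per block across the transition.

    The first block gets the source vowel and the last block gets the target.
    """
    blocks = max(1, math.ceil(total_samples / block_size))
    for i in range(blocks):
        t = i / (blocks - 1) if blocks > 1 else 1.0
        yield i * block_size, interpolate_formants(source, target, t)
```

Iterating over `formant_ramp(VOWEL_A, VOWEL_I, 4410)` gives one parameter set every 64 samples over a 100 ms transition at 44.1 kHz. Whether a straight linear ramp sounds natural is an open question; interpolating frequencies on a logarithmic scale is one obvious variation to try.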