Before I started learning about phonetics and speech synthesis, I didn't know what the concept of "voicing" meant. Now, I am only too painfully aware of what it means and have developed a hatred for it which will burn for at least five days.
Provided that my understanding is actually correct, voicing is the amount of pulse (that is, vibration from the vocal cords) that a given sound requires. In English, many unvoiced sounds have a voiced (or rather semi-voiced) equivalent. F, for instance, becomes V if you add voicing to it, S becomes Z, SH becomes ZH, and so on. H doesn't seem to have a voiced equivalent, thankfully.
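The pairs can be written out as a simple lookup table. This is just a sketch (the names follow the informal ASCII spellings used in this post, not any particular phoneme notation):

```python
# Unvoiced English consonants and their voiced (or semi-voiced) counterparts.
VOICED_COUNTERPART = {
    "F": "V",    # "fan" -> "van"
    "S": "Z",    # "sip" -> "zip"
    "SH": "ZH",  # "mesh" -> "measure"
    "TH": "DH",  # "thin" -> "this"
}

# H has no voiced counterpart, so it is simply absent from the table.
```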
So why do I hate these so much? Simply because the filter settings required to produce them are really, really annoying to find. The frequencies that should be allowed through in the voiced signal when producing one of these sounds are roughly in the same range as the vowels, while the noise (AKA the unvoiced) part of the same sound requires much higher frequencies. So in a sense, we need two separate sets of filter settings - a low set for the voiced signal and a high set for the unvoiced one.
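To make the "two sets of filter settings" idea concrete, here is what the parameters for a single semi-voiced sound might look like. The actual numbers are made up for illustration; only the general shape (voiced formants in the vowel range, unvoiced ones much higher) comes from the explanation above:

```python
# Hypothetical formant frequencies in Hz for one semi-voiced sound.
# The values are illustrative, not taken from any real phoneset.
VOICED_FORMANTS = [300, 1200, 2200, 3000, 3800]     # roughly vowel territory
UNVOICED_FORMANTS = [1800, 3200, 4800, 6200, 7500]  # the noise part sits much higher

# Each source gets filtered with its own set, instead of forcing
# one set of five formants to serve both signals at once.
```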
I knew that the output of unvoiced and semi-voiced sounds would be bad with only one set of filter settings. But how bad, I could not have imagined. To test, I tried to come up with a sentence that includes as many semi-voiced sounds as possible. I landed on the following:
"I love these rather visual roses."
This has V at the end of "love" and at the beginning of "visual", DH at the beginning of "these" and in the middle of "rather", ZH in the middle of "visual", and Z in the middle and at the end of "roses". So let's take a listen to what this sounds like if you apply the filter settings intended for the unvoiced source to the voiced one:
Aside from the voice suddenly sounding like a badly configured synthesizer, there is also a problem with the way the vowels in "visual" are rendered. Leaving that aside for the moment, how on Earth do we get rid of this buzzy sound and the infuriating sweeps? I tried a few things:
* Set the fifth formant frequency to 400 Hz, so that it would act as a lowpass filter. This worked in the sense that I got rid of the buzz in the middle of the sound, but the jump between 400 Hz and whatever frequency the fifth formant needed before and after the semi-voiced sound was too great. The result was that the middle of the semi-voiced sound was fine, but there was still a horrible sweep at the edges, so I scrapped that idea.
* Try the same with the first formant, based on the theory that the frequencies would be a lot closer (the first formant is the lowest of the five and is often pretty close to 400 Hz in voiced sounds). This didn't work at all because it broke the relationship between the first and second formants, which is really important to maintain, so I scrapped that idea as well.
After a few hours' rest and some chicken nuggets, my frustration with formant frequencies was back down to a manageable level. The solution I finally went for is rather hacky (as if the first two weren't), but it worked fairly well: I simply scaled all five formant frequencies by a certain factor to make them lower, on the assumption that the phoneset contains parameters optimized for the noise source. This uncovered a few issues in some of the unvoiced sounds I was using, so I had to redesign them, but I got something that sounds pretty reasonable in the end. Here it is:
Granted, this still sounds like he has some sort of speech defect, but it does show that we can get fairly close to the semi-voiced sounds with a simple scaler applied to the unvoiced parameters. ZH is very broken at the moment because I haven't found the right settings for it yet, but I will keep hacking away at it.
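In code, the scaling trick amounts to very little. A minimal sketch, assuming the phoneset stores formant frequencies tuned for the noise source (the default factor of 0.5 is an arbitrary placeholder, not the value actually used):

```python
def scale_formants(noise_formants, factor=0.5):
    """Derive voiced-source formant frequencies from parameters
    that were tuned for the noise source, by scaling them down."""
    return [f * factor for f in noise_formants]

# e.g. formants tuned for the noise part of a sound, scaled down
# to drive the filters on the voiced source instead:
voiced = scale_formants([1800.0, 3200.0, 4800.0, 6200.0, 7500.0])
```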
For clarity, here is the same sentence rendered at a slower speed:
While working on this, I had to split the filter settings into two sets (one for the voiced signal and one for the unvoiced), but I had a feeling I would need to do that sooner or later anyway. The next step is to combine the voiced and unvoiced sources with their respective filter settings, which should hopefully lead to something that is a bit easier to understand. As I write this, the code in the repository is sort of broken - it can't produce any unvoiced signal at all at the moment, but this will be fixed as soon as I introduce envelopes.
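For the curious, that next step (two sources, each run through its own filter settings, summed at the end) can be sketched roughly as below. It uses a standard two-pole digital resonator as the formant filter; everything here is an assumption about one possible implementation, not the code in the repository:

```python
import math
import random

SAMPLE_RATE = 16000

def resonator(signal, freq, bandwidth):
    """One formant: a two-pole bandpass (digital resonator)."""
    r = math.exp(-math.pi * bandwidth / SAMPLE_RATE)
    c1 = 2.0 * r * math.cos(2.0 * math.pi * freq / SAMPLE_RATE)
    c2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + c1 * y1 + c2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def cascade(signal, formants, bandwidths):
    """Run the signal through one resonator per formant."""
    for f, b in zip(formants, bandwidths):
        signal = resonator(signal, f, b)
    return signal

def semi_voiced(n, voiced_formants, unvoiced_formants, bandwidths, pitch=110.0):
    """Mix a pulse train and noise, each filtered with its own settings."""
    period = int(SAMPLE_RATE / pitch)
    pulses = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    noise = [random.uniform(-1.0, 1.0) for _ in range(n)]
    voiced = cascade(pulses, voiced_formants, bandwidths)
    unvoiced = cascade(noise, unvoiced_formants, bandwidths)
    return [v + u for v, u in zip(voiced, unvoiced)]
```

The key point is that `cascade` is called twice with different formant lists; the single-filter-set version this post started from would have used the same list for both calls.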