2019-01-08

Even though I haven't written anything in this diary since before Christmas, I have spent quite a bit of time on this project, as I have had some time off from my day job. All the details are on the timeline page, but here is a summary of what I have done:

1. I got a friend to record a couple of short stories which contain all of the phones in the US phoneset, and then another friend and I went through the recording and spliced all the unique sounds into individual Wave files.

2. I then wrote a script for a program called Praat which is able to extract formant frequencies and bandwidths from these files automatically. The script can handle both voiced and unvoiced sounds, but I had to make the distinction manually between what is voiced and unvoiced, not to mention the fact that there are semi-voiced sounds as well. But after a lot of manual tweaking, I had formant profiles for all of the sounds that are needed for US English.
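As a side note on why a frequency and a bandwidth per formant are useful numbers to have: in a Klatt-style formant synthesizer, each such pair can drive a two-pole digital resonator. Here is a minimal Python sketch of that standard resonator. The function names and the 700 Hz / 130 Hz values are made up for illustration, and this is not necessarily how the synthesizer in this project is written.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients of a two-pole resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate):
    """Run a list of samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate)
    y1 = y2 = 0.0  # the previous two output samples
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Illustrative values only, not measurements from the recordings.
impulse = [1.0] + [0.0] * 99
ringing = resonate(impulse, freq_hz=700.0, bandwidth_hz=130.0, sample_rate=16000)
```

Feeding a single impulse through the resonator gives a decaying oscillation near the formant frequency. A wider bandwidth makes it die out faster.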

Here is an example of a static vowel:

Listen

That's the vowel aa again, the same one we had before, but this time extracted from the audio recording. As you can hear, it does sound a bit different, but it is more or less the same sound. Let's check out iy as well:

Listen

That also sounds pretty good. But the vowels that are actually diphthongs, such as the one in the word "I", are a lot more interesting. Here is what the ay phone sounds like:

Listen

Also pretty cool, but one step remains: producing actual words from these individual sounds. This is a rather challenging problem. You could of course enter phones and durations manually, but that takes an enormous amount of time, and most end users wouldn't be particularly happy with it. That's where the text processing steps come in. That's a field which deserves several books to itself, and while I did quite a bit of reading up on the subject, I won't go into it here. Luckily, at least for US English, I don't have to do much.

Flite is a lightweight version of Festival, a system intended for speech synthesis research which is very well designed and quite comprehensive. Flite has a text processing subsystem which I can plug in as the front-end to my own synthesis code in order to convert text to phones and even get duration and pitch information from it. I made a bunch of utility programs to extract this information manually for now, which can be found in the text_processing folder of this repository. So what remains is to run some text through the processor, get the phones and durations, and feed them to the synthesizer.
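To make "phones and durations" a bit more concrete, here is a minimal Python sketch of how such a sequence could be represented and turned into start and end times before synthesis. The `Segment` class, the `to_timeline` function and the durations are all hypothetical. They are not Flite's output format, nor the format used by the utilities in the text_processing folder.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    phone: str       # phone name, e.g. "ay"
    duration: float  # length in seconds


def to_timeline(segments):
    """Turn per-phone durations into (phone, start, end) tuples."""
    timeline = []
    start = 0.0
    for seg in segments:
        end = start + seg.duration
        timeline.append((seg.phone, start, end))
        start = end
    return timeline


# Made-up durations for the phones of "I know", purely for illustration.
utterance = [Segment("ay", 0.25), Segment("n", 0.125), Segment("ow", 0.25)]
for phone, start, end in to_timeline(utterance):
    print(f"{phone}: {start:.3f}s to {end:.3f}s")
```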

Then of course we have to take these phones and durations and make transitions between them in order to produce the final output. As I am writing this, I only have a very basic model in place, which transitions from one phone to the next at the exact mid-point. This is probably incredibly wrong (God knows I'm not a linguist), but it works as a first attempt. I haven't made the synthesizer change its pitch either, so we have a single fundamental frequency for the entire utterance. Having said all that, here we go:
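One way to read a mid-point transition model is this: each phone reaches its formant targets at the middle of its duration, and the values in between are linearly interpolated from one mid-point to the next. Here is a minimal Python sketch of that reading. It is an assumption about the model, not a copy of the synthesizer's code, and the function names and formant values are invented for illustration.

```python
def midpoints(segments, targets):
    """Place each phone's formant targets at the middle of its duration.

    segments: list of (phone, duration_in_seconds)
    targets:  dict mapping phone -> tuple of formant frequencies in Hz
    """
    points = []
    start = 0.0
    for phone, duration in segments:
        points.append((start + duration / 2.0, targets[phone]))
        start += duration
    return points


def formants_at(time, points):
    """Linearly interpolate formant values between neighbouring mid-points."""
    if time <= points[0][0]:
        return points[0][1]   # before the first mid-point: hold the first target
    if time >= points[-1][0]:
        return points[-1][1]  # after the last mid-point: hold the last target
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= time <= t1:
            w = (time - t0) / (t1 - t0)
            return tuple(a + w * (b - a) for a, b in zip(v0, v1))


# Hypothetical F1/F2 targets, not values measured from the recordings.
targets = {"aa": (700.0, 1200.0), "iy": (300.0, 2300.0)}
points = midpoints([("aa", 0.25), ("iy", 0.25)], targets)
print(formants_at(0.25, points))  # halfway between the two mid-points
```

The pitch is left out on purpose here: with a single fundamental frequency for the whole utterance, only the formant values change over time.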

Listen

I took the liberty of increasing the volume of the audio file a little, just to make it easier to understand. But in case you didn't understand it anyway, the text I entered was:

"I know I am a real man."

Aside from sounding like a drunk alien with a cold, it's actually speaking words! This was a huge breakthrough for me.

Just for kicks, here's how the voice sounds if you slow it down:

Listen

Not quite a real man, but something resembling human speech nonetheless.