I’ve been trying to repeat my experiment from here but with data from one TIMIT speaker. That is, the goal is to train only on the data of one speaker going ‘aa’ (the middle third of every such phone, to try to reduce coarticulation issues) and get a network that generates an ‘aaa’ sound.
The network predicts the next three samples from the previous 160, has two rectified-linear hidden layers, and uses L2 regularization with parameter 10^(-4). It was trained with vanilla SGD with learning rate 0.3 and batch size 100. The input data came from speaker number 104 and was normalized by the maximum of the phone each example came from. For the synthesis, each next sample was drawn from a normal distribution with mean given by the network output and standard deviation given by the empirical RMSE multiplied by 0.1. (This isn’t mathematically right, but adding the multiplier makes the generated sound better; with a multiplier of 1 I get static.)
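The synthesis loop can be sketched roughly like this. Here `predict` stands in for the trained network (a map from a 160-sample window to a 3-sample prediction) and `rmse` for the empirical per-output RMSEs; both names are assumptions for illustration, not the actual training code:

```python
import numpy as np

def synthesize(predict, seed, n_samples, rmse, noise_scale=0.1, rng=None):
    """Autoregressive synthesis: repeatedly predict the next 3 samples
    from the previous 160, then sample each around the prediction.

    noise_scale=0.1 is the ad-hoc multiplier on the RMSE described
    above (using the full RMSE as the std gives static).
    """
    rng = rng or np.random.default_rng(0)
    buf = list(seed)  # seed must hold at least 160 samples
    while len(buf) < len(seed) + n_samples:
        window = np.asarray(buf[-160:])
        mean = predict(window)                 # network output, shape (3,)
        nxt = rng.normal(mean, noise_scale * rmse)
        buf.extend(nxt.tolist())
    return np.asarray(buf[len(seed):len(seed) + n_samples])
```

A dummy predictor (e.g. one that just echoes the last three samples) is enough to check the plumbing before plugging in the real network.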
The above wav sounds kind of OK, but it’s not robust at all. Here is the same model generating sound when trained on data for speaker 179. It sounds terrible.
The MSEs of the two models (with data renormalized by the empirical standard deviation of all of TIMIT) are:
| speaker number | train MSE | valid MSE |
|----------------|-----------|-----------|
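For reference, renormalizing by the corpus standard deviation is equivalent to dividing the raw squared errors by that standard deviation squared, as in this sketch (`sigma_timit` is assumed to be a precomputed scalar, not something from my actual code):

```python
import numpy as np

def renormalized_mse(pred, target, sigma_timit):
    # Dividing both signals by sigma_timit before squaring is the same
    # as dividing the raw MSE by sigma_timit**2.
    err = np.asarray(pred) - np.asarray(target)
    return np.mean(err ** 2) / sigma_timit ** 2
```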
This suggests I have an overfitting problem (the gap between training MSE and validation MSE is much larger than Jean Philippe’s). In retrospect this isn’t very surprising, since I’m training on very little data.
It also suggests that the data from speaker 179 is somehow much harder to “imitate” than the data from speaker 104. Not sure why that would be the case.
The next thing I’ll do is see if I can reduce overfitting by using stronger regularization.