compound-word-transformer
Lack of validation set?
Hi there,
Thanks for the implementation! I'd appreciate it if you could share more insight on why there's no validation/test set involved during training.
Best,
Hi,
It's an interesting question. We did have this kind of discussion in the early stages. We used to run validation during training and found that the validation loss would be extremely high and did not reflect the quality of the generated results.
Our conclusion was that "overfitting" is, to some extent, an important or even necessary factor for a good generative LM. Models with higher validation loss might generate better results because they are more likely to "remember" good sentences written by humans. I recall a paper that mentioned this phenomenon as well (but I forget its title...).
Furthermore, the quality hugely depends on another factor: "sampling" at the inference stage. Combining these two factors, we concluded that the runtime validation loss might not be very useful, so we dropped it in all our subsequent work.
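For clarity, here is a minimal sketch of what I mean by "running validation during training" (PyTorch-style; `model` and `val_loader` are just placeholders, not names from this repo):

```python
import torch

# Minimal sketch of a per-token validation loss pass for an autoregressive LM.
# `model` and `val_loader` are hypothetical placeholders, not repo names.
@torch.no_grad()
def validation_loss(model, val_loader, device="cuda"):
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for x, y in val_loader:                      # (input tokens, next-token targets)
        x, y = x.to(device), y.to(device)
        logits = model(x)                        # (batch, seq_len, vocab)
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1), reduction="sum"
        )
        total_loss += loss.item()
        total_tokens += y.numel()
    model.train()
    return total_loss / total_tokens             # per-token NLL on held-out songs
```

This is the number we found to be very high without the generated music sounding worse.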
Hi,
Thanks for the detailed reply.
I remember a beginner course project where I supervised some students training on the Bach chorale dataset with a CNN. The results turned out to be pretty good, with proper voice leading and contrapuntal movement, and I was a bit surprised that a CNN could produce such good results. After diving into the code, I realized that there was no validation set involved. After some exploration, it became clear that the generated results were basically "copying" whatever the model had seen in the training set, which doesn't reflect the model's generation and generalization ability. Have you checked for such a "plagiarism" effect in the generated results?
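For example, one simple way to quantify that copying effect is to count how many n-gram windows of a generated token sequence also appear verbatim in the training corpus (just an illustrative sketch, all names are placeholders):

```python
# Fraction of length-n token windows in a generation that also occur in training data.
# A value near 1.0 would suggest the model is mostly "copying" the training set.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def copy_rate(generated_tokens, training_songs, n=8):
    train_ngrams = set()
    for song in training_songs:                  # each song is a list of token ids
        train_ngrams |= ngrams(song, n)
    gen = ngrams(generated_tokens, n)
    if not gen:
        return 0.0
    return len(gen & train_ngrams) / len(gen)
```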
I still believe a validation/test set is needed during training. Otherwise, why bother using a SOTA model (i.e., a Transformer) at all? Why not just use a heavily overfitting CNN with far more parameters, which would give equally good results?
Regarding sampling, I believe you only used top-k/top-p/temperature-regularized sampling, right (correct me if I'm wrong)? Given the overfitting behavior, the logits would tend to be heavily concentrated on the overfit token (e.g. [1e4, 1e1, 1e-1, 1e-2]), so top-p/top-k wouldn't change much, I believe, unless you applied a very high temperature?
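For concreteness, here is a rough sketch of what I mean (illustrative numbers and placeholder names only, not your actual sampling code):

```python
import torch

# With a very peaked (overfit) distribution, top-k / top-p barely change anything;
# only a very high temperature spreads probability mass to other tokens.
logits = torch.tensor([1e4, 1e1, 1e-1, 1e-2])    # one token completely dominates

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    logits = logits / temperature
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p is not None:
        sorted_probs, idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
        keep[0] = True                            # always keep the most likely token
        mask = torch.zeros_like(probs, dtype=torch.bool).scatter(0, idx, keep)
        probs = torch.where(mask, probs, torch.zeros_like(probs))
        probs = probs / probs.sum()
    return torch.multinomial(probs, 1).item()

print(torch.softmax(logits / 1.0, dim=-1))        # ~[1, 0, 0, 0]: top-k/top-p are irrelevant
print(torch.softmax(logits / 5e3, dim=-1))        # only a huge temperature flattens it
```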
Happy to discuss!
Hi, how do I generate validation_songs.json? There seems to be no mention of it in the description of the dataset files. I would appreciate it if you could answer.