Transformers.jl
Example Code always produces Max Length Sequences
Thank you for this great package!
I tried modifying the example "Copy Task" code so that it has a 50% chance of producing a 9-token string and otherwise produces a 10-token string:
```julia
sample_data() = (d = join(map(string, rand(1:10, (rand() < 0.5 ? 9 : 10))), ' '); (d, d))
```
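Just to rule out the sampler itself, a quick check (illustrative only) shows the two lengths really do come out at roughly 50/50:

```julia
# Sanity check: with the sample_data() defined above, both lengths
# show up about half the time each.
lengths = [length(split(first(sample_data()))) for _ in 1:1000]
count(==(9), lengths), count(==(10), lengths)   # ≈ (500, 500)
```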
When I train this, the model learns to always produce a 10-token string.
I originally noticed this when I changed the code to only produce 1- or 2-token sequences; after training it would likewise only ever produce 2-token sequences. I suspect there is some issue with masking or maybe with the loss function, but I haven't figured it out yet.
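In case it helps narrow things down, here is roughly what I would expect a padding-aware loss to look like. This is only a generic sketch with placeholder names (`logits`, `target`, `pad_id`), not a claim about how Transformers.jl actually computes its loss:

```julia
using Flux

# Generic sketch of masking padded positions out of the cross-entropy loss.
# logits: (vocab, seq_len, batch); target: (seq_len, batch) of token ids.
function masked_loss(logits, target, pad_id)
    mask = target .!= pad_id                              # true at real tokens
    onehot = Flux.onehotbatch(target, 1:size(logits, 1))
    per_pos = -sum(onehot .* Flux.logsoftmax(logits; dims=1); dims=1)
    # zero out padded positions and average over real tokens only
    sum(per_pos .* reshape(mask, 1, size(mask)...)) / sum(mask)
end
```

If padded positions are instead counted as real targets, the model is rewarded for always filling out the maximum length, which would match the behaviour I'm seeing.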
FWIW, the loss never gets extremely low (~1e-5) like it does when training only on 10-token sequences; instead it plateaus at about 0.5.