Transformers.jl
Example Code always produces Max Length Sequences
Thank you for this great package!
I tried modifying the example "Copy Task" code so that it has a 50% chance of producing a 9-token string and otherwise produces a 10-token string:
```julia
sample_data() = (d = join(map(string, rand(1:10, (rand() < 0.5 ? 9 : 10))), ' '); (d, d))
```
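Just to rule out the sampler itself, a quick check (illustrative only) shows the two lengths really do come out at roughly 50/50:

```julia
# Sanity check: with the sample_data() defined above, both lengths
# show up about half the time each.
lengths = [length(split(first(sample_data()))) for _ in 1:1000]
count(==(9), lengths), count(==(10), lengths)   # ≈ (500, 500)
```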
When I train this, the model learns to always produce a 10-token string.
I originally noticed this when I changed the code to only produce 1- or 2-token sequences; after training it would likewise only ever produce 2-token sequences. I suspect there is some issue with masking or maybe with the loss function, but I haven't figured it out yet.
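In case it helps narrow things down, here is roughly what I would expect a padding-aware loss to look like. This is only a generic sketch with placeholder names (`logits`, `target`, `pad_id`), not a claim about how Transformers.jl actually computes its loss:

```julia
using Flux

# Generic sketch of masking padded positions out of the cross-entropy loss.
# logits: (vocab, seq_len, batch); target: (seq_len, batch) of token ids.
function masked_loss(logits, target, pad_id)
    mask = target .!= pad_id                              # true at real tokens
    onehot = Flux.onehotbatch(target, 1:size(logits, 1))
    per_pos = -sum(onehot .* Flux.logsoftmax(logits; dims=1); dims=1)
    # zero out padded positions and average over real tokens only
    sum(per_pos .* reshape(mask, 1, size(mask)...)) / sum(mask)
end
```

If padded positions are instead counted as real targets, the model is rewarded for always filling out the maximum length, which would match the behaviour I'm seeing.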
FWIW, the loss never gets extremely low (~1e-5) like it does when training only on 10-token sequences; instead it plateaus at about 0.5.