
Incorrect block size?

jdwx opened this issue 2 years ago • 2 comments

In your example_run.txt command line example for deepspeed, should "--block_size 2048" perhaps be set?

Without this, it looks like it's picking up the GPT-2 default of 1024 rather than GPT-J's expected 2048.

It should also be OK to leave "--tokenizer_name gpt2" off entirely, since the script should then correctly fall back to the GPT-J tokenizer's defaults. In that case, specifying the block size probably would not be needed.
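
For what it's worth, the 1024 seems to come from the default block-size selection in Hugging Face's run_clm.py-style training scripts, which this repo's script appears to follow. A minimal sketch of that logic (the function name is mine, not from the repo):

```python
def pick_block_size(requested_block_size, tokenizer):
    """Sketch of the block-size fallback in run_clm.py-style scripts.

    Illustrative only; whether this repo's script matches it exactly
    is an assumption.
    """
    if requested_block_size is None:
        # No --block_size given: start from the tokenizer's reported maximum,
        # but cap the default at 1024 to stay conservative on memory.
        block_size = tokenizer.model_max_length
        if block_size > 1024:
            block_size = 1024
    else:
        # --block_size given explicitly: use it, clamped to the tokenizer's maximum.
        block_size = min(requested_block_size, tokenizer.model_max_length)
    return block_size
```

Note that under this logic the gpt2 tokenizer reports a model_max_length of 1024, so the effective block size would be clamped to 1024 even with "--block_size 2048"; reaching 2048 would also require the GPT-J tokenizer.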

jdwx avatar Mar 20 '22 21:03 jdwx

The way the code currently works is that it creates blocks of text from the samples in the dataset that are N tokens long. Separate samples can overflow into the same block, which is fine for some tasks and not others (OK for books, not as good for more structured data).
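
As an illustration of that wrapping behaviour, here is a hedged sketch modelled on the group_texts helper used in Hugging Face's run_clm.py-style scripts (not a verbatim copy of this repo's code):

```python
from itertools import chain

def group_texts(examples, block_size=1024):
    """Concatenate tokenized samples and split them into fixed-size blocks.

    Separate dataset entries get packed into the same block, which is fine
    for book-like text but blurs boundaries for structured data.
    """
    # Flatten each field (e.g. input_ids, attention_mask) across the batch.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so every block is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training, labels are a copy of the input ids.
    result["labels"] = result["input_ids"].copy()
    return result
```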

Ideally, for more structured data, padding would be used for each entry rather than wrapping different entries together.
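
A minimal sketch of that padding alternative, assuming a Hugging Face tokenizer and a dataset with a "text" field (the checkpoint name and field name are assumptions, not from the repo):

```python
from transformers import AutoTokenizer

# Tokenize each record on its own and pad/truncate it to block_size,
# instead of wrapping it together with neighbouring records.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token  # GPT-J/GPT-2 define no pad token by default

def tokenize_padded(example, block_size=2048):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=block_size,
        padding="max_length",
    )
```

In practice the pad positions would also need to be masked out of the loss (e.g. by setting their label ids to -100), which is the extra work this option would involve.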

In other words, in the current state, 1024 vs 2048 mostly just saves memory and makes it easier to run on machines with less VRAM, and it doesn't affect much unless one wants inputs over 1024 tokens. The --block_size argument lets one raise that if they wish.

I think adding that to either the README or the example would be a good idea. Feel free to open a PR; otherwise I will add it if I have time and don't forget.

The GPT-J tokenizer is the same as the GPT-2 tokenizer. Perhaps removing that argument would make things easier, though.
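
A quick, hedged way to check that equivalence (the checkpoint names are assumptions about which models are in play):

```python
from transformers import AutoTokenizer

# Sanity check that the GPT-2 and GPT-J tokenizers encode text identically.
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gptj_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

sample = "DeepSpeed finetuning with a 2048-token block size."
assert gpt2_tok.encode(sample) == gptj_tok.encode(sample)
```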

mallorbc avatar Mar 20 '22 21:03 mallorbc

I have worked with adding padding rather than wrapping, and may update the code in the near future to offer that as an option.

mallorbc avatar Mar 20 '22 21:03 mallorbc

Closing issue. Reopen if the issue still persists.

mallorbc avatar Mar 15 '23 01:03 mallorbc