Finetune_LLMs
Incorrect block size?
In your example_run.txt command line example for deepspeed, should "--block_size 2048" perhaps be set?
Without this, it looks like it's picking up the GPT-2 default of 1024 rather than GPT-J's expected 2048.
It should also be OK to leave "--tokenizer_name gpt2" off entirely, since it should then correctly initialize the default tokenizer for GPT-J. In that case, specifying the block size probably would not be needed.
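For reference, a quick check of the tokenizer defaults (a sketch assuming the standard Hugging Face hub configs for gpt2 and EleutherAI/gpt-j-6B):

```python
from transformers import AutoTokenizer

# The GPT-2 tokenizer carries a 1024-token default context length, while the
# GPT-J tokenizer should report the model's expected 2048 tokens.
print(AutoTokenizer.from_pretrained("gpt2").model_max_length)                 # expected: 1024
print(AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B").model_max_length)  # expected: 2048
```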
The way the code currently works, it creates blocks of text from the samples in the dataset that are N tokens long, as in the sketch below. Samples can overflow from one block into the next, which is fine for some tasks and not others (OK for books, not as good for more structured data).
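A minimal sketch of that concatenate-and-chunk behavior, along the lines of the grouping step in Hugging Face's standard run_clm.py; the function and argument names here are illustrative, not the exact code in this repo:

```python
from itertools import chain

def group_texts(examples, block_size=2048):
    # examples["input_ids"] is a batch of tokenized samples of varying length.
    concatenated = list(chain(*examples["input_ids"]))
    # Drop the remainder so every block is exactly block_size tokens long.
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
    # For causal LM training the labels are the inputs themselves.
    return {"input_ids": blocks, "labels": [b[:] for b in blocks]}
```

Because samples are concatenated before chunking, one entry can spill into the next block, which is the wrapping behavior described above.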
Ideally, for more structured data, padding is used for each entry rather than wrapping different entries together.
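For comparison, a per-entry version might look like this (a hypothetical sketch, not code from this repo; it assumes a dataset with a "text" field and uses the EOS token as the pad token, since the GPT-J/GPT-2 tokenizer has no pad token by default):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token  # no pad token is defined by default

def tokenize_padded(examples, block_size=2048):
    # Each entry is tokenized on its own and padded/truncated to block_size,
    # so entries never wrap into each other.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=block_size,
    )
```

For training, the padded positions would typically also be masked out of the loss (e.g. by setting their labels to -100).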
In other words, in the current state, 1024 vs. 2048 mostly just saves memory and makes it easier to run on machines with less VRAM; it doesn't change much unless one wants inputs longer than 1024 tokens. The --block_size argument allows one to raise that if they wish.
I think adding that to either the README or the example would be a good idea. Feel free to open a PR; otherwise I will add it if I have time and don't forget.
The GPT-J tokenizer is the same as the GPT-2 tokenizer. Perhaps removing that argument would make things easier, though.
I have worked with adding padding rather than wrapping, and I may update the code in the near future to have that as an option.
Closing issue. Reopen if the issue still persists.