Finetune_LLMs
Incorrect block size?
In your example_run.txt command line example for deepspeed, should "--block_size 2048" perhaps be set?
Without this, it looks like it's picking up the GPT-2 default of 1024 rather than GPT-J's expected 2048.
It should also be OK to leave "--tokenizer_name gpt2" off entirely, since it should then correctly initialize the default tokenizer for GPT-J. In that case, specifying the block size probably would not be needed.
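For reference, a quick check of the tokenizer defaults (a sketch assuming the standard Hugging Face hub configs for gpt2 and EleutherAI/gpt-j-6B):

```python
from transformers import AutoTokenizer

# The GPT-2 tokenizer carries a 1024-token default context length, while the
# GPT-J tokenizer should report the model's expected 2048 tokens.
print(AutoTokenizer.from_pretrained("gpt2").model_max_length)                 # expected: 1024
print(AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B").model_max_length)  # expected: 2048
```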
The way the code currently works, it creates blocks of text from the samples in the dataset that are N tokens long, as in the sketch below. Samples can overflow from one block into the next, which is fine for some tasks and not others (OK for books, not as good for more structured data).
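A minimal sketch of that concatenate-and-chunk behavior, along the lines of the grouping step in Hugging Face's standard run_clm.py; the function and argument names here are illustrative, not the exact code in this repo:

```python
from itertools import chain

def group_texts(examples, block_size=2048):
    # examples["input_ids"] is a batch of tokenized samples of varying length.
    concatenated = list(chain(*examples["input_ids"]))
    # Drop the remainder so every block is exactly block_size tokens long.
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
    # For causal LM training the labels are the inputs themselves.
    return {"input_ids": blocks, "labels": [b[:] for b in blocks]}
```

Because samples are concatenated before chunking, one entry can spill into the next block, which is the wrapping behavior described above.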
Ideally, for more structured data, padding is used for each entry rather than wrapping different entries together.
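For comparison, a per-entry version might look like this (a hypothetical sketch, not code from this repo; it assumes a dataset with a "text" field and uses the EOS token as the pad token, since the GPT-J/GPT-2 tokenizer has no pad token by default):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer.pad_token = tokenizer.eos_token  # no pad token is defined by default

def tokenize_padded(examples, block_size=2048):
    # Each entry is tokenized on its own and padded/truncated to block_size,
    # so entries never wrap into each other.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=block_size,
    )
```

For training, the padded positions would typically also be masked out of the loss (e.g. by setting their labels to -100).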
In other words, in the current state, 1024 vs. 2048 mostly just saves memory and makes it easier to run on machines with less VRAM; it doesn't change much unless one wants inputs longer than 1024 tokens. The --block_size argument allows one to raise that if they wish.
I think adding that to either the README or the example would be a good idea. Feel free to open a PR; otherwise I will add it if I have time and don't forget.
The GPT-J tokenizer is the same as the GPT-2 tokenizer. Perhaps removing that argument would make things easier, though.
I have worked with adding padding rather than wrapping, and I may update the code in the near future to have that as an option.
Closing issue. Reopen if the issue still persists.