Add back support for longest sequence first
@awaelchli Semi-related to this PR. I just noticed that we no longer have the code to run the longest sample at the beginning of training: https://github.com/Lightning-AI/litgpt/blob/globals/finetune/lora.py#L268-L270 Should we add it back? It's useful for triggering an OOM as early as possible. If not, let's drop the `longest_seq_ix` variable entirely.
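For context, a minimal sketch of the kind of early-OOM check being discussed: run one forward/backward pass on the longest sample before real training starts, so that peak memory is exercised on step zero. All names here (`sanity_check_longest_sample`, the batch keys, the loss setup) are hypothetical placeholders, not the actual code at the linked lines.

```python
import torch


def sanity_check_longest_sample(model, dataset, longest_seq_ix: int) -> None:
    # Hypothetical sketch: do a full forward/backward on the longest sample
    # up front so an out-of-memory error surfaces immediately, not hours in.
    batch = dataset[longest_seq_ix]
    input_ids = batch["input_ids"].unsqueeze(0)
    targets = batch["labels"].unsqueeze(0)
    logits = model(input_ids)
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100
    )
    loss.backward()  # allocates activation and gradient memory at peak size
    model.zero_grad(set_to_none=True)  # discard the throwaway gradients
```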
I'm fine with returning the longest element first if that's possible to implement in the SFTDataset. I would also move the responsibility of selecting the longest sample to the datamodule/dataset, so that this logic doesn't have to live in the script, and expose a simple method/attribute for the `longest_seq_length`. That way we can precompute it while loading the dataset and don't have to iterate over the whole dataset a second time.
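One way this could look, as an illustrative sketch only (this is not litgpt's actual `SFTDataset`, and the constructor arguments are assumptions): compute the lengths once while loading, expose `longest_seq_length` and `longest_seq_ix` as attributes, and optionally reorder so the longest sample comes first.

```python
from torch.utils.data import Dataset


class SFTDatasetSketch(Dataset):
    """Hypothetical dataset that precomputes the longest sample on load."""

    def __init__(self, samples: list[dict], longest_first: bool = False):
        # Single pass over the data at load time; no second iteration needed.
        lengths = [len(s["input_ids"]) for s in samples]
        self.longest_seq_ix = max(range(len(lengths)), key=lengths.__getitem__)
        self.longest_seq_length = lengths[self.longest_seq_ix]
        if longest_first:
            # Move the longest sample to index 0 so the very first training
            # step exercises peak memory and OOMs as early as possible.
            samples = [samples[self.longest_seq_ix]] + [
                s for i, s in enumerate(samples) if i != self.longest_seq_ix
            ]
        self.samples = samples

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, ix: int) -> dict:
        return self.samples[ix]
```

Note that returning the longest element first only helps if the dataloader doesn't shuffle; with a shuffling sampler, the datamodule would instead expose `longest_seq_ix`/`longest_seq_length` and let the script run the up-front check itself.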
From https://github.com/Lightning-AI/litgpt/pull/1179#discussion_r1538392383