
Is there a way to skip pre-tokenizing all samples?

Open · XpastaX opened this issue on May 22, 2024 · 0 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

The code pre-tokenizes all samples before loading the model. However, when the dataset is large, this is quite slow and can cause OOM (I tried training with 200k samples and it ran out of memory).

I am wondering whether there is any support for skipping this pre-tokenization and doing it on the fly during training (roughly as in the sketch after the log below). Compared with the fine-tuning itself, it would not add much time.

Running tokenizer on dataset:  48%|███████████████████████████                                                                              | 21000/43378  [01:48<01:26, 177.43 examples/s]
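For reference, here is a minimal sketch of what I mean by on-the-fly tokenization: a plain PyTorch `Dataset` that tokenizes each sample in `__getitem__` instead of in a full preprocessing pass. The class name, model name, and arguments are placeholders for illustration, not LLaMA-Factory's actual API.

```python
# Hypothetical sketch of lazy (on-the-fly) tokenization; not LLaMA-Factory code.
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class LazyTokenizedDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=1024):
        self.texts = texts          # raw strings (could also be read lazily from disk)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenization happens here, per sample, only when the DataLoader requests it,
        # so no full pre-tokenization pass (and no giant tokenized cache) is needed.
        enc = self.tokenizer(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
        )
        enc["labels"] = enc["input_ids"].copy()
        return enc


if __name__ == "__main__":
    # "gpt2" is just a stand-in tokenizer for the example.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    ds = LazyTokenizedDataset(["Hello world.", "Another sample."], tokenizer)
    print(ds[0]["input_ids"][:10])
```

Something along these lines (or deferring tokenization via a data collator) would avoid holding the fully tokenized dataset in memory before training starts.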

Expected behavior

No response

System Info

No response

Others

No response

XpastaX · May 22 '24 12:05