LLaMA-Factory
Is there a way to skip pre-tokenizing all samples?
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
The code pre-tokenizes all samples before loading the model. However, when the dataset is large, this is quite slow and can cause OOM (I tried training with 200k samples and it ran out of memory).
I am wondering whether there is any support for skipping this pre-tokenization and doing it on the fly during training? Compared with the fine-tuning itself, it would not cost much time (see the sketch after the log below for the kind of lazy tokenization I mean).
Running tokenizer on dataset: 48%|███████████████████████████ | 21000/43378 [01:48<01:26, 177.43 examples/s]
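For reference, here is a minimal sketch of the kind of on-the-fly tokenization I have in mind, using Hugging Face `datasets` and `set_transform`. The column name, tokenizer, and max length are placeholders, not LLaMA-Factory's actual configuration:

```python
# Minimal sketch of on-the-fly tokenization with Hugging Face datasets.
# The "text" column, the "gpt2" tokenizer, and max_length are placeholders,
# not LLaMA-Factory's actual settings.
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = Dataset.from_dict({"text": ["example sample one", "example sample two"]})

def tokenize_batch(batch):
    # Runs lazily each time rows are accessed (e.g. by the DataLoader),
    # so there is no up-front pass over all samples and no tokenized
    # copy of the whole dataset held in memory.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

# Replaces the usual dataset.map(tokenize_fn, ...) pre-processing step.
dataset.set_transform(tokenize_batch)

print(dataset[0].keys())  # input_ids / attention_mask produced on access
```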
Expected behavior
No response
System Info
No response
Others
No response