
Is there a way to skip pre-tokenizing all samples?

Open · XpastaX opened this issue on May 22, 2024 · 0 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

The code pre-tokenizes all samples before loading the model. However, when the dataset is large, this is quite slow and can cause OOM (I tried training with 200k samples and it ran out of memory).

I am wondering whether there is any support for skipping this pre-tokenization and doing it on the fly during training (roughly as in the sketch after the log below). Compared with the fine-tuning itself, it would not add much time.

Running tokenizer on dataset:  48%|███████████████████████████                                                                              | 21000/43378  [01:48<01:26, 177.43 examples/s]
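For reference, here is a minimal sketch of what I mean by on-the-fly tokenization: a plain PyTorch `Dataset` that tokenizes each sample in `__getitem__` instead of in a full preprocessing pass. The class name, model name, and arguments are placeholders for illustration, not LLaMA-Factory's actual API.

```python
# Hypothetical sketch of lazy (on-the-fly) tokenization; not LLaMA-Factory code.
from torch.utils.data import Dataset
from transformers import AutoTokenizer


class LazyTokenizedDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=1024):
        self.texts = texts          # raw strings (could also be read lazily from disk)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenization happens here, per sample, only when the DataLoader requests it,
        # so no full pre-tokenization pass (and no giant tokenized cache) is needed.
        enc = self.tokenizer(
            self.texts[idx],
            truncation=True,
            max_length=self.max_length,
        )
        enc["labels"] = enc["input_ids"].copy()
        return enc


if __name__ == "__main__":
    # "gpt2" is just a stand-in tokenizer for the example.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    ds = LazyTokenizedDataset(["Hello world.", "Another sample."], tokenizer)
    print(ds[0]["input_ids"][:10])
```

Something along these lines (or deferring tokenization via a data collator) would avoid holding the fully tokenized dataset in memory before training starts.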

Expected behavior

No response

System Info

No response

Others

No response

XpastaX · May 22 '24 12:05