llama.cpp
[User] Training examples sometimes get broken when training data is in Japanese
This is an issue to track the problem reported at https://github.com/ggerganov/llama.cpp/pull/1652#issuecomment-1586381277.
Expected Behavior
No � characters in the examples.
Current Behavior
The examples sometimes contain � characters (which aren't in the training data).
Failure Information
Example 0 during Training
Output of the trained model
Steps to Reproduce
Try to train using this training data: dataset.txt
@Igoorx Is it possible to provide instructions on how you train on the dataset? Is there a tutorial of sorts you are following online?
@kolinfluence You can find the instructions for training here: https://github.com/ggerganov/llama.cpp/blob/master/examples/train-text-from-scratch/README.md
But this is the command I used:
train-text-from-scratch --vocab-model "chronos-13b.ggmlv3.q4_0.bin" --ctx 64 --embd 768 --head 12 --layer 6 --checkpoint-in chk-jp-256x16.bin --checkpoint-out chk-jp-256x16.bin --model-out ggml-jp-256x16-f32.bin --train-data "dataset.txt" -t 6 -b 16 -n 32 --seed 1 --adam-iter 16 --print-details-interval 1 --predict 16 --use-flash --mem-model 5
I don't know exactly how the trainer works, but I assume it breaks the input into batches of tokens and runs the training on those.
The way the tokenizer in stock LLaMA works, a lot of non-Latin text gets split into multiple tokens per character. In that case, a batch cannot always be converted back into a string losslessly, ~~but for the model the only thing that should matter is that the tokens come in the right order~~ (never mind, I saw the inference example).
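To illustrate the point above, here is a minimal Python sketch. It assumes a byte-level fallback where every UTF-8 byte becomes its own token (so a single Japanese character spans three tokens); cutting that token stream at an arbitrary batch boundary and decoding each batch on its own then produces the � (U+FFFD) characters seen in the report. The text and batch size are made up for the example.

```python
# Minimal sketch: treat each UTF-8 byte as one token (byte-fallback assumption)
# and cut the stream into fixed-size batches. Decoding a batch on its own can
# split a multi-byte character, which shows up as the U+FFFD "�" character.

text = "こんにちは"                       # 5 characters, 15 UTF-8 bytes
byte_tokens = list(text.encode("utf-8"))  # pretend every byte is its own token

batch_size = 4
batches = [byte_tokens[i:i + batch_size]
           for i in range(0, len(byte_tokens), batch_size)]

for i, batch in enumerate(batches):
    # errors="replace" mirrors what a viewer shows when it hits invalid UTF-8
    print(f"batch {i}: {bytes(batch).decode('utf-8', errors='replace')}")
```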
It shouldn't learn to generate broken UTF-8... I'm not sure how to handle this; maybe the batching could be smarter and insert padding tokens instead (see the sketch below).
Ultimately, a better tokenizer for the target language would be best.
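A rough sketch of the padding idea mentioned above, under the same byte-per-token assumption as before. This is purely hypothetical and not how train-text-from-scratch currently batches; PAD is an invented pad-token id, not a llama.cpp constant. The boundary is backed up to the last complete UTF-8 character and the remainder of the batch is filled with padding, so no batch ends mid-character.

```python
# Hypothetical sketch of UTF-8-safe batching with padding. PAD is an invented
# pad-token id, and "one token per byte" is an assumption for illustration.

PAD = 0

def utf8_safe_batches(byte_tokens, batch_size):
    batches, start = [], 0
    while start < len(byte_tokens):
        end = min(start + batch_size, len(byte_tokens))
        # A UTF-8 continuation byte looks like 10xxxxxx; back up until the cut
        # lands on a character boundary.
        while end < len(byte_tokens) and (byte_tokens[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:  # batch smaller than one character: fall back to a raw cut
            end = min(start + batch_size, len(byte_tokens))
        batch = byte_tokens[start:end]
        batch += [PAD] * (batch_size - len(batch))   # pad to a fixed length
        batches.append(batch)
        start = end
    return batches

tokens = list("こんにちは".encode("utf-8"))
for i, b in enumerate(utf8_safe_batches(tokens, 8)):
    text = bytes(t for t in b if t != PAD).decode("utf-8")
    print(f"batch {i}: {text!r}")   # every batch decodes cleanly, no �
```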
This issue was closed because it has been inactive for 14 days since being marked as stale.