
[User] Training examples sometimes get broken when training data is in Japanese

Igoorx opened this issue 2 years ago • 3 comments

This is an issue to track the problem reported at https://github.com/ggerganov/llama.cpp/pull/1652#issuecomment-1586381277.

Expected Behavior

No broken characters in the examples.

Current Behavior

The examples sometimes contain broken characters (which aren't in the training data).

Failure Information

Example 0 during training:

[screenshot]

Output of the trained model:

[screenshot]

Steps to Reproduce

Try to train using this training data: dataset.txt

Igoorx · Jun 13 '23 20:06

@Igoorx could you provide the instructions on how you trained on the dataset? Is there a tutorial of sorts you are following online?

kolinfluence · Jun 14 '23 13:06

@kolinfluence You can find the instructions for training here: https://github.com/ggerganov/llama.cpp/blob/master/examples/train-text-from-scratch/README.md

But this is the command that I used:

train-text-from-scratch --vocab-model "chronos-13b.ggmlv3.q4_0.bin" --ctx 64 --embd 768 --head 12 --layer 6 --checkpoint-in chk-jp-256x16.bin --checkpoint-out chk-jp-256x16.bin --model-out ggml-jp-256x16-f32.bin --train-data "dataset.txt" -t 6 -b 16 -n 32 --seed 1 --adam-iter 16 --print-details-interval 1 --predict 16 --use-flash --mem-model 5
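To confirm that the broken characters really are invalid UTF-8 in the generated text, and not just a rendering quirk of the terminal, a small standalone checker can help. This is a minimal sketch, not part of llama.cpp; the file name `model-output.txt` is a placeholder for wherever the model's output was saved.

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Return the length of the UTF-8 sequence starting at byte i, or 0 if it is invalid or truncated.
static size_t utf8_seq_len(const std::string & s, size_t i) {
    const unsigned char c = s[i];
    size_t len;
    if      (c < 0x80)           len = 1;
    else if ((c & 0xE0) == 0xC0) len = 2;
    else if ((c & 0xF0) == 0xE0) len = 3;
    else if ((c & 0xF8) == 0xF0) len = 4;
    else return 0;                          // stray continuation byte or invalid lead byte
    if (i + len > s.size()) return 0;       // sequence runs past the end of the data
    for (size_t k = 1; k < len; k++) {
        if ((s[i + k] & 0xC0) != 0x80) return 0; // missing continuation byte
    }
    return len;
}

int main() {
    // Placeholder path: the text generated by the trained model.
    std::ifstream f("model-output.txt", std::ios::binary);
    const std::string data((std::istreambuf_iterator<char>(f)), std::istreambuf_iterator<char>());

    for (size_t i = 0; i < data.size(); ) {
        const size_t len = utf8_seq_len(data, i);
        if (len == 0) {
            printf("invalid UTF-8 byte 0x%02X at offset %zu\n", (unsigned)(unsigned char) data[i], i);
            i++;        // skip the bad byte and keep scanning
        } else {
            i += len;
        }
    }
    return 0;
}
```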

Igoorx · Jun 14 '23 19:06

I don't know exactly how the trainer works, but I assume it breaks the input into batches of tokens and runs the training on those.

The way the tokenizer in stock LLaMA works, a lot of non-Latin text gets split into multiple tokens per character. In that case, such a batch cannot be converted back into a string losslessly, ~~but for the model the only thing that should matter is that the tokens come in the right order~~ (never mind, I saw the inference example).
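To illustrate the point, here is a standalone sketch (not llama.cpp code): with byte-level fallback, a three-byte character such as 日 becomes three single-byte tokens, and a batch boundary that lands between them leaves a byte sequence that no longer decodes as UTF-8. The split point below is an assumed example.

```cpp
#include <cstdio>
#include <string>

// Return true if s is structurally valid UTF-8 (complete sequences, correct continuation bytes).
static bool is_valid_utf8(const std::string & s) {
    for (size_t i = 0; i < s.size(); ) {
        const unsigned char c = s[i];
        size_t len;
        if      (c < 0x80)           len = 1;
        else if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        else return false;                    // invalid lead byte
        if (i + len > s.size()) return false; // truncated sequence
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
        }
        i += len;
    }
    return true;
}

int main() {
    // "日" is the three bytes E6 97 A5 in UTF-8; with byte fallback, that is three tokens.
    const std::string nichi = "\xE6\x97\xA5";

    // Assume the batch boundary falls after the first two byte tokens of the character.
    const std::string batch_tail = nichi.substr(0, 2);

    printf("whole character valid UTF-8:  %s\n", is_valid_utf8(nichi)      ? "yes" : "no"); // yes
    printf("truncated batch valid UTF-8:  %s\n", is_valid_utf8(batch_tail) ? "yes" : "no"); // no -> shows up as broken characters
    return 0;
}
```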

The model shouldn't learn to generate broken UTF-8... I'm not sure how to handle this; maybe the batching could be smarter and insert padding tokens instead of splitting characters.
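A rough sketch of what that padding idea could look like (an assumption only: train-text-from-scratch does not necessarily batch this way, and the token ids and `PAD_ID` below are made up for illustration): stop each batch at the last complete character and pad the remainder, so no batch ever ends mid-character.

```cpp
#include <cstdio>
#include <vector>

// One token of a byte-fallback tokenization. `ends_char` marks tokens after which
// the detokenized text is complete UTF-8 (i.e. a character boundary).
struct Tok {
    int  id;
    bool ends_char;
};

static const int PAD_ID = 0; // hypothetical padding token id

// Fill a fixed-size batch starting at `pos`, but never end it in the middle of a character:
// drop the trailing partial character and pad the batch instead.
static std::vector<int> fill_batch(const std::vector<Tok> & toks, size_t & pos, size_t n_batch) {
    std::vector<int> batch;
    size_t last_complete = 0; // batch length at the most recent character boundary
    for (size_t p = pos; p < toks.size() && batch.size() < n_batch; p++) {
        batch.push_back(toks[p].id);
        if (toks[p].ends_char) last_complete = batch.size();
    }
    batch.resize(last_complete);   // cut back to the last complete character
    pos += last_complete;          // the next batch resumes at that character boundary
    while (batch.size() < n_batch) batch.push_back(PAD_ID);
    return batch;
}

int main() {
    // "日本" as byte-fallback tokens, three per character; the ids here are just the raw byte values.
    const std::vector<Tok> toks = {
        {230, false}, {151, false}, {165, true},  // 日 = E6 97 A5
        {230, false}, {156, false}, {172, true},  // 本 = E6 9C AC
    };
    size_t pos = 0;
    const std::vector<int> b = fill_batch(toks, pos, /*n_batch=*/4); // a naive cut would split 本
    for (int id : b) printf("%d ", id); // prints: 230 151 165 0
    printf("\n");
    return 0;
}
```

The trade-off is wasted batch capacity on padding, but the training examples would always round-trip to valid text.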

Ultimately, a better tokenizer for the target language would be best.

SlyEcho · Jun 15 '23 15:06

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Apr 10 '24 01:04