llama.cpp
train-text-from-scratch.exe stop after "begin training" (tensor->src0 is null)
I'm running the latest release (master-254a7a7) like this:
bin\train-text-from-scratch.exe --vocab-model models\ggml-vocab.bin --checkpoint-in chk-lamartine-256x16.bin --checkpoint-out chk-lamartine-256x16.bin --model-out ggml-lamartine-265x16-f32.bin --train-data "shakespeare.txt"
I tried with several models.
Expected Behavior
Training should run for a long time.
Current Behavior
Training stops immediately without any error:
D:\git\llama.cpp>bin\train-text-from-scratch.exe --vocab-model models\ggml-vocab.bin --ctx 64 --embd 256 --head 8 --layer 16 --checkpoint-in chk-lamartine-256x16.bin --checkpoint-out chk-lamartine-256x16.bin --model-out ggml-lamartine-265x16-f32.bin --train-data "alphonsedelamartine.txt" -t 6 -b 1 -n 32 --seed 2 --adam-iter 16 --print-details-interval 0 --predict 16 --use-flash
main: seed: 2
llama.cpp: loading model from models\ggml-vocab.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 1 (mostly F16)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
main: tokenize training data
main: number of training tokens: 474
print_params: n_vocab: 32000
print_params: n_ctx: 64
print_params: n_embd: 256
print_params: n_mult: 256
print_params: n_head: 8
print_params: n_ff: 768
print_params: n_layer: 16
print_params: n_rot: 32
main: number of unique tokens: 253
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080
main: init model
load_checkpoint: Training iterations: 0.
load_checkpoint: Training samples: 0.
load_checkpoint: Training tokens: 0.
main: opt iter 0
used_mem model+cache: 242364416 bytes
main: begin training
Environment and Context
Windows 11, NVIDIA GeForce RTX 3080, AMD Ryzen 7 2700, 32 GB RAM
It's really fast now.
Funny... my fault, I should have added that no checkpoint or model file is created in any folder.
Sorry for joking, but this tool is still very new, so it has some problems. There are quite a few issues.
But how can I investigate the problem? Can I debug it with Visual Studio?
I just tried in Visual Studio: it is tensor->src0 (and src1) that are null in ggml-cuda.cu, in the function ggml_cuda_compute_forward. Maybe that helps? It runs if I disable CUBLAS. At least it seems to; it has been running for some minutes now, but I suppose it will be very slow?
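For reference, here is a minimal sketch of the kind of guard that could sit at the top of ggml_cuda_compute_forward in ggml-cuda.cu. The helper name and the exact hook point are assumptions, not the actual fix; the only idea is to stop the CUDA path from dereferencing a null src0 on leaf nodes:

#include <stdbool.h>
#include <stddef.h>
#include "ggml.h" // struct ggml_tensor still exposes src0/src1 in this revision

// Hypothetical helper: a graph node is only worth handing to the CUDA path
// if it actually has an input tensor. Leaf tensors (weights, inputs) have
// src0 == NULL, and dereferencing that appears to be what aborts training.
static bool ggml_cuda_node_has_inputs(const struct ggml_tensor * node) {
    return node != NULL && node->src0 != NULL;
}

// Inside ggml_cuda_compute_forward, before anything touches tensor->src0:
//     if (!ggml_cuda_node_has_inputs(tensor)) {
//         return false; // assuming false means "not handled here, use the CPU path"
//     }

If a false return really does fall back to the normal CPU path, this would make the CUBLAS build behave like the working non-CUBLAS build for those nodes.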
For me it also works without CUBLAS (it successfully trains the model) and does not work with CUBLAS (it quits without creating a model file).
I added a null check at line 16009 of ggml.c:

if (node->src0 != NULL)
    ggml_compute_forward(&params, node);
Since it is in a loop and the other nodes have a non-null src0, that node is simply skipped. It doesn't crash now and seems to learn, but I don't know the consequences of doing that.
That didn't quite do it for me, but then I added similar checks to the other calls to ggml_compute_forward in ggml.c as well, and it seems to have started training now. Or at least it's getting further than it did before.
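For anyone trying the same workaround, here is a sketch of what those extra checks could look like, written as a single small wrapper instead of editing every call site by hand. It is meant to live inside ggml.c next to the existing static functions, and the wrapper name is made up for illustration:

// Hypothetical wrapper for the ggml_compute_forward call sites in
// ggml_graph_compute: skip leaf nodes (no src0) in every pass instead of
// guarding only one line.
static void ggml_compute_forward_checked(struct ggml_compute_params * params,
        struct ggml_tensor * node) {
    if (node == NULL || node->src0 == NULL) {
        return; // leaf tensor: nothing to compute, and nothing to dereference
    }
    ggml_compute_forward(params, node);
}

// Each existing call such as
//     ggml_compute_forward(&params, node);
// then becomes
//     ggml_compute_forward_checked(&params, node);

Whether silently skipping those nodes is actually correct, rather than just hiding the null dereference, is still the open question mentioned above.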
This issue was closed because it has been inactive for 14 days since being marked as stale.