
train-text-from-scratch.exe stops after "begin training" (tensor->src0 is null)

Entretoize opened this issue 1 year ago • 8 comments

I'm running the latest release (master-254a7a7) like this:

bin\train-text-from-scratch.exe --vocab-model models\ggml-vocab.bin --checkpoint-in chk-lamartine-256x16.bin --checkpoint-out chk-lamartine-256x16.bin --model-out ggml-lamartine-265x16-f32.bin --train-data "shakespeare.txt"

I tried with several models.

Expected Behavior

Training should run for a long time.

Current Behavior

Training stops immediately, without an error message:

D:\git\llama.cpp>bin\train-text-from-scratch.exe --vocab-model models\ggml-vocab.bin --ctx 64 --embd 256 --head 8 --layer 16 --checkpoint-in chk-lamartine-256x16.bin --checkpoint-out chk-lamartine-256x16.bin --model-out ggml-lamartine-265x16-f32.bin --train-data "alphonsedelamartine.txt" -t 6 -b 1 -n 32 --seed 2 --adam-iter 16 --print-details-interval 0 --predict 16 --use-flash
main: seed: 2
llama.cpp: loading model from models\ggml-vocab.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
main: tokenize training data
main: number of training tokens: 474
print_params: n_vocab: 32000
print_params: n_ctx:   64
print_params: n_embd:  256
print_params: n_mult:  256
print_params: n_head:  8
print_params: n_ff:    768
print_params: n_layer: 16
print_params: n_rot:   32
main: number of unique tokens: 253
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080
main: init model
load_checkpoint: Training iterations: 0.
load_checkpoint: Training samples:    0.
load_checkpoint: Training tokens:     0.
main: opt iter 0
used_mem model+cache: 242364416 bytes
main: begin training

Environment and Context

Windows 11, NVIDIA RTX 3080, Ryzen 7 2700, 32 GB RAM

Entretoize • Jun 15 '23 07:06

It's really fast now.

SlyEcho • Jun 15 '23 14:06

Funny... my fault, I should have mentioned that neither a checkpoint nor a model is created in any folder.

Entretoize • Jun 15 '23 15:06

Sorry for joking, but this tool is still very new, so it has some problems. There are quite a few open issues.

SlyEcho • Jun 15 '23 15:06

But how can I investigate the problem? Can I debug it with Visual Studio?

Entretoize • Jun 16 '23 05:06

I just tried in Visual Studio: it is tensor->src0 (and src1) that are null in ggml-cuda.cu, in the function ggml_cuda_compute_forward. Maybe that helps? It runs if I disable CUBLAS. At least it seems to; it has been running for some minutes now, but I suppose it will be very slow?
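
If it helps, I think a check at the top of that function would avoid the null dereference. A rough sketch of what I mean (based on my checkout; the exact signature and fallback behaviour may differ in other revisions):

    // Rough sketch, at the top of ggml_cuda_compute_forward in ggml-cuda.cu:
    // skip nodes that have no source tensor instead of dereferencing a null
    // src0; returning false should let the caller fall back to the CPU path.
    if (tensor->src0 == NULL) {
        return false;
    }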

Entretoize • Jun 16 '23 06:06

For me, too, it works without CUBLAS (it successfully trains the model) and does not work with CUBLAS (it quits without creating a model file).

robyngraf • Jun 16 '23 08:06

I added a check for node->src0 != NULL at line 16009 of ggml.c:

        // only run the forward pass for nodes that actually have an input tensor
        if (node->src0 != NULL) {
            ggml_compute_forward(&params, node);
        }

This call is inside a loop, and the other nodes do have a non-null src0. It doesn't crash now and seems to learn, but I don't know what the consequences of doing this are.
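
For context, the call I changed sits in the graph compute loop, roughly like this (paraphrased from memory; the surrounding scheduling code is omitted and the names in your ggml.c may differ slightly):

    // Paraphrased sketch of the surrounding loop in ggml.c, not copied verbatim:
    // every node of the computation graph is evaluated in order, and the added
    // guard skips nodes whose input tensor was never set.
    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];
        // ... task scheduling omitted ...
        if (node->src0 != NULL) {
            ggml_compute_forward(&params, node);
        }
    }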

Entretoize • Jun 16 '23 08:06

That didn't quite do it for me, but then I added similar checks to the other calls to ggml_compute_forward in ggml.c as well, and it seems to have started training now. Or at least it's getting further than it did before.
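
In case it helps anyone else, the pattern I applied at each call site was the same guard; something like this small wrapper would express it (the helper name is mine, it isn't in ggml.c):

    // Hypothetical helper, not part of ggml.c: run the forward pass only for
    // nodes that actually have an input tensor, and skip the rest to avoid
    // the null dereference seen with CUBLAS enabled.
    static void ggml_compute_forward_guarded(struct ggml_compute_params * params, struct ggml_tensor * node) {
        if (node->src0 == NULL) {
            return;
        }
        ggml_compute_forward(params, node);
    }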

robyngraf • Jun 16 '23 11:06

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] • Apr 10 '24 01:04