
MultiGPU training hangs

Open chinthysl opened this issue 2 months ago • 9 comments

mpirun with multiple GPUs hangs right after allocating 474 MiB for the master copy of params, most probably due to the introduction of CUDA streams. @karpathy @PeterZhizhin

chinthysl avatar May 06 '24 06:05 chinthysl

Bleh, I can reproduce this 🤦‍♂️. It would most likely be helpful to harden our CI to include multi-GPU training, as it currently falls outside of it. iirc @Ricardicus helped set it up, not sure if he has thoughts on how to extend it.

karpathy avatar May 06 '24 10:05 karpathy

Hardening the CI would be great, but the CI here on GitHub workflows unfortunately only reaches as far as the CPU tests. I don't see a way to include real GPU tests automatically up here. It would be different if we had our own hardware we could connect to and run jobs on, but that would have to be a different solution.

Ricardicus avatar May 06 '24 10:05 Ricardicus

Hmm, I can't immediately think of a reason why CUDA streams would break this; NCCL is still on the default stream, which should keep everything synchronised both ways: https://docs.nvidia.com/cuda/cuda-runtime-api/stream-sync-behavior.html

One way to roughly restore the previous behaviour would be to replace this:

    cudaCheck(cudaStreamCreate(&main_stream));
    cudaEventCreateWithFlags(&main_event, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&loss_event, cudaEventDisableTiming);
    for (int i = 0; i < num_parallel_streams; i++) {
        cudaCheck(cudaStreamCreate(&parallel_streams[i]));
        cudaEventCreateWithFlags(&parallel_events[i], cudaEventDisableTiming);
    }

with this:

    main_stream = 0;
    cudaEventCreateWithFlags(&main_event, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&loss_event, cudaEventDisableTiming);
    for (int i = 0; i < num_parallel_streams; i++) {
        parallel_streams[i] = 0;
        cudaEventCreateWithFlags(&parallel_events[i], cudaEventDisableTiming);
    }

I don't have a multi-GPU setup to give this a try right now though...
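Another option (just a generic sketch that I haven't tested, with the stream/event names assumed from the snippet above) would be to keep the extra streams but order them against the default stream explicitly with events; unlike relying on legacy default-stream semantics, this would also hold up if the streams were ever created with cudaStreamNonBlocking:

    // sketch only: make the default stream (where NCCL runs) wait for work queued on
    // main_stream, then make main_stream wait for the communication to finish
    cudaEvent_t compute_done, comm_done;
    cudaCheck(cudaEventCreateWithFlags(&compute_done, cudaEventDisableTiming));
    cudaCheck(cudaEventCreateWithFlags(&comm_done, cudaEventDisableTiming));

    // ... kernels producing the gradients are launched on main_stream ...
    cudaCheck(cudaEventRecord(compute_done, main_stream));
    cudaCheck(cudaStreamWaitEvent(/*default stream*/ 0, compute_done, 0));

    // ... ncclAllReduce(...) issued on the default stream ...

    cudaCheck(cudaEventRecord(comm_done, /*default stream*/ 0));
    cudaCheck(cudaStreamWaitEvent(main_stream, comm_done, 0));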

ademeure avatar May 06 '24 14:05 ademeure

@ademeure - still hangs with the above changes.

rosslwheeler avatar May 06 '24 19:05 rosslwheeler

@Ricardicus @ademeure still hangs. Changing the NCCL stream to main_stream doesn't help either. Will look further into the multi-stream usage and the memcpyAsyncs.
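For reference, the generic pattern to audit there (a hypothetical sketch with made-up names, not the actual llm.c code) is an async copy whose consumer never waits on the stream it was issued on:

    // the device-to-host copy is asynchronous, so the host can read cpu_losses
    // before the copy finishes unless its stream is synchronized first
    cudaCheck(cudaMemcpyAsync(cpu_losses, gpu_losses, B * T * sizeof(float),
                              cudaMemcpyDeviceToHost, side_stream));
    cudaCheck(cudaStreamSynchronize(side_stream)); // without this, the loop below races the copy
    float mean_loss = 0.0f;
    for (int i = 0; i < B * T; i++) { mean_loss += cpu_losses[i]; }
    mean_loss /= (float)(B * T);

The same applies to any host buffer handed to MPI or NCCL: the copy that fills it has to be complete before the collective is issued.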

chinthysl avatar May 07 '24 03:05 chinthysl

FWIW, commit 6c179fa from 2 days ago works fine with -DENABLE_BF16 -DMULTI_GPU (OpenMPI, NCCL, no cuDNN) and gives about 225k tok/s for the gpt2-x model on TinyStories on a single H100 node of 8 GPUs, using a batch size of 10.

(This commit also works fine with cuDNN enabled, and it can fit up to a batch size of 18, but the throughput is lower in my hands; I'm not using the latest cuDNN, so I'm not sure how meaningful my cuDNN numbers are.)

pjj avatar May 07 '24 03:05 pjj

@karpathy @Ricardicus - I am asking about getting an Nvidia GPU CI runner for us. Will let you know as soon as I have some more info.

rosslwheeler avatar May 07 '24 04:05 rosslwheeler

In case it helps anyone else figure this out, the exact point that breaks multi-GPU training for me on an 8-GPU node in today's TOT is the first call to malloc_and_point, via malloc_and_point_activations:

void* malloc_and_point(floatX** targets[], const size_t* act_sizes, size_t n) {
    size_t num_activations = 0;
    for (size_t i = 0; i < n; i++) {
        num_activations += act_sizes[i];
    }
    void* acts_memory;
    cudaCheck(cudaMalloc((void**)&acts_memory, num_activations * sizeof(floatX)));

that comes from the initialization of the activations below:

    // allocate space for all the activations if needed (done here, lazily)
    if(model->acts_memory == NULL) {
        // record the current B,T as well
        model->batch_size = B;
        model->seq_len = T;
        // allocate the space
        fill_in_activation_sizes(model->act_sizes, B, T, model->config);
        size_t num_activations = 0;
        for (size_t i = 0; i < NUM_ACTIVATION_TENSORS; i++) {
            num_activations += model->act_sizes[i];
        }
        model->num_activations = num_activations;
        model->acts_memory = malloc_and_point_activations(&model->acts, model->act_sizes);

The actual bug may be anywhere before this point. Watching the node via nvtop, it appears as if one of the processes using a GPU spawns another 7 of them, all trying to allocate memory on the same GPU. Here is a snapshot of part of nvidia-smi at about 1 second before the crash:

| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    107774      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     7276MiB |
|    0   N/A  N/A    107775      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     3626MiB |
|    0   N/A  N/A    107776      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     3626MiB |
|    0   N/A  N/A    107777      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     3626MiB |
|    0   N/A  N/A    107778      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     3626MiB |
|    0   N/A  N/A    107779      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     3626MiB |
|    0   N/A  N/A    107780      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     3626MiB |
|    0   N/A  N/A    107781      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     3626MiB |
|    1   N/A  N/A    107775      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     4220MiB |
|    2   N/A  N/A    107776      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     4220MiB |
|    3   N/A  N/A    107777      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     4220MiB |
|    4   N/A  N/A    107778      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     4220MiB |
|    5   N/A  N/A    107779      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     4220MiB |
|    6   N/A  N/A    107780      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     4220MiB |
|    7   N/A  N/A    107781      C   ...xxxxxxxx/git/llm.c/./train_gpt2cu_2     3980MiB |
+---------------------------------------------------------------------------------------+
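Given that symptom, a quick per-rank sanity check (a standalone sketch, not part of llm.c) is to have every process print its MPI rank and the CUDA device it is currently bound to; if every rank reports device 0, the device is never being selected per process:

    // minimal MPI + CUDA device check, built with the same OpenMPI/CUDA toolchain
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, device = -1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDevice(&device);  // which GPU this process is currently using
        printf("rank %d -> cuda device %d\n", rank, device);
        MPI_Finalize();
        return 0;
    }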

pjj avatar May 07 '24 22:05 pjj

Found the bug. common_start always sets the GPU to index 0 and doesn't take the multi-GPU config into account. Working on the fix. @pjj thanks for the analysis above. It would also be good to add a multi-GPU test later.
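For context, the kind of fix this implies (a sketch only; the function and field names here are assumptions, not necessarily the actual llm.c identifiers) is to bind each process to the GPU matching its rank before any cudaMalloc happens:

    // sketch: select the device from the process rank instead of always using device 0
    void common_start_for_rank(int process_rank) {
        int device_count = 0;
        cudaCheck(cudaGetDeviceCount(&device_count));
        cudaCheck(cudaSetDevice(process_rank % device_count));
        // ... the rest of the device setup (cuBLAS handle, streams, events) runs after this ...
    }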

chinthysl avatar May 08 '24 04:05 chinthysl