HAIS icon indicating copy to clipboard operation
HAIS copied to clipboard

CUDA kernel failed : an illegal memory access was encountered

Open HelinXu opened this issue 3 years ago • 7 comments
trafficstars

Hi! Thanks for the great work. I'm running HAIS on my own dataset, and have modified class_numpoint_mean_dict and class_radius_mean accordingly. The problem is, I get CUDA kernel failed: an illegal memory access was encountered when testing. I traced the problem by inserting cudaGetLastError() in hiererchical_aggregation.cu, and found the code breaks around the following lines:

    assert(primary_num <= MAX_PRIMARY_NUM);
    concat_fragments_<<<primary_num, (int)1>>>(
        cuda_fragment_idxs, cuda_fragment_offsets,
        cuda_primary_idxs, cuda_primary_offsets,
        cuda_primary_absorb_fragment_idx, cuda_primary_absorb_fragment_cnt,
        cuda_concat_idxs, cuda_concat_point_num,
        primary_num);
    cudaDeviceSynchronize();

// got CUDA kernel failed : an illegal memory access was encountered here.

Do you have any suggestions?

HelinXu avatar Feb 27 '22 05:02 HelinXu

It seems a problem of memory overflow. If point clouds of your datasets are much denser or larger than ScanNet, I suggest to increase the values of these macros. https://github.com/hustvl/HAIS/blob/73a6f591ad29e3399ea83e9a669873fe9bb05ab0/lib/hais_ops/src/hierarchical_aggregation/hierarchical_aggregation.cu#L8-L11

outsidercsy avatar Mar 09 '22 02:03 outsidercsy

Hi, sorry to bother you again.

Each point cloud in my dataset always consists of 20,000 points. I've tried your suggestion out, and have increased/decreased those macros. I found that with different macros, the number of samples passes before failing is actually different! For example, with original settings, I can get 8 samples to pass the test script while increasing the numbers by 16 times, I can only get 3 samples to pass. It breaks after either one of the case (both after sync):

    // // for each fragment, find its primary
    int *cuda_primary_absorb_fragment_idx; // array for saving the fragment idxs
    int *cuda_primary_absorb_fragment_cnt; // array for saving the fragment nums
    cudaMalloc((void**)&cuda_primary_absorb_fragment_idx, primary_num * MAX_PER_PRIMARY_ABSORB_FRAGMENT_NUM * sizeof(int) + sizeof(int));
    cudaMalloc((void**)&cuda_primary_absorb_fragment_cnt, primary_num * sizeof(int) + sizeof(int));
    printf("fragment_num = %d\n", fragment_num); // I've printed out fragment number, which is mostly 10~40, never 0 on my dataset.
    if (fragment_num != 0) {
        fragment_find_primary_<<<int(DIVUP(fragment_num, MAX_THREADS_PER_BLOCK)), (int)MAX_THREADS_PER_BLOCK>>>(
            primary_num, cuda_primary_offsets, cuda_primary_centers,
            fragment_num, cuda_fragment_offsets, cuda_fragment_centers,
            cuda_primary_absorb_fragment_idx, cuda_primary_absorb_fragment_cnt);
        }
    err  = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "before sync CUDA kernel failed : %s\n", cudaGetErrorString(err)); // This never fails in my case.
        exit(-1);
    }
    cudaDeviceSynchronize();
    err  = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, " after sync CUDA kernel failed : %s\n", cudaGetErrorString(err)); // However, this fails!
        exit(-1);
    }
    // concatenate fragments belonging to the same primary
    int *cuda_concat_idxs;
    int *cuda_concat_point_num;
    cudaMalloc((void**)&cuda_concat_idxs, primary_num * MAX_PER_PRIMARY_ABSORB_POINT_NUM * 2 * sizeof(int) + sizeof(int));
    cudaMalloc((void**)&cuda_concat_point_num, primary_num *  sizeof(int) + sizeof(int));
    assert(primary_num <= MAX_PRIMARY_NUM);
    concat_fragments_<<<primary_num, (int)1>>>(
        cuda_fragment_idxs, cuda_fragment_offsets,
        cuda_primary_idxs, cuda_primary_offsets,
        cuda_primary_absorb_fragment_idx, cuda_primary_absorb_fragment_cnt,
        cuda_concat_idxs, cuda_concat_point_num,
        primary_num);
    err  = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "before sync 2 CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }
    cudaDeviceSynchronize();
    err  = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "after sync 2 CUDA kernel failed : %s\n", cudaGetErrorString(err));
        exit(-1);
    }

HelinXu avatar Mar 20 '22 07:03 HelinXu

@HelinXu Hi, I have the same problem as you, How did you solve it?

Dragon-Vae avatar May 19 '22 02:05 Dragon-Vae

@Dragon-Vae Hi, I am on a different project and have not fully understood this behavior yet. I'll let you know if I figure it out :)

HelinXu avatar May 19 '22 02:05 HelinXu

@HelinXu OK, thank you for your reply

Dragon-Vae avatar May 19 '22 02:05 Dragon-Vae

@outsidercsy Hi, I also encountered an illegal memory access issue, will #32 be the trick?

eamonn-zh avatar Aug 05 '22 23:08 eamonn-zh

@eamonn-zh I encountered the same question,and I found that when I change the batch size to 4 instead of 1,there is no problem. So I think maybe it's not because of memory overflow? And do you figure out this question now?

hmax233 avatar Oct 10 '22 03:10 hmax233