CUDALibrarySamples icon indicating copy to clipboard operation
CUDALibrarySamples copied to clipboard

PBA Distance Transform using NPP returns non-exact EDT.

Open jeanchristopheruel opened this issue 1 year ago • 8 comments

Related to this topic I followed the examples at https://github.com/NVIDIA/CUDALibrarySamples/tree/master/NPP/distanceTransform In these examples, I set the first row a sites.

// nppiDistanceTransformPBA_8u32f_C1R_Ctx
[0]: 0 0 0 0 0 0 0 0 0 0
[1]: 1 1 1 1 1 1 1 1 1 1
[2]: 1 1 1 1 1 1 1 1 1 1
[3]: 3 3 3 3 3 3 3 3 3 3
[4]: 3 3 3 3 3 3 3 3 3 3
[5]: 4 4 4 4 4 4 4 4 4 4
[6]: 5 5 5 5 5 5 5 5 5 5
[7]: 6 6 6 6 6 6 6 6 6 6
[8]: 7 7 7 7 7 7 7 7 7 7
[9]: 8 8 8 8 8 8 8 8 8 8
[10]: 9 9 9 9 9 9 9 9 9 9
// nppiDistanceTransformAbsPBA_8u16u_C1R_Ctx
[0]: 0 0 0 0 0 0 0 0 0 0
[1]: 1 1 1 1 1 1 1 1 1 1
[2]: 1 1 1 1 1 1 1 1 1 1
[3]: 3 3 3 3 3 3 3 3 3 3
[4]: 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5
[5]: 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5
[6]: 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5
[7]: 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5
[8]: 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5
[9]: 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
[10]: 9.5 9.5 9.5 9.5 9.5 9.5 9.5 9.5 9.5 9.5

jeanchristopheruel avatar Aug 27 '24 06:08 jeanchristopheruel

Hi @jeanchristopheruel,

Thank you for reporting this issue. Could you please provide a small reproducible example along with details of your testing environment, including information about the GPUs, CUDA Toolkit version, and any other relevant configurations?

mkhadatare avatar Aug 28 '24 19:08 mkhadatare

Hey @mkhadatare I'm seeing the exact same issue on one system, but not on another. I am using nppiDistanceTransformPBA_8u32f_C1R_Ctx.

System with issue: RTX 4070 CUDA/NPP version: 11.8

System without issue: Jetson Orin Nano 4GB CUDA/NPP version: 11.4

The nature of the issue is exactly as demonstrated in @jeanchristopheruel's post. Locations which should have distances of 2 are reported to have distances of 1. This appears to right itself at distance 3. Then at positions which should have distance 4, we see distance 3.5. Then the rate of the distance increments correctly, though the values themselves are incorrect. (3.5, 4.5, 5.5 where it should be 4, 5, 6)

soooch avatar Sep 11 '24 22:09 soooch

This is internally tracked at 4832970 btw.

soooch avatar Sep 11 '24 22:09 soooch

Woohoo thanks a lot for the followup!

jeanchristopheruel avatar Sep 11 '24 22:09 jeanchristopheruel

Hi @soooch Thank you for reporting this. We are currently working on the issue under bug tracker ID 4832970 and will keep you updated on the progress.

mkhadatare avatar Sep 11 '24 22:09 mkhadatare

Hi we originally did not have this issue on Jetson Orin Nano 4GB. However, since upgrading to Jetpack 6 with Cuda NPP version 12.3.1 we seem to be observing this same problem now on Jetson. Here's a sample program

#include <stdio.h>
#include <stdlib.h>

#include "cuda_runtime.h"
#include "nppdefs.h"
#include "nppcore.h"
#include "nppi_filtering_functions.h"
#include "npps_initialization.h"

#define min(x, y) (((x) < (y)) ? (x) : (y))
#define nImageWidth 64
#define nImageHeight 64

int test_npp()
{
    NppStreamContext nppStreamCtx;
    nppStreamCtx.hStream = 0;

    cudaGetDevice(&nppStreamCtx.nCudaDeviceId);
    cudaDeviceGetAttribute(&nppStreamCtx.nCudaDevAttrComputeCapabilityMajor,
                           cudaDevAttrComputeCapabilityMajor,
                           nppStreamCtx.nCudaDeviceId);
    cudaDeviceGetAttribute(&nppStreamCtx.nCudaDevAttrComputeCapabilityMinor,
                           cudaDevAttrComputeCapabilityMinor,
                           nppStreamCtx.nCudaDeviceId);
    cudaStreamGetFlags(nppStreamCtx.hStream, &nppStreamCtx.nStreamFlags);

    struct cudaDeviceProp oDeviceProperties;
    cudaGetDeviceProperties(&oDeviceProperties, nppStreamCtx.nCudaDeviceId);
    nppStreamCtx.nMultiProcessorCount = oDeviceProperties.multiProcessorCount;
    nppStreamCtx.nMaxThreadsPerMultiProcessor = oDeviceProperties.maxThreadsPerMultiProcessor;
    nppStreamCtx.nMaxThreadsPerBlock = oDeviceProperties.maxThreadsPerBlock;
    nppStreamCtx.nSharedMemPerBlock = oDeviceProperties.sharedMemPerBlock;


    NppiSize oImageSizeROI = {nImageWidth, nImageHeight};

    Npp8u hostInputBuffer[nImageHeight * nImageWidth] = {0};
    for (int i = 0; i < nImageWidth; i++) {
      // Fill first row with 1s
      hostInputBuffer[i] = 1;
    }

    printf("Input array (%dx%d):\n", nImageWidth, nImageHeight);
    if (nImageWidth != 8) {
          printf("Truncating printed width to %d\n", 8);
    }
    for (int y = 0; y < nImageHeight; y++) {
        for (int x = 0; x < min(8, nImageWidth); x++) {
            printf("%d ", hostInputBuffer[y * nImageWidth + x]);
        }
        printf("\n");
    }

    Npp8u *pInputImage_Device = 0;
    Npp32f *pOutputImage_Device = 0;
    Npp8u *pScratchBuffer = 0;
    Npp32f *hostOutputBuffer = 0;

    size_t nScratchBufferSize;
    nppiDistanceTransformPBAGetBufferSize(oImageSizeROI, &nScratchBufferSize);
    printf("Scratch buffer size: %zu bytes\n", nScratchBufferSize);

    cudaMalloc((void **)&pScratchBuffer, nScratchBufferSize);
    cudaMalloc((void **)&pInputImage_Device, oImageSizeROI.width * sizeof(Npp8u) * oImageSizeROI.height);
    cudaMalloc((void **)&pOutputImage_Device, oImageSizeROI.width * sizeof(Npp32f) * oImageSizeROI.height);
    hostOutputBuffer = (Npp32f *)malloc(oImageSizeROI.width * sizeof(Npp32f) * oImageSizeROI.height);

    if (pScratchBuffer == 0 || pInputImage_Device == 0 || pOutputImage_Device == 0 || hostOutputBuffer == 0)
    {
        printf("Memory allocation failed\n");
        if (pScratchBuffer)
            cudaFree(pScratchBuffer);
        if (pInputImage_Device)
            cudaFree(pInputImage_Device);
        if (pOutputImage_Device)
            cudaFree(pOutputImage_Device);
        if (hostOutputBuffer)
            free(hostOutputBuffer);
        return -1;
    }

    cudaMemcpy(pInputImage_Device, hostInputBuffer, 
               oImageSizeROI.width * oImageSizeROI.height * sizeof(Npp8u), 
               cudaMemcpyHostToDevice);

    Npp8u nMinSiteValue = 1;
    Npp8u nMaxSiteValue = 1;

    printf("Running distance transform...\n");
    NppStatus status = nppiDistanceTransformPBA_8u32f_C1R_Ctx(
        pInputImage_Device, oImageSizeROI.width * sizeof(Npp8u),
        nMinSiteValue, nMaxSiteValue,
        0, 0,
        0, 0,
        0, 0,
        pOutputImage_Device, oImageSizeROI.width * sizeof(Npp32f),
        oImageSizeROI, pScratchBuffer, nppStreamCtx);

    if (status != NPP_SUCCESS)
    {
        printf("Distance transform failed with error: %d\n", status);
    }
    else
    {
        printf("Distance transform completed successfully!\n");
        
        cudaMemcpy(hostOutputBuffer, pOutputImage_Device,
                   oImageSizeROI.width * sizeof(Npp32f) * oImageSizeROI.height,
                   cudaMemcpyDeviceToHost);
        
        if (nImageHeight != 8) {
          printf("Truncating printed width to %d\n", 8);
        }
        for (int y = 0; y < nImageHeight; y++) {
            for (int x = 0; x < min(8, nImageWidth); x++) {
                printf("%.1f ", hostOutputBuffer[y * nImageWidth + x]);
            }
            printf("\n");
        }
    }

    cudaDeviceSynchronize();

    cudaFree(pScratchBuffer);
    cudaFree(pInputImage_Device);
    cudaFree(pOutputImage_Device);
    free(hostOutputBuffer);

    printf("Done!\n");
    return 0;
}

int main() {
  void *x = 0;
  cudaMallocHost((void **)&x, 8);
  if (x != 0) {
    cudaFreeHost(x);
    x = 0;
  }

  NppLibraryVersion * version = nppGetLibVersion();
  printf("%d %d %d\n", version->major, version->minor, version->build);

  test_npp();
}

as well as its output on jetson

12 3 1
Input array (64x64):
Truncating printed width to 8
1 1 1 1 1 1 1 1 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 
Scratch buffer size: 98840 bytes
Running distance transform...
Distance transform completed successfully!
Truncating printed width to 8
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 
1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 
3.5 3.5 3.5 3.5 3.5 3.5 3.5 3.5 
4.5 4.5 4.5 4.5 4.5 4.5 4.5 4.5 
5.5 5.5 5.5 5.5 5.5 5.5 5.5 5.5 
6.5 6.5 6.5 6.5 6.5 6.5 6.5 6.5 
7.5 7.5 7.5 7.5 7.5 7.5 7.5 7.5 
8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 
9.5 9.5 9.5 9.5 9.5 9.5 9.5 9.5 
10.5 10.5 10.5 10.5 10.5 10.5 10.5 10.5 
11.5 11.5 11.5 11.5 11.5 11.5 11.5 11.5 
12.5 12.5 12.5 12.5 12.5 12.5 12.5 12.5 
13.5 13.5 13.5 13.5 13.5 13.5 13.5 13.5 
14.5 14.5 14.5 14.5 14.5 14.5 14.5 14.5 
15.5 15.5 15.5 15.5 15.5 15.5 15.5 15.5 
16.5 16.5 16.5 16.5 16.5 16.5 16.5 16.5 
17.5 17.5 17.5 17.5 17.5 17.5 17.5 17.5 
18.5 18.5 18.5 18.5 18.5 18.5 18.5 18.5 
19.5 19.5 19.5 19.5 19.5 19.5 19.5 19.5 
20.5 20.5 20.5 20.5 20.5 20.5 20.5 20.5 
21.5 21.5 21.5 21.5 21.5 21.5 21.5 21.5 
22.5 22.5 22.5 22.5 22.5 22.5 22.5 22.5 
23.5 23.5 23.5 23.5 23.5 23.5 23.5 23.5 
24.5 24.5 24.5 24.5 24.5 24.5 24.5 24.5 
25.5 25.5 25.5 25.5 25.5 25.5 25.5 25.5 
26.5 26.5 26.5 26.5 26.5 26.5 26.5 26.5 
27.5 27.5 27.5 27.5 27.5 27.5 27.5 27.5 
28.5 28.5 28.5 28.5 28.5 28.5 28.5 28.5 
29.5 29.5 29.5 29.5 29.5 29.5 29.5 29.5 
30.5 30.5 30.5 30.5 30.5 30.5 30.5 30.5 
31.5 31.5 31.5 31.5 31.5 31.5 31.5 31.5 
32.5 32.5 32.5 32.5 32.5 32.5 32.5 32.5 
33.5 33.5 33.5 33.5 33.5 33.5 33.5 33.5 
34.5 34.5 34.5 34.5 34.5 34.5 34.5 34.5 
35.5 35.5 35.5 35.5 35.5 35.5 35.5 35.5 
36.5 36.5 36.5 36.5 36.5 36.5 36.5 36.5 
37.5 37.5 37.5 37.5 37.5 37.5 37.5 37.5 
38.5 38.5 38.5 38.5 38.5 38.5 38.5 38.5 
39.5 39.5 39.5 39.5 39.5 39.5 39.5 39.5 
40.5 40.5 40.5 40.5 40.5 40.5 40.5 40.5 
41.5 41.5 41.5 41.5 41.5 41.5 41.5 41.5 
42.5 42.5 42.5 42.5 42.5 42.5 42.5 42.5 
43.5 43.5 43.5 43.5 43.5 43.5 43.5 43.5 
44.5 44.5 44.5 44.5 44.5 44.5 44.5 44.5 
45.5 45.5 45.5 45.5 45.5 45.5 45.5 45.5 
46.5 46.5 46.5 46.5 46.5 46.5 46.5 46.5 
47.5 47.5 47.5 47.5 47.5 47.5 47.5 47.5 
48.5 48.5 48.5 48.5 48.5 48.5 48.5 48.5 
49.5 49.5 49.5 49.5 49.5 49.5 49.5 49.5 
50.5 50.5 50.5 50.5 50.5 50.5 50.5 50.5 
51.5 51.5 51.5 51.5 51.5 51.5 51.5 51.5 
52.5 52.5 52.5 52.5 52.5 52.5 52.5 52.5 
53.5 53.5 53.5 53.5 53.5 53.5 53.5 53.5 
54.5 54.5 54.5 54.5 54.5 54.5 54.5 54.5 
55.5 55.5 55.5 55.5 55.5 55.5 55.5 55.5 
56.5 56.5 56.5 56.5 56.5 56.5 56.5 56.5 
57.5 57.5 57.5 57.5 57.5 57.5 57.5 57.5 
58.5 58.5 58.5 58.5 58.5 58.5 58.5 58.5 
59.5 59.5 59.5 59.5 59.5 59.5 59.5 59.5 
60.5 60.5 60.5 60.5 60.5 60.5 60.5 60.5 
61.5 61.5 61.5 61.5 61.5 61.5 61.5 61.5 
62.5 62.5 62.5 62.5 62.5 62.5 62.5 62.5 
Done!

MilesConn avatar Mar 24 '25 18:03 MilesConn

@mkhadatare any updates on 4832970? Thanks so much for your help. Also @soooch and I work together which is why we're both observing the issue on Jetson

MilesConn avatar Mar 24 '25 18:03 MilesConn

Sorry to spam this issue but we found that pDstVoronoiIndices is accurate. So we'll be using that instead of the pDstTransform output. Hope this helps someone

MilesConn avatar Mar 25 '25 00:03 MilesConn

Hi, the issue has been resolved in the latest CUDA Toolkit 12.9. Please refer to the release notes for more details.

mkhadatare avatar Jun 26 '25 05:06 mkhadatare