Nvidia CUDA samples fail to run with SCUDA due to lack of cuda-elf parsing.

Working with main (commit 4f3b9b8bfb14114bba3d98de4b5a900b9dc0f170; HEAD -> main, origin/main, origin/HEAD).

I am trying to get some of the CUDA samples to work remotely.
When execution starts, I see a cudaLaunchKernel call go out on the client, but it never makes it to the server.

How to reproduce:

Server side is started with: ./local.sh server

Client side: check out https://github.com/NVIDIA/cuda-samples and build the CUDA samples. I was working with /Samples/0_Introduction/simplePrintf.

Run it using LD_PRELOAD=./libscuda_12.6.so ./simplePrintf

Client side output:

Opening connection to server
decompression required; starting decompress...
decompressed return::: : 44 
compared return::: : 44 
//
//
//
//
//

.version 8.5
.target sm_52
.address_size 64




Starting device count:1
GPU Device 0: "Ada" with compute capability 8.9

Device 0: "NVIDIA GeForce RTX 4090" with Compute capability 8.9
printf() is called. Output:

Calling __cudaPushCallConfiguration
Calling cudaDeviceSynchronize: 647

I had to apply the following patch so that the binary is not built statically (it links the shared CUDA runtime instead):

diff --git a/Samples/0_Introduction/simplePrintf/CMakeLists.txt b/Samples/0_Introduction/simplePrintf/CMakeLists.txt
index 444c4305..dedd061d 100644
--- a/Samples/0_Introduction/simplePrintf/CMakeLists.txt
+++ b/Samples/0_Introduction/simplePrintf/CMakeLists.txt
@@ -8,10 +8,10 @@ find_package(CUDAToolkit REQUIRED)
 
 set(CMAKE_POSITION_INDEPENDENT_CODE ON)
 
-set(CMAKE_CUDA_ARCHITECTURES 50 52 60 61 70 72 75 80 86 87 89 90 100 101 120)
+set(CMAKE_CUDA_ARCHITECTURES 50 52 60 61 70 72 75 80 86 87 89 90)
 set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
 if(CMAKE_BUILD_TYPE STREQUAL "Debug")
-    # set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")  # enable cuda-gdb (expensive)
+    # qset(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G")  # enable cuda-gdb (expensive)
 endif()
 
 # Include directories and libraries
@@ -26,3 +26,4 @@ target_compile_options(simplePrintf PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extende
 target_compile_features(simplePrintf PRIVATE cxx_std_17 cuda_std_17)
 
 set_target_properties(simplePrintf PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
+set_target_properties(simplePrintf PROPERTIES CUDA_RUNTIME_LIBRARY Shared)

Mar 11 '25 20:03 sushilks

It looks like the kernel function was not parsed properly when __cudaRegisterFatBinary was called, so the client code failed on the kernel launch call.
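
To illustrate the failure mode described here, below is a toy model in Python. It is not SCUDA's actual code; the class and method names (KernelRegistry, register, launch) are hypothetical. It only assumes that the client keeps a table from host function address to kernel name, filled in while the fat binary is registered, and that a launch can only be forwarded if that lookup succeeds.

```python
class KernelRegistry:
    """Hypothetical client-side table: host function address -> kernel name."""

    def __init__(self):
        self._names = {}

    def register(self, host_func: int, kernel_name: str) -> None:
        # Would be called for each kernel discovered while parsing the fat binary.
        self._names[host_func] = kernel_name

    def launch(self, host_func: int) -> str:
        # A launch can only be forwarded if the kernel was registered earlier.
        name = self._names.get(host_func)
        if name is None:
            raise LookupError(f"no kernel registered for host function {host_func:#x}")
        return name


registry = KernelRegistry()
registry.register(0x1000, "kernelA")  # illustrative address and name

print(registry.launch(0x1000))        # a parsed kernel can be resolved
try:
    registry.launch(0x2000)           # a kernel that was never parsed
except LookupError as err:
    print("launch failed:", err)
```

If parsing the fat binary yields no kernels, every launch takes the second path, which would match a cudaLaunchKernel call that is seen on the client but never reaches the server.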

Mar 11 '25 23:03 sushilks

I debugged it further. The problem appears to be in __cudaRegisterFatBinary: the CUDA samples from NVIDIA are stored as ELF by default and require further processing to extract the CUDA kernel details.
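
As a rough sketch of the distinction involved, the Python snippet below classifies a single payload as ELF or PTX text. It is an assumption-laden illustration, not SCUDA's parser: the function name classify_payload is hypothetical, the ELF check relies only on the standard ELF magic bytes, and the PTX check relies on the first non-comment line being a .version directive, as in the client output above. It does not extract kernel names from the ELF.

```python
ELF_MAGIC = b"\x7fELF"  # standard ELF identification bytes


def classify_payload(payload: bytes) -> str:
    """Return 'elf', 'ptx', or 'unknown' for one payload (hypothetical helper)."""
    if payload.startswith(ELF_MAGIC):
        return "elf"
    # PTX is plain text; anything that is not ASCII is treated as unknown here.
    try:
        text = payload.decode("ascii")
    except UnicodeDecodeError:
        return "unknown"
    # Skip blank lines and // comments, then expect a .version directive.
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("//"):
            continue
        return "ptx" if stripped.startswith(".version") else "unknown"
    return "unknown"


ptx_sample = b"//\n//\n\n.version 8.5\n.target sm_52\n.address_size 64\n"
elf_sample = ELF_MAGIC + b"\x02\x01\x01"  # truncated header, for illustration only

print(classify_payload(ptx_sample))
print(classify_payload(elf_sample))
```

A client that only handles the 'ptx' case would find no kernels in an 'elf' payload, which is the gap this issue describes.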

Mar 14 '25 17:03 sushilks

@sushilks can you share the ptx/fatbin files that are generated with nvcc --keep?

Mar 14 '25 18:03 kevmo314

vectorAdd.build_with_keep.tar.gz These are the artifacts generated when building with "--keep".

Mar 17 '25 01:03 sushilks
