NVIDIA CUDA samples fail to run with SCUDA due to lack of CUDA ELF parsing
I am working with main (commit 4f3b9b8bfb14114bba3d98de4b5a900b9dc0f170; HEAD -> main, origin/main, origin/HEAD).
I am trying to get some of the CUDA samples to work remotely.
When execution starts, the client issues a cudaLaunchKernel call, but that call never reaches the server.
How to reproduce:
Server side is started with: ./local.sh server
Client side: check out https://github.com/NVIDIA/cuda-samples and build the CUDA samples. I was working with /Samples/0_Introduction/simplePrintf.
Run it with: LD_PRELOAD=./libscuda_12.6.so ./simplePrintf
Client side output:
Opening connection to server
decompression required; starting decompress...
decompressed return::: : 44
compared return::: : 44
//
//
//
//
//
.version 8.5
.target sm_52
.address_size 64
Starting device count:1
GPU Device 0: "Ada" with compute capability 8.9
Device 0: "NVIDIA GeForce RTX 4090" with Compute capability 8.9
printf() is called. Output:
Calling __cudaPushCallConfiguration
Calling cudaDeviceSynchronize: 647
I had to apply the following patch so that the sample is not linked statically against the CUDA runtime:
diff --git a/Samples/0_Introduction/simplePrintf/CMakeLists.txt b/Samples/0_Introduction/simplePrintf/CMakeLists.txt
index 444c4305..dedd061d 100644
--- a/Samples/0_Introduction/simplePrintf/CMakeLists.txt
+++ b/Samples/0_Introduction/simplePrintf/CMakeLists.txt
@@ -8,10 +8,10 @@ find_package(CUDAToolkit REQUIRED)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
-set(CMAKE_CUDA_ARCHITECTURES 50 52 60 61 70 72 75 80 86 87 89 90 100 101 120)
+set(CMAKE_CUDA_ARCHITECTURES 50 52 60 61 70 72 75 80 86 87 89 90)
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -Wno-deprecated-gpu-targets")
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
- # set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G") # enable cuda-gdb (expensive)
+ # qset(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -G") # enable cuda-gdb (expensive)
endif()
# Include directories and libraries
@@ -26,3 +26,4 @@ target_compile_options(simplePrintf PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--extende
target_compile_features(simplePrintf PRIVATE cxx_std_17 cuda_std_17)
set_target_properties(simplePrintf PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
+set_target_properties(simplePrintf PROPERTIES CUDA_RUNTIME_LIBRARY Shared)
It looks like the kernel function was not parsed properly when __cudaRegisterFatBinary was called, so the client code failed on the kernel launch call.
I debugged it further, and the problem does appear to be in __cudaRegisterFatBinary. The CUDA samples from NVIDIA are stored as ELF by default, and the ELF image requires further processing to extract the CUDA kernel details from it.
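To make the ELF-versus-PTX distinction concrete, here is a minimal Python sketch for inspecting a decompressed payload offline. It is not SCUDA code: the names classify_payload and read_elf64_header are hypothetical, and locating the payload inside the fatbin container is out of scope. It relies only on the ELF magic bytes, the standard 64-byte little-endian ELF64 header layout, and the fact that a PTX module starts with a .version directive (as in the client output above).

```python
import struct

ELF_MAGIC = b"\x7fELF"
# Standard ELF64 header, little-endian: e_ident, e_type, e_machine, e_version,
# e_entry, e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum,
# e_shentsize, e_shnum, e_shstrndx (64 bytes in total).
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def classify_payload(blob: bytes) -> str:
    """Return 'elf', 'ptx', or 'unknown' for a decompressed payload (hypothetical helper)."""
    if blob.startswith(ELF_MAGIC):
        return "elf"
    # PTX is text: skip blank lines and '//' comments, then expect '.version'.
    for raw in blob.split(b"\n"):
        line = raw.strip()
        if not line or line.startswith(b"//"):
            continue
        return "ptx" if line.startswith(b".version") else "unknown"
    return "unknown"


def read_elf64_header(blob: bytes) -> dict:
    """Parse the fixed-size ELF64 header; raises ValueError on anything else."""
    if len(blob) < ELF64_HEADER.size or not blob.startswith(ELF_MAGIC):
        raise ValueError("not an ELF image")
    if blob[4] != 2 or blob[5] != 1:  # EI_CLASS == ELFCLASS64, EI_DATA == ELFDATA2LSB
        raise ValueError("only little-endian ELF64 is handled in this sketch")
    fields = ELF64_HEADER.unpack_from(blob)
    return {
        "e_machine": fields[2],
        "e_shoff": fields[6],       # file offset of the section header table
        "e_shentsize": fields[11],  # size of one section header entry
        "e_shnum": fields[12],      # number of section headers
        "e_shstrndx": fields[13],   # index of the section name string table
    }
```

A payload classified as 'elf' would still need its section headers and symbol table walked to recover kernel details, which is the further processing mentioned above; this sketch only gets as far as finding where the section header table lives.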
@sushilks can you share the PTX/fatbin files that are generated with nvcc --keep?
vectorAdd.build_with_keep.tar.gz These are the artifacts generated when building with "--keep".