segmentation fault in simple example program due to null pointer dereference
Hi, trying Vuda for the first time here. To get the simple example to compile, I had to fix a few errors in the inc folder that my g++ (version 12.2) was complaining about. I'll file those changes as a PR later; they're minor and not relevant here.
But the main problem I'm running into when trying to get the examples to work is a segfault in cudaMemcpy.
This is the backtrace from GDB:
(gdb) backtrace
#0 __memcpy_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:496
#1 0x0000555555575cfb in vuda::detail::logical_device::memcpyToDevice(std::thread::id, void*, void const*, unsigned long, unsigned int) ()
#2 0x0000555555578bd7 in vuda::memcpy(void*, void const*, unsigned long, vuda::memcpyKind, unsigned int) ()
#3 0x0000555555578cda in cudaMemcpy(void*, void const*, unsigned long, vuda::memcpyKind) ()
#4 0x000055555555c7d0 in main ()
Using printf, I further pinpointed the issue to the std::memcpy line here:
https://github.com/jgbit/vuda/blob/7e7c3348fafa111098c7ec36cc71d4d864753f87/inc/state/logicaldevice.inl#L490-L496
Apparently src_ptr is non-null, but src_ptr->get_memptr() is returning null. Internally this returns m_ptrMemBlock->get_ptr(), which in turn returns a private class member void* m_ptr.
My first thought was that one of the cudaMalloc calls had failed, so I checked the cudaError_t return values, but everything returns cudaSuccess. So this looks like an internal bug, but I'm not sure how to fix it, because I don't know how this pointer is supposed to be populated; everywhere my print statements show it, it's null.
Here is the full diff of my local changes, debug statements included.
diff --git a/inc/state/logicaldevice.inl b/inc/state/logicaldevice.inl
index 46383fc..e219c34 100644
--- a/inc/state/logicaldevice.inl
+++ b/inc/state/logicaldevice.inl
@@ -467,6 +467,8 @@ namespace vuda
assert(stream >= 0 && stream < m_queueComputeCount);
bool use_staged = false;
+ printf("1\n");
+
//
// all threads can read from the memory resources on the logical device
{
@@ -474,10 +476,12 @@ namespace vuda
const default_storage_node* dst_node = m_storage.search_range(m_storageBST_root, dst);
assert(dst_node != nullptr);
+ printf("2\n");
//
// NOTE: src can be any ptr, we must check if the pointer is a known resource node or a pageable host allocation
internal_node* src_ptr = m_storage.search_range(m_storageBST_root, const_cast<void*>(src));
+ printf("3\n");
if(src_ptr == nullptr)
{
@@ -486,14 +490,18 @@ namespace vuda
//
// perform stream sync before initiating the copy
FlushQueue(tid, stream);
+ printf("4\n");
//
// request a pinned (internal) staging buffer
// copy the memory to a pinned staging buffer which is allocated with host visible memory (this is the infamous double copy)
// copy from stage buffer to device
use_staged = true;
+ printf("5.1 %ld\n", count);
src_ptr = m_pinnedBuffers.get_buffer(count, m_allocator);
+ printf("5.2 %p %p %lu\n", src_ptr->get_memptr(), src, count);
std::memcpy(src_ptr->get_memptr(), src, count);
+ printf("6\n");
/*std::ostringstream ostr;
ostr << "tid: " << std::this_thread::get_id() << ", using staged node: " << stage_ptr << std::endl;
diff --git a/inc/state/memoryallocator.hpp b/inc/state/memoryallocator.hpp
index 3ed35b3..48e93b5 100644
--- a/inc/state/memoryallocator.hpp
+++ b/inc/state/memoryallocator.hpp
@@ -13,6 +13,7 @@ namespace vuda
// allocate
memory_block(const vk::DeviceMemory memory, const vk::DeviceSize offset, const vk::DeviceSize size, const vk::Buffer& buffer) : m_memory(memory), m_offset(offset), m_size(size), m_ptr(nullptr), m_buffer(buffer)
{
+ printf("new memory_block: %p\n", m_ptr);
}
//
@@ -26,6 +27,7 @@ namespace vuda
void reallocate(const vk::DeviceSize offset, const vk::DeviceSize size, void* ptr)
{
// before calling allocate test_and_set must have been called
+ printf("memory_block.reallocate: %p\n", ptr);
m_offset = offset;
m_size = size;
m_ptr = ptr;
diff --git a/inc/state/node_internal.hpp b/inc/state/node_internal.hpp
index 2d7c4c9..472ce4b 100644
--- a/inc/state/node_internal.hpp
+++ b/inc/state/node_internal.hpp
@@ -103,6 +103,8 @@ namespace vuda
// lock
std::lock_guard<std::mutex> lck(*m_mtx);
+ printf("5.1.1 %ld\n", size);
+
//
// find free buffer
BufferType *hcb = nullptr;
@@ -115,6 +117,7 @@ namespace vuda
if(m_buffers[i]->GetSize() >= size)
{
hcb = m_buffers[i].get();
+ printf("5.1.2 found buffer %d %p\n", i, hcb);
break;
}
else
@@ -134,6 +137,7 @@ namespace vuda
hcb = m_buffers.back().get();
}
+ printf("5.1.3 returning buffer %p\n", hcb);
return hcb;
}
diff --git a/inc/state/pool.hpp b/inc/state/pool.hpp
index 5e8c36d..1fcee24 100644
--- a/inc/state/pool.hpp
+++ b/inc/state/pool.hpp
@@ -172,7 +172,7 @@ namespace vuda
void reset(const vk::Device device)
{
- device.resetFences(1, &m_fence);
+ vk::Result res = device.resetFences(1, &m_fence);
//
// return all descriptor sets to their respective pools
diff --git a/inc/vuda.hpp b/inc/vuda.hpp
index 4d5deb2..7971eb8 100644
--- a/inc/vuda.hpp
+++ b/inc/vuda.hpp
@@ -8,6 +8,7 @@
#include <sstream>
#include <fstream>
#include <thread>
+#include <memory>
#include <mutex>
#include <shared_mutex>
#include <atomic>
diff --git a/samples/simple/Makefile b/samples/simple/Makefile
index f94b76f..db61118 100644
--- a/samples/simple/Makefile
+++ b/samples/simple/Makefile
@@ -5,7 +5,7 @@ SOURCES=$(EXECUTABLE).cpp
CUDA_SRC=$(EXECUTABLE).cpp
$(EXECUTABLE): $(SOURCES)
- $(CC) $(CFLAGS) $^ -o $@ $(INCLUDE) $(LDFLAGS)
+ $(CC) -g $(CFLAGS) $^ -o $@ $(INCLUDE) $(LDFLAGS)
glslangValidator -V add.comp -o add.spv
cuda: $(CUDA_SRC)
diff --git a/samples/simple/simple.cpp b/samples/simple/simple.cpp
index 7db22b2..f4f0c1b 100644
--- a/samples/simple/simple.cpp
+++ b/samples/simple/simple.cpp
@@ -24,10 +24,19 @@ __global__ void add(const int* dev_a, const int* dev_b, int* dev_c, const int N)
#endif
-int main(void)
-{
+void handleError(cudaError_t err) {
+ if (err != cudaSuccess) {
+ // const char* name = cudaGetErrorName(err);
+ const char* err_str = cudaGetErrorString(err);
+ printf("%s\n", err_str);
+ exit(1);
+ }
+}
+
+int main(void) {
// assign a device to the thread
- cudaSetDevice(0);
+ handleError(cudaSetDevice(0));
+
// allocate memory on the device
const int N = 5000;
int a[N], b[N], c[N];
@@ -37,12 +46,24 @@ int main(void)
b[i] = i * i;
}
int *dev_a, *dev_b, *dev_c;
- cudaMalloc((void**)&dev_a, N * sizeof(int));
- cudaMalloc((void**)&dev_b, N * sizeof(int));
- cudaMalloc((void**)&dev_c, N * sizeof(int));
+
+ printf("dev_a before cudaMalloc: %p\n", dev_a);
+ printf("dev_b before cudaMalloc: %p\n", dev_b);
+ printf("dev_c before cudaMalloc: %p\n", dev_c);
+
+ handleError(cudaMalloc((void**)&dev_a, N * sizeof(int)));
+ handleError(cudaMalloc((void**)&dev_b, N * sizeof(int)));
+ handleError(cudaMalloc((void**)&dev_c, N * sizeof(int)));
+
+ printf("dev_a after cudaMalloc: %p\n", dev_a);
+ printf("dev_b after cudaMalloc: %p\n", dev_b);
+ printf("dev_c after cudaMalloc: %p\n", dev_c);
+
+
// copy the arrays a and b to the device
- cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
- cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
+ handleError(cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice));
+ printf("out safely\n");
+ handleError(cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice));
// run kernel (vulkan shader module)
const int blocks = 128;
const int threads = 128;
Here is the debugging output. I get the same output on both my dev machine and a rented GPU cloud node.
dev_a before cudaMalloc: (nil)
dev_b before cudaMalloc: (nil)
dev_c before cudaMalloc: (nil)
new memory_block: (nil)
new memory_block: (nil)
memory_block.reallocate: (nil)
5.1.1 20000
new memory_block: (nil)
memory_block.reallocate: (nil)
5.1.3 returning buffer 0x5c9575bf77c0
5.1.1 20000
memory_block.reallocate: (nil)
5.1.3 returning buffer 0x5c9575bf78b0
memory_block.reallocate: (nil)
5.1.1 20000
new memory_block: (nil)
memory_block.reallocate: (nil)
5.1.3 returning buffer 0x5c9575bf7950
5.1.1 20000
memory_block.reallocate: (nil)
5.1.3 returning buffer 0x5c9575bf7820
memory_block.reallocate: (nil)
5.1.1 20000
new memory_block: (nil)
memory_block.reallocate: (nil)
5.1.3 returning buffer 0x5c9575bf78d0
5.1.1 20000
memory_block.reallocate: (nil)
5.1.3 returning buffer 0x5c9575bf79b0
dev_a after cudaMalloc: 0x7229a1bc2000
dev_b after cudaMalloc: 0x7229a1bbd000
dev_c after cudaMalloc: 0x7229a1bb8000
1
2
3
4
5.1 20000
5.1.1 20000
5.1.2 found buffer 2 0x5c9575bf78d0
5.1.3 returning buffer 0x5c9575bf78d0
5.2 (nil) 0x7ffc472f8b80 20000
Segmentation fault
It seems strange to me that you encounter issues with the basic examples. Vuda is super powerful and relatively stable. I myself used it together with vkFFT for quite complicated quantum cryptography algorithms a few years ago. It seems likely that something in your environment is breaking it. I recommend you try an older Vulkan version from early 2021 in case newer versions broke Vuda. I also recommend you try my Vuda fork under https://github.com/nicoboss/PrivacyAmplification/tree/master/PrivacyAmplification/vuda in which I fixed many issues I encountered with the original code, including some Vuda memory management and alignment issues, though I wouldn't expect them to make a difference for the basic examples. Maybe try to compile and run my PrivacyAmplification project, which uses Vuda as its foundation for the Vulkan version, to see if it works on your environment. I last compiled it a few months ago and everything still worked flawlessly for me.
The oldest SDK version available from LunarG is 1.3, which I just tried. Specifically, I used 1.3.268.0 (released October 2023). It still segfaults in exactly the same way.
If there is some specific archived version of the Vulkan SDK that Vuda depends upon, shouldn't the wiki specify which version it requires and how to acquire it? Right now the wiki just says this:
The only requirements for developing with the VUDA library is to have access to a Vulkan compatible system and install the Vulkan SDK. That said, the main tested targets of VUDA are Windows, Linux and Mac OS (through MoltenVK).
As for my environment, I'm on vanilla Debian 12, no fancy extras, with build-essential installed along with the basic deps required by the Vulkan SDK. I'm compiling with g++. I've also tried this on a completely different physical machine (also debian) and got the same result.
I just tried creating a clean development environment on a clean Ubuntu 24.04.2 LTS install, and it doesn't seem to work for me there either. I'm getting std::out_of_range in vuda::detail::logical_device::GetPool (tid=..., this=
If there is some specific archived version of the Vulkan SDK that Vuda depends upon, shouldn't the wiki specify which version it requires and how to acquire it?
On my working development setup, I'm using Vulkan SDK 1.2.176.1 and glslang master from 9th April 2021. This works for me on both Windows and Linux, and for both NVIDIA and AMD GPUs. Newer versions might work as well; there is no particular reason why I chose those specific versions for my PrivacyAmplification project.
The only requirements for developing with the VUDA library is to have access to a Vulkan compatible system and install the Vulkan SDK. That said, the main tested targets of VUDA are Windows, Linux and Mac OS (through MoltenVK).
You also have to consider when that statement was made. Back then Vuda worked on the latest Vulkan SDK, but I wouldn't just blindly assume that this is still the case.
@conduition There is now https://github.com/jgbit/vuda/pull/30 which fixes Vuda support for the latest Vulkan SDK 1.4.328.1 (released on 8th October 2025). Please give it a try and see if it fixes your issue.
That branch does fix the fatal compilation errors, but I still get the segmentation fault in the simple example, exactly as described earlier. Most of the other samples also segfault, though possibly in different ways.