
[BUG] launchlate example always shows 0 latency

Open ddelnano opened this issue 1 month ago • 7 comments

Describe the bug

When running the example/gpu/launchlate code from the v0.2.0 release, I'm always seeing a latency of 0. This occurs both when the GPU is idle and when other processes are running concurrently on the GPU.

To Reproduce

  1. Checkout v0.2.0
  2. Compile bpftime with CUDA support (docs)
  3. Follow the steps in the launchlate README

Expected behavior

The launch latency should vary over time.

Logs

The logs below show the 0 latency case described above.

vec_add kernel logs
dev@hacksaw:~/code/bpftime ((v0.2.0)) $ sudo env "PATH=$PATH" BPFTIME_LOG_OUTPUT=console LD_PRELOAD=build/runtime/agent/libbpftime-agent.so  example/gpu/launchlate/vec_add
[2025-10-28 15:30:12.987] [info] [bpftime_shm_internal.cpp:928] Registered shared memory with CUDA: addr=0x7f7cc3197000 size=20971520
[2025-10-28 15:30:12.987] [info] [bpftime_shm_internal.cpp:756] Global shm constructed. shm_open_type 1 for bpftime_maps_shm
[2025-10-28 15:30:12.987] [info] [bpftime_shm_internal.cpp:42] Global shm initialized
[2025-10-28 15:30:12][info][707013] Initializing CUDA shared memory
[2025-10-28 15:30:13][info][707013] CUDA context created
[2025-10-28 15:30:13][info][707013] bpf_attach_ctx constructed
[2025-10-28 15:30:13][info][707045] CUDA watcher thread started
[2025-10-28 15:30:13][info][707013] Register attach-impl defined helper bpf_get_func_arg, index 183
[2025-10-28 15:30:13][info][707013] Register attach-impl defined helper bpf_get_func_ret_id, index 184
[2025-10-28 15:30:13][info][707013] Register attach-impl defined helper bpf_get_retval, index 186
[2025-10-28 15:30:13][info][707013] Starting nv_attach_impl
[2025-10-28 15:30:13][info][707013] Initializing agent..
[2025-10-28 15:30:13][info][707013] Executable path: /home/dev/code/bpftime/example/gpu/launchlate/vec_add
[2025-10-28 15:30:13][info][707013] Skipping nv attach handler 35 since we are not handling nv handles
[2025-10-28 15:30:13][info][707013] Main initializing for handlers done, try to initialize cuda link handles....
[2025-10-28 15:30:13][info][707013] Handling link to CUDA program: 35, recording it..
[2025-10-28 15:30:13][info][707013] Loaded 131 instructions (original) for cuda ebpf program
[2025-10-28 15:30:13][warning][707013] Not a GPU map!
[2025-10-28 15:30:13][info][707013] Copying map fd 20 to device, key size=4, value size=8, max ent=1, map_type = 2, gpu_buffer = 0
[2025-10-28 15:30:13][warning][707013] Not a GPU map!
[2025-10-28 15:30:13][info][707013] Copying map fd 21 to device, key size=4, value size=8, max ent=10, map_type = 2, gpu_buffer = 0
[2025-10-28 15:30:13][warning][707013] Not a GPU map!
[2025-10-28 15:30:13][info][707013] Copying map fd 22 to device, key size=4, value size=8, max ent=1, map_type = 2, gpu_buffer = 0
[2025-10-28 15:30:13][warning][707013] Not a GPU map!
[2025-10-28 15:30:13][info][707013] Copying map fd 23 to device, key size=4, value size=193, max ent=1, map_type = 2, gpu_buffer = 0
[2025-10-28 15:30:13][info][707013] Recording kprobe for _Z9vectorAddPKfS0_Pf
[2025-10-28 15:30:13][info][707013] Attach successfully
[2025-10-28 15:30:13][info][707013] Got CUBIN section header size = 16, size = 3208
[2025-10-28 15:30:13][info][707013] Finally size = 3224
[2025-10-28 15:30:13][info][707013] Temporary fatbin written to /tmp/bpftime-fatbin-store.Awdkqi/temp.fatbin
[2025-10-28 15:30:13][info][707013] Listing functions in the patched ptx
[2025-10-28 15:30:14][info][707013] Extracted PTX at /tmp/bpftime-fatbin-store.Awdkqi/temp.ptx
[2025-10-28 15:30:14][info][707013] Patching with kprobe/kretprobe
[2025-10-28 15:30:14][info][707013] Compiling eBPF to PTX __probe_func___Z9vectorAddPKfS0_Pf, eBPF instructions count 131, with arguments false
[2025-10-28 15:30:14][info][707013] Recompiling PTX with nvcc..
[2025-10-28 15:30:14][info][707013] Working directory: /tmp/bpftime-recompile-nvcc
[2025-10-28 15:30:14][info][707013] PTX IN: /tmp/bpftime-recompile-nvcc/main.ptx
[2025-10-28 15:30:14][info][707013] Fatbin out /tmp/bpftime-recompile-nvcc/out.fatbin
[2025-10-28 15:30:14][info][707013] Starting nvcc: nvcc -O2 -G -g --keep-device-functions -arch=sm_60 /tmp/bpftime-recompile-nvcc/main.ptx -fatbin -o /tmp/bpftime-recompile-nvcc/out.fatbin
nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
ptxas warning :  .debug_abbrev section not found
ptxas warning :  .debug_info section not found
[2025-10-28 15:30:14][info][707013] NVCC execution done.
[2025-10-28 15:30:14][info][707013] Got patched fatbin in 71976 bytes
[2025-10-28 15:30:14][info][707013] Registering trampoline memory
[2025-10-28 15:30:14][info][707013] Register trampoline memory done
[2025-10-28 15:30:14][info][707013] Copying data to device symbols..
[2025-10-28 15:30:14][info][707013] Copying the followed map info:
[2025-10-28 15:30:14][info][707013] Mapid 20, enabled = true, key_size = 4, value_size = 8, max_ent=1, type=2
[2025-10-28 15:30:14][info][707013] Mapid 21, enabled = true, key_size = 4, value_size = 8, max_ent=10, type=2
[2025-10-28 15:30:14][info][707013] Mapid 22, enabled = true, key_size = 4, value_size = 8, max_ent=1, type=2
[2025-10-28 15:30:14][info][707013] Mapid 23, enabled = true, key_size = 4, value_size = 193, max_ent=1, type=2
[2025-10-28 15:30:14][info][707013] constData and map_basic_info copied..
CPU: cudaLaunchKernel called at ts=351641991221266 ns (calibrated=1761665414777835473), pid=707013
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665411182763456
C[0] = 0 (expected 0)
C[1] = 3 (expected 3)
CPU: cudaLaunchKernel called at ts=351643023688824 ns (calibrated=1761665415810303031), pid=707013
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665412214724512
C[0] = 0 (expected 0)
C[1] = 3 (expected 3)
CPU: cudaLaunchKernel called at ts=351644055795356 ns (calibrated=1761665416842409563), pid=707013
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665413246826592
C[0] = 0 (expected 0)
C[1] = 3 (expected 3)
CPU: cudaLaunchKernel called at ts=351645087870721 ns (calibrated=1761665417874484928), pid=707013
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
GPU: Kernel executing, latency=0 ns (bin=0), gpu_ts=1761665414278881472
C[0] = 0 (expected 0)
C[1] = 3 (expected 3)
CPU: cudaLaunchKernel called at ts=351646120079878 ns (calibrated=1761665418906694085), pid=707013
syscall server logs
dev@hacksaw:~/code/bpftime ((v0.2.0)) $ sudo BPFTIME_LOG_OUTPUT=console LD_PRELOAD=build/runtime/syscall-server/libbpftime-syscall-server.so  example/gpu/launchlate/launchlate
libbpf: loading object 'launchlate_bpf' from buffer
[2025-10-28 15:29:58.603] [info] [syscall_context.cpp:87] Init bpftime syscall mocking..
[2025-10-28 15:29:58.603] [info] [syscall_context.cpp:89] The log will be written to: console
[2025-10-28 15:29:58.603] [info] [syscall_context.cpp:92] Initializing CUDA at syscall-server side
[2025-10-28 15:29:58.639] [info] [syscall_context.cpp:101] CUDA initialized at syscall server side
[2025-10-28 15:29:58.639] [info] [syscall_server_utils.cpp:43] Starting syscall server..
[2025-10-28 15:29:58][info][706552] Initialize syscall server
[2025-10-28 15:29:58][info][706552] Global shm constructed. shm_open_type 0 for bpftime_maps_shm
[2025-10-28 15:29:58][info][706552] Global shm initialized
[2025-10-28 15:29:58][info][706552] bpftime-syscall-server started
libbpf: elf: section(2) .symtab, size 432, link 1, flags 0, type=2
libbpf: elf: section(3) uprobe/example/gpu/launchlate/vec_add:_Z16cudaLaunchKernelIcE9cudaErrorPT_4dim3S3_PPvmP11CUstream_st, size 480, link 0, flags 6, type=1
libbpf: sec 'uprobe/example/gpu/launchlate/vec_add:_Z16cudaLaunchKernelIcE9cudaErrorPT_4dim3S3_PPvmP11CUstream_st': found program 'uprobe_cuda_launch' at insn offset 0 (0 bytes), code size 60 insns (480 bytes)
libbpf: elf: section(4) kprobe/_Z9vectorAddPKfS0_Pf, size 1048, link 0, flags 6, type=1
libbpf: sec 'kprobe/_Z9vectorAddPKfS0_Pf': found program 'probe__cuda' at insn offset 0 (0 bytes), code size 131 insns (1048 bytes)
libbpf: elf: section(5) .maps, size 96, link 0, flags 3, type=1
libbpf: elf: section(6) .rodata.str1.1, size 193, link 0, flags 32, type=1
libbpf: elf: section(7) license, size 4, link 0, flags 3, type=1
libbpf: license of launchlate_bpf is GPL
libbpf: elf: section(8) .reluprobe/example/gpu/launchlate/vec_add:_Z16cudaLaunchKernelIcE9cudaErrorPT_4dim3S3_PPvmP11CUstream_st, size 32, link 2, flags 40, type=9
libbpf: elf: section(9) .relkprobe/_Z9vectorAddPKfS0_Pf, size 48, link 2, flags 40, type=9
libbpf: elf: section(10) .BTF, size 3034, link 0, flags 0, type=1
libbpf: elf: section(11) .BTF.ext, size 824, link 0, flags 0, type=1
libbpf: looking for externs among 18 symbols...
libbpf: collected 0 externs total
libbpf: map 'last_uprobe_time': at sec_idx 5, offset 0.
libbpf: map 'last_uprobe_time': found type = 2.
libbpf: map 'last_uprobe_time': found key [8], sz = 4.
libbpf: map 'last_uprobe_time': found value [12], sz = 8.
libbpf: map 'last_uprobe_time': found max_entries = 1.
libbpf: map 'time_histogram': at sec_idx 5, offset 32.
libbpf: map 'time_histogram': found type = 2.
libbpf: map 'time_histogram': found key [8], sz = 4.
libbpf: map 'time_histogram': found value [12], sz = 8.
libbpf: map 'time_histogram': found max_entries = 10.
libbpf: map 'clock_offset': at sec_idx 5, offset 64.
libbpf: map 'clock_offset': found type = 2.
libbpf: map 'clock_offset': found key [8], sz = 4.
libbpf: map 'clock_offset': found value [22], sz = 8.
libbpf: map 'clock_offset': found max_entries = 1.
libbpf: map '.rodata.str1.1' (global data): at sec_idx 6, offset 0, flags 80.
libbpf: map 3 is ".rodata.str1.1"
libbpf: sec '.reluprobe/example/gpu/launchlate/vec_add:_Z16cudaLaunchKernelIcE9cudaErrorPT_4dim3S3_PPvmP11CUstream_st': collecting relocation for section(3) 'uprobe/example/gpu/launchlate/vec_add:_Z16cudaLaunchKernelIcE9cudaErrorPT_4dim3S3_PPvmP11CUstream_st'
libbpf: sec '.reluprobe/example/gpu/launchlate/vec_add:_Z16cudaLaunchKernelIcE9cudaErrorPT_4dim3S3_PPvmP11CUstream_st': relo #0: insn #9 against 'clock_offset'
libbpf: prog 'uprobe_cuda_launch': found map 2 (clock_offset, sec 5, off 64) for insn #9
libbpf: sec '.reluprobe/example/gpu/launchlate/vec_add:_Z16cudaLaunchKernelIcE9cudaErrorPT_4dim3S3_PPvmP11CUstream_st': relo #1: insn #20 against 'last_uprobe_time'
libbpf: prog 'uprobe_cuda_launch': found map 0 (last_uprobe_time, sec 5, off 0) for insn #20
libbpf: sec '.relkprobe/_Z9vectorAddPKfS0_Pf': collecting relocation for section(4) 'kprobe/_Z9vectorAddPKfS0_Pf'
libbpf: sec '.relkprobe/_Z9vectorAddPKfS0_Pf': relo #0: insn #6 against 'last_uprobe_time'
libbpf: prog 'probe__cuda': found map 0 (last_uprobe_time, sec 5, off 0) for insn #6
libbpf: sec '.relkprobe/_Z9vectorAddPKfS0_Pf': relo #1: insn #51 against 'time_histogram'
libbpf: prog 'probe__cuda': found map 1 (time_histogram, sec 5, off 32) for insn #51
libbpf: sec '.relkprobe/_Z9vectorAddPKfS0_Pf': relo #2: insn #95 against 'time_histogram'
libbpf: prog 'probe__cuda': found map 1 (time_histogram, sec 5, off 32) for insn #95
libbpf: object 'launchlate_bpf': failed (-22) to create BPF token from '/sys/fs/bpf', skipping optional step...
libbpf: map 'last_uprobe_time': created successfully, fd=20
libbpf: map 'time_histogram': created successfully, fd=21
libbpf: map 'clock_offset': created successfully, fd=22
libbpf: map '.rodata.str1.1': created successfully, fd=23
Clock calibration: REALTIME - MONOTONIC = 1761313772786614207 ns
  MONOTONIC: 351625.856392649
  REALTIME:  1761665398.643006856
libbpf: elf: symbol address match for '_Z16cudaLaunchKernelIcE9cudaErrorPT_4dim3S3_PPvmP11CUstream_st' in 'example/gpu/launchlate/vec_add': 0x1c64
[2025-10-28 15:29:58][info][706552] Created uprobe/uretprobe perf event handler, module name example/gpu/launchlate/vec_add, offset 1c64
[2025-10-28 15:29:58][info][706552] Created kprobe/kretprobe perf event handler, func_name _Z9vectorAddPKfS0_Pf, addr 0

Monitoring CUDA kernel launch latency... Hit Ctrl-C to end.

15:30:00  Launch Latency Distribution:
latency         : count    distribution
Total samples: 0

15:30:02  Launch Latency Distribution:
latency         : count    distribution
Total samples: 0

15:30:04  Launch Latency Distribution:
latency         : count    distribution
Total samples: 0

15:30:06  Launch Latency Distribution:
latency         : count    distribution
Total samples: 0

15:30:08  Launch Latency Distribution:
latency         : count    distribution
Total samples: 0

15:30:10  Launch Latency Distribution:
latency         : count    distribution
Total samples: 0

15:30:12  Launch Latency Distribution:
latency         : count    distribution
Total samples: 0

15:30:14  Launch Latency Distribution:
latency         : count    distribution
Total samples: 0

15:30:16  Launch Latency Distribution:
latency         : count    distribution
0-100ns         : 20       |****************************************
Total samples: 20

15:30:18  Launch Latency Distribution:
latency         : count    distribution
0-100ns         : 40       |****************************************
Total samples: 40

^C
cudaHostUnregister() failed: CUDA driver version is insufficient for CUDA runtime version

Desktop (please complete the following information):

  • OS: Ubuntu jammy (22.04)
  • Compiler: clang 16.0.4
  • CUDA version: 12.9

Additional context

In order to understand why the delta always returns 0, I tried to inspect the value of uprobe_ts by adding it to the following bpf_printk. This change causes the vec_add kernel binary to fail to load: the failure happens when bpftime tries to inject its PTX code. Please see the logs from the syscall server and the example below, taken with the following code change applied:

diff --git a/example/gpu/launchlate/launchlate.bpf.c b/example/gpu/launchlate/launchlate.bpf.c
index 3740031..1ccbd05 100644
--- a/example/gpu/launchlate/launchlate.bpf.c
+++ b/example/gpu/launchlate/launchlate.bpf.c
@@ -109,8 +109,8 @@ int probe__cuda()
                        bpf_map_update_elem(&time_histogram, &bin, &one, BPF_NOEXIST);
                }

-               bpf_printk("GPU: Kernel executing, latency=%lu ns (bin=%u), gpu_ts=%lu\n",
-                          delta_ns, bin, gpu_ts);
+               bpf_printk("GPU: Kernel executing, latency=%lu ns (bin=%u), gpu_ts=%lu uprobe_ts=%lu\n",
+                          delta_ns, bin, gpu_ts, *uprobe_ts);
        } else {
                bpf_printk("GPU: Kernel executing but no uprobe timestamp found, gpu_ts=%lu\n", gpu_ts);
        }
vec_add kernel failure
dev@hacksaw:~/code/bpftime ((v0.2.0)) $ sudo env "PATH=$PATH" BPFTIME_LOG_OUTPUT=console LD_PRELOAD=build/runtime/agent/libbpftime-agent.so  example/gpu/launchlate/vec_add
[2025-10-28 15:40:51.246] [info] [bpftime_shm_internal.cpp:928] Registered shared memory with CUDA: addr=0x7f07a8f60000 size=20971520
[2025-10-28 15:40:51.246] [info] [bpftime_shm_internal.cpp:756] Global shm constructed. shm_open_type 1 for bpftime_maps_shm
[2025-10-28 15:40:51.246] [info] [bpftime_shm_internal.cpp:42] Global shm initialized
[2025-10-28 15:40:51][info][724006] Initializing CUDA shared memory
[2025-10-28 15:40:52][info][724006] CUDA context created
[2025-10-28 15:40:52][info][724006] bpf_attach_ctx constructed
[2025-10-28 15:40:52][info][724026] CUDA watcher thread started
[2025-10-28 15:40:52][info][724006] Register attach-impl defined helper bpf_get_func_arg, index 183
[2025-10-28 15:40:52][info][724006] Register attach-impl defined helper bpf_get_func_ret_id, index 184
[2025-10-28 15:40:52][info][724006] Register attach-impl defined helper bpf_get_retval, index 186
[2025-10-28 15:40:52][info][724006] Starting nv_attach_impl
[2025-10-28 15:40:52][info][724006] Initializing agent..
[2025-10-28 15:40:52][info][724006] Executable path: /home/dev/code/bpftime/example/gpu/launchlate/vec_add
[2025-10-28 15:40:52][info][724006] Skipping nv attach handler 36 since we are not handling nv handles
[2025-10-28 15:40:52][info][724006] Main initializing for handlers done, try to initialize cuda link handles....
[2025-10-28 15:40:52][info][724006] Handling link to CUDA program: 36, recording it..
[2025-10-28 15:40:52][info][724006] Loaded 115 instructions (original) for cuda ebpf program
[2025-10-28 15:40:52][warning][724006] Not a GPU map!
[2025-10-28 15:40:52][info][724006] Copying map fd 20 to device, key size=4, value size=8, max ent=1, map_type = 2, gpu_buffer = 0
[2025-10-28 15:40:52][warning][724006] Not a GPU map!
[2025-10-28 15:40:52][info][724006] Copying map fd 21 to device, key size=4, value size=8, max ent=10, map_type = 2, gpu_buffer = 0
[2025-10-28 15:40:52][warning][724006] Not a GPU map!
[2025-10-28 15:40:52][info][724006] Copying map fd 22 to device, key size=4, value size=8, max ent=1, map_type = 2, gpu_buffer = 0
[2025-10-28 15:40:52][warning][724006] Not a GPU map!
[2025-10-28 15:40:52][info][724006] Copying map fd 23 to device, key size=4, value size=74, max ent=1, map_type = 2, gpu_buffer = 0
[2025-10-28 15:40:52][warning][724006] Not a GPU map!
[2025-10-28 15:40:52][info][724006] Copying map fd 24 to device, key size=4, value size=133, max ent=1, map_type = 2, gpu_buffer = 0
[2025-10-28 15:40:52][info][724006] Recording kprobe for _Z9vectorAddPKfS0_Pf
[2025-10-28 15:40:52][info][724006] Attach successfully
[2025-10-28 15:40:52][info][724006] Got CUBIN section header size = 16, size = 3208
[2025-10-28 15:40:52][info][724006] Finally size = 3224
[2025-10-28 15:40:52][info][724006] Temporary fatbin written to /tmp/bpftime-fatbin-store.NubRRX/temp.fatbin
[2025-10-28 15:40:52][info][724006] Listing functions in the patched ptx
[2025-10-28 15:40:52][info][724006] Extracted PTX at /tmp/bpftime-fatbin-store.NubRRX/temp.ptx
[2025-10-28 15:40:52][info][724006] Patching with kprobe/kretprobe
[2025-10-28 15:40:52][info][724006] Compiling eBPF to PTX __probe_func___Z9vectorAddPKfS0_Pf, eBPF instructions count 115, with arguments false
map_val is not provided, unable to compile at pc 108
cudaHostUnregister() failed: driver shutting down
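
If I'm reading libbpf's bpf_helpers.h correctly, a bpf_printk with more than three format arguments is lowered to bpf_trace_vprintk with an on-stack data array, so my guess is that the extra argument, rather than the map value itself, is what the eBPF-to-PTX compiler rejects at pc 108. A possible workaround (an untested sketch that would slot into the same place in probe__cuda(), reusing the delta_ns, bin, gpu_ts and uprobe_ts variables already defined there) is to keep each call at three arguments:

                /* Untested sketch: two <=3-argument bpf_printk calls instead of
                 * one 4-argument call, which libbpf lowers to bpf_trace_vprintk. */
                bpf_printk("GPU: Kernel executing, latency=%lu ns (bin=%u), gpu_ts=%lu\n",
                           delta_ns, bin, gpu_ts);
                bpf_printk("GPU: uprobe_ts=%lu\n", *uprobe_ts);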

ddelnano avatar Oct 28 '25 15:10 ddelnano

I also tried the example at the most recent main commit (d561e85c1650f932d5a56a49cc4108019eefa51d). I ran into compilation issues that seem related to the recent #479, so I wasn't able to fully test whether a newer version fixes this.

ddelnano avatar Oct 28 '25 15:10 ddelnano

We fixed the issue in #479; you can give it another try.

Sy0307 avatar Nov 03 '25 12:11 Sy0307

After pulling in those changes, I'm now hitting #488. I'll watch that issue and retest once that fix is available.

ddelnano avatar Nov 04 '25 00:11 ddelnano

We have fixed issue #488 in PR #489. Please try again and give feedback if you run into any issues.

Sy0307 avatar Nov 04 '25 04:11 Sy0307

@Sy0307 thanks for the update. While the program runs again, I'm still seeing that the delta_ns value is always 0.

It seems that the GPU timestamp is far behind the uprobe_ts timestamp (by roughly 9.2 s in the logs below). How is the clock_offset map supposed to be populated?

Code change

diff --git a/example/gpu/launchlate/launchlate.bpf.c b/example/gpu/launchlate/launchlate.bpf.c
index 2f14eb2..da8b445 100644
--- a/example/gpu/launchlate/launchlate.bpf.c
+++ b/example/gpu/launchlate/launchlate.bpf.c
@@ -93,6 +93,8 @@ int cuda__probe()

        // Get the last uprobe timestamp
        uprobe_ts = bpf_map_lookup_elem(&last_uprobe_time, &key);
+        bpf_printk("GPU: uprobe_ts=%p, *uprobe_ts=%lu gpu_ts=%lu\n",
+                   uprobe_ts, *uprobe_ts, gpu_ts);
        if (uprobe_ts && *uprobe_ts > 0) {
                // Calculate time difference between uprobe and GPU execution
                if (gpu_ts > *uprobe_ts) {

vec_add logs

$ BPFTIME_LOG_OUTPUT=console LD_PRELOAD=build/runtime/agent/libbpftime-agent.so  example/gpu/launchlate/vec_add

[ ... ]

[2025-11-04 14:25:28][debug][23954] CUDA: Executing bpf_printk: GPU: uprobe_ts=%p, *uprobe_ts=%lu gpu_ts=%lu
, arg1=140297605761344, arg2=1762266328355461513, arg3=1762266319190245312
[2025-11-04 14:25:28][debug][23954] CUDA Received call request id 6, map_ptr = 0, map_fd = 0
GPU: uprobe_ts=0x7f9994f44940, *uprobe_ts=1762266328355461513 gpu_ts=1762266319190245312
[2025-11-04 14:25:28][debug][23954] CUDA: Executing bpf_printk: GPU: uprobe_ts=%p, *uprobe_ts=%lu gpu_ts=%lu
, arg1=140297605761344, arg2=1762266328355461513, arg3=1762266319190245312
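
For reference, this is my mental model of the CPU-side half, reconstructed from the libbpf relocation logs above rather than from the actual launchlate.bpf.c, so treat it as a sketch: the uprobe on cudaLaunchKernel shifts a MONOTONIC timestamp into the REALTIME domain using clock_offset and stores it in last_uprobe_time, which the GPU-side probe then compares against its own gpu_ts.

// Sketch only -- reconstructed from the relocation logs, not the real source.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} last_uprobe_time SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} clock_offset SEC(".maps");

/* The real section name pins the vec_add binary path and the mangled
 * cudaLaunchKernel<char> symbol, as shown in the libbpf logs above. */
SEC("uprobe")
int uprobe_cuda_launch(void *ctx)
{
	__u32 key = 0;
	__u64 ts = bpf_ktime_get_ns();          /* CLOCK_MONOTONIC ns */
	__u64 *off = bpf_map_lookup_elem(&clock_offset, &key);

	if (off)
		ts += *off;                     /* shift into REALTIME domain */
	bpf_map_update_elem(&last_uprobe_time, &key, &ts, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";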

ddelnano avatar Nov 04 '25 14:11 ddelnano

Could you test again after the fix in #495?

Sy0307 avatar Nov 10 '25 18:11 Sy0307

@Sy0307 thanks for all the fixes!

I retested against 11da28d4be264f5bb45f0cd33c5b1cfeedbbd192. The C[0] and C[1] outputs match their expected values now, but I'm still seeing no latency reported:

[2025-11-11 05:35:18][debug][2451871] CUDA: Executing bpf_printk: GPU: Kernel executing, latency=%lu ns (bin=%u), gpu_ts=%lu
, arg1=0, arg2=0, arg3=1762839304858547936
C[0] = 0 (expected 0)
C[1] = 3 (expected 3)

I assume bpftime should be populating the clock_offset map behind the scenes? How is that mechanism supposed to work?
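
Whichever side is supposed to do it (bpftime itself or the example's loader), I'd expect the population step to look roughly like the following. This is a sketch using plain libbpf calls; populate_clock_offset() is a hypothetical helper, not anything in launchlate.c. The offset is the same REALTIME - MONOTONIC delta that the "Clock calibration" line in the syscall server logs prints.

/* Hypothetical helper, not from launchlate.c: compute REALTIME - MONOTONIC
 * once and store it at key 0 of the clock_offset array map, so the uprobe
 * program can convert bpf_ktime_get_ns() into the REALTIME domain. */
#include <bpf/bpf.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <time.h>

static uint64_t clock_ns(clockid_t id)
{
	struct timespec ts;

	clock_gettime(id, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

static int populate_clock_offset(int clock_offset_map_fd)
{
	uint32_t key = 0;
	uint64_t offset = clock_ns(CLOCK_REALTIME) - clock_ns(CLOCK_MONOTONIC);

	/* clock_offset is an ARRAY map with one u64 entry (see the libbpf logs),
	 * so a single update at key 0 is enough. */
	return bpf_map_update_elem(clock_offset_map_fd, &key, &offset, BPF_ANY);
}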

ddelnano avatar Nov 11 '25 05:11 ddelnano

Fixed in #501.

Sy0307 avatar Nov 24 '25 12:11 Sy0307