
CUDA Context Fix

Open • amemov opened this issue 2 months ago • 1 comment

  • Added a shared CUDA context (stored in a OnceLock) and used it across luminal_cuda and luminal_2
  • Replaced all CudaContext::new(0) calls with the shared context; fixed a u32→usize cast
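
For readers unfamiliar with the pattern, here is a minimal sketch of a process-wide shared context behind a OnceLock. It is not the code from this PR: DeviceContext, INIT_COUNT, and shared_context are hypothetical names, and DeviceContext is a std-only stand-in for cudarc's CudaContext so the sketch compiles without a GPU.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for cudarc's CudaContext (assumption: the real
// constructor is fallible and would need error handling here).
#[derive(Debug)]
struct DeviceContext {
    ordinal: usize,
}

// Counts how many times a context was actually constructed.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

impl DeviceContext {
    fn new(ordinal: usize) -> Arc<Self> {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        Arc::new(Self { ordinal })
    }
}

// One context for the whole process, created lazily on first use.
static SHARED_CTX: OnceLock<Arc<DeviceContext>> = OnceLock::new();

// Call sites use this instead of constructing their own context.
fn shared_context() -> Arc<DeviceContext> {
    SHARED_CTX.get_or_init(|| DeviceContext::new(0)).clone()
}
```

Every call to shared_context() returns a clone of the same Arc, so the constructor runs at most once even when several crates or threads ask for the context.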

I still need to check correctness. I was going to do it yesterday, but the Lambda node was closed just as I was about to continue working on it.

TODO: Verify the statements above and remove the code comments (my own notes on some elements of the code).

amemov, Nov 05 '25 01:11

I tried running it locally and now get this panic with the matmul example, which I didn't see on the Lambda node with an H100. The same log is produced with and without the changes in this PR:

thread 'main' (222371) panicked at /home/anton/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.16.6/src/nvrtc/sys/mod.rs:579:18:
Expected symbol in library: DlSym { desc: "/opt/cuda/lib64/libnvrtc.so: undefined symbol: nvrtcGetNVVM" }
stack backtrace:
   0:     0x55cc31a7bf12 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::he2aba4f1d4ea1fbd
   1:     0x55cc31a8f4ef - core::fmt::write::h5b6d723e88f3973a
   2:     0x55cc31a4b151 - std::io::Write::write_fmt::hfb264c83805b0a19
   3:     0x55cc31a57442 - std::sys::backtrace::BacktraceLock::print::ha2f5782cdbcccd40
   4:     0x55cc31a5d0ec - std::panicking::default_hook::{{closure}}::ha8f0b2a22fbd9290
   5:     0x55cc31a5cf46 - std::panicking::default_hook::he3ca409e17c78e5f
   6:     0x55cc31a5d775 - std::panicking::panic_with_hook::h64da284505672a54
   7:     0x55cc31a5d60a - std::panicking::panic_handler::{{closure}}::hed6200fceb2a07a8
   8:     0x55cc31a57579 - std::sys::backtrace::__rust_end_short_backtrace::h7e403b75b11a9d15
   9:     0x55cc31a3e7dd - __rustc[3d1ee1440eab7a60]::rust_begin_unwind
  10:     0x55cc30d69770 - core::panicking::panic_fmt::h366dbf4a636b0c49
  11:     0x55cc30d69246 - core::result::unwrap_failed::h9547866b9642f875
  12:     0x55cc31a380cb - cudarc::nvrtc::sys::loaded::Lib::new::h1082bc3423ee8e35
  13:     0x55cc31a3566c - std::sync::poison::once::Once::call_once_force::{{closure}}::h7e6e354eac643a8a
  14:     0x55cc30d63617 - std::sys::sync::once::futex::Once::call::hf42d57099a0be3fe
  15:     0x55cc30d62765 - std::sync::once_lock::OnceLock<T>::initialize::hece25be20994c1b6
  16:     0x55cc31a36bb9 - cudarc::nvrtc::result::create_program::h82f243cbc1f0b495
  17:     0x55cc30de1e1d - cudarc::nvrtc::safe::compile_ptx_with_opts::hb76741ce5230d5ce
  18:     0x55cc30db0fcb - luminal_2::run::compile_kernels::hf29cc249943d02cd
  19:     0x55cc30dec1dd - luminal_2::extract::cost::h739fe2cbb9472171
  20:     0x55cc30ded095 - luminal_2::extract::search::h47d87febca593ef1
  21:     0x55cc30d6f3cc - matmul::main::hc201344469d6905e
  22:     0x55cc30d74733 - std::sys::backtrace::__rust_begin_short_backtrace::hd24bcec9c915583c
  23:     0x55cc30d75ed9 - std::rt::lang_start::{{closure}}::h21aaa901334c090d
  24:     0x55cc31a4cb50 - std::rt::lang_start_internal::h416e1497f666f6ed
  25:     0x55cc30d72685 - main
  26:     0x7fcec5027675 - <unknown>
  27:     0x7fcec5027729 - __libc_start_main
  28:     0x55cc30d69795 - _start
  29:                0x0 - <unknown>

amemov, Nov 05 '25 23:11