luminal
CUDA Context Fix
- Added a shared CUDA context (`OnceLock`) and used it across luminal_cuda and luminal_2
- Replaced all `CudaContext::new(0)` calls with the shared context; fixed a u32→usize cast
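The shared-context pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `DeviceContext`, `shared_context`, `init_count`, and `to_index` are hypothetical names, and `DeviceContext` is a stand-in for cudarc's context type so the example compiles without a GPU or the CUDA toolkit.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};

/// Hypothetical stand-in for the real CUDA context type (assumption:
/// the real type is handed out behind an `Arc`, as modeled here).
#[derive(Debug)]
pub struct DeviceContext {
    pub ordinal: usize,
}

/// Counts how many times a context was actually constructed.
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

impl DeviceContext {
    fn new(ordinal: usize) -> Arc<Self> {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
        Arc::new(DeviceContext { ordinal })
    }
}

/// Process-wide context, created lazily on first use.
static SHARED_CTX: OnceLock<Arc<DeviceContext>> = OnceLock::new();

/// Returns a handle to the single shared context for device 0.
/// Every caller gets a clone of the same `Arc`, so no second
/// context is created after the first call.
pub fn shared_context() -> Arc<DeviceContext> {
    Arc::clone(SHARED_CTX.get_or_init(|| DeviceContext::new(0)))
}

/// Number of contexts constructed so far (for illustration only).
pub fn init_count() -> usize {
    INIT_COUNT.load(Ordering::SeqCst)
}

/// Checked u32 -> usize conversion instead of a bare `as` cast.
pub fn to_index(n: u32) -> usize {
    usize::try_from(n).expect("u32 value does not fit in usize on this target")
}
```

Call sites that previously constructed their own context would call `shared_context()` instead. `OnceLock::get_or_init` runs the initializer at most once, even when several threads race on the first call.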
Correctness still needs to be verified. I planned to do that yesterday, but the Lambda node was shut down just as I was about to continue working on it.
TODO: Verify the statement above and remove the comments (my notes on some elements in the code).
I tried running it locally and now get the panic below with the matmul example, which I did not see on the Lambda node with an H100. The same panic occurs both with and without the changes in this PR:
thread 'main' (222371) panicked at /home/anton/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.16.6/src/nvrtc/sys/mod.rs:579:18:
Expected symbol in library: DlSym { desc: "/opt/cuda/lib64/libnvrtc.so: undefined symbol: nvrtcGetNVVM" }
stack backtrace:
0: 0x55cc31a7bf12 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::he2aba4f1d4ea1fbd
1: 0x55cc31a8f4ef - core::fmt::write::h5b6d723e88f3973a
2: 0x55cc31a4b151 - std::io::Write::write_fmt::hfb264c83805b0a19
3: 0x55cc31a57442 - std::sys::backtrace::BacktraceLock::print::ha2f5782cdbcccd40
4: 0x55cc31a5d0ec - std::panicking::default_hook::{{closure}}::ha8f0b2a22fbd9290
5: 0x55cc31a5cf46 - std::panicking::default_hook::he3ca409e17c78e5f
6: 0x55cc31a5d775 - std::panicking::panic_with_hook::h64da284505672a54
7: 0x55cc31a5d60a - std::panicking::panic_handler::{{closure}}::hed6200fceb2a07a8
8: 0x55cc31a57579 - std::sys::backtrace::__rust_end_short_backtrace::h7e403b75b11a9d15
9: 0x55cc31a3e7dd - __rustc[3d1ee1440eab7a60]::rust_begin_unwind
10: 0x55cc30d69770 - core::panicking::panic_fmt::h366dbf4a636b0c49
11: 0x55cc30d69246 - core::result::unwrap_failed::h9547866b9642f875
12: 0x55cc31a380cb - cudarc::nvrtc::sys::loaded::Lib::new::h1082bc3423ee8e35
13: 0x55cc31a3566c - std::sync::poison::once::Once::call_once_force::{{closure}}::h7e6e354eac643a8a
14: 0x55cc30d63617 - std::sys::sync::once::futex::Once::call::hf42d57099a0be3fe
15: 0x55cc30d62765 - std::sync::once_lock::OnceLock<T>::initialize::hece25be20994c1b6
16: 0x55cc31a36bb9 - cudarc::nvrtc::result::create_program::h82f243cbc1f0b495
17: 0x55cc30de1e1d - cudarc::nvrtc::safe::compile_ptx_with_opts::hb76741ce5230d5ce
18: 0x55cc30db0fcb - luminal_2::run::compile_kernels::hf29cc249943d02cd
19: 0x55cc30dec1dd - luminal_2::extract::cost::h739fe2cbb9472171
20: 0x55cc30ded095 - luminal_2::extract::search::h47d87febca593ef1
21: 0x55cc30d6f3cc - matmul::main::hc201344469d6905e
22: 0x55cc30d74733 - std::sys::backtrace::__rust_begin_short_backtrace::hd24bcec9c915583c
23: 0x55cc30d75ed9 - std::rt::lang_start::{{closure}}::h21aaa901334c090d
24: 0x55cc31a4cb50 - std::rt::lang_start_internal::h416e1497f666f6ed
25: 0x55cc30d72685 - main
26: 0x7fcec5027675 - <unknown>
27: 0x7fcec5027729 - __libc_start_main
28: 0x55cc30d69795 - _start
29: 0x0 - <unknown>