
Cannot run examples with --features cuda option

Open dbrowne opened this issue 2 years ago • 54 comments

CARGO_PROFILE_RELEASE_BUILD_OVERRIDE_DEBUG=true
warning: some crates are on edition 2021 which defaults to `resolver = "2"`, but virtual workspaces default to `resolver = "1"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
   Compiling libc v0.2.147
   Compiling autocfg v1.1.0
   Compiling crossbeam-utils v0.8.16
   Compiling proc-macro2 v1.0.66
   Compiling unicode-ident v1.0.11
   Compiling rayon-core v1.11.0
   Compiling memchr v2.5.0
   Compiling libm v0.2.7
   Compiling cfg-if v1.0.0
   Compiling pkg-config v0.3.27
   Compiling paste v1.0.14
   Compiling serde v1.0.183
   Compiling serde_derive v1.0.183
   Compiling scopeguard v1.2.0
   Compiling syn v1.0.109
   Compiling serde_json v1.0.104
   Compiling seq-macro v0.3.5
   Compiling vcpkg v0.2.15
   Compiling crc32fast v1.3.2
   Compiling ident_case v1.0.1
   Compiling strsim v0.10.0
   Compiling fnv v1.0.7
   Compiling thiserror v1.0.44
   Compiling either v1.9.0
   Compiling glob v0.3.1
   Compiling openssl v0.10.56
   Compiling rustls v0.21.6
   Compiling anyhow v1.0.72
   Compiling cudarc v0.9.13
   Compiling portable-atomic v1.4.2
   Compiling native-tls v0.2.11
   Compiling esaxx-rs v0.1.8
   Compiling adler v1.0.2
   Compiling rustix v0.38.7
   Compiling gimli v0.27.3
   Compiling macro_rules_attribute-proc_macro v0.1.3
   Compiling rustc-demangle v0.1.23
   Compiling miniz_oxide v0.7.1
   Compiling heck v0.4.1
   Compiling flate2 v1.0.26
   Compiling memoffset v0.9.0
   Compiling crossbeam-epoch v0.9.15
   Compiling num-traits v0.2.16
   Compiling zip v0.6.6
   Compiling crossbeam-channel v0.5.8
   Compiling aho-corasick v1.0.2
   Compiling object v0.31.1
   Compiling nom v7.1.3
   Compiling aho-corasick v0.7.20
   Compiling quote v1.0.32
   Compiling macro_rules_attribute v0.1.3
   Compiling syn v2.0.28
   Compiling crossbeam-deque v0.8.3
   Compiling num_cpus v1.16.0
   Compiling getrandom v0.2.10
   Compiling dirs-sys v0.4.1
   Compiling console v0.15.7
   Compiling memmap2 v0.7.1
   Compiling regex-automata v0.3.6
   Compiling cc v1.0.82
   Compiling dirs v5.0.1
   Compiling rand_core v0.6.4
   Compiling num-complex v0.4.3
   Compiling rand_chacha v0.3.1
   Compiling indicatif v0.17.6
   Compiling rand v0.8.5
   Compiling addr2line v0.20.0
   Compiling rayon v1.7.0
   Compiling is-terminal v0.4.9
   Compiling ring v0.16.20
   Compiling openssl-sys v0.9.91
   Compiling rand_distr v0.4.3
   Compiling backtrace v0.3.68
   Compiling onig_sys v69.8.1
   Compiling anstream v0.3.2
   Compiling clap_builder v4.3.21
   Compiling half v2.3.1
   Compiling spm_precompiled v0.1.4
   Compiling regex v1.9.3
   Compiling darling_core v0.14.4
   Compiling fancy-regex v0.10.0
   Compiling candle-kernels v0.1.0 (/mnt/source1/djbGR/ruststuffs/candle/candle-kernels)
   Compiling candle-gemm-common v0.15.5
   Compiling rayon-cond v0.1.0
   Compiling candle-gemm-f32 v0.15.5
   Compiling candle-gemm-f64 v0.15.5
   Compiling candle-gemm-c64 v0.15.5
   Compiling candle-gemm-c32 v0.15.5
   Compiling safetensors v0.3.2
   Compiling candle-examples v0.1.0 (/mnt/source1/djbGR/ruststuffs/candle/candle-examples)
   Compiling tracing-chrome v0.7.1
   Compiling candle-gemm-f16 v0.15.5
error: failed to run custom build command for `candle-kernels v0.1.0 (/mnt/source1/djbGR/ruststuffs/candle/candle-kernels)`

Caused by:
  process didn't exit successfully: `/mnt/source1/djbGR/ruststuffs/candle/target/release/build/candle-kernels-e21ab5b8e8daaf0a/build-script-build` (exit status: 101)

--- stdout
cargo:rerun-if-changed=build.rs
cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
cargo:rerun-if-changed=src/
cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
cargo:rustc-env=CUDA_COMPUTE_CAP=sm_61

--- stderr
src/compatibility.cuh(19): error: function "__hmax_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmax_nan(__half a, __half b) {
  ^

src/compatibility.cuh(22): error: function "__hmin_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmin_nan(__half a, __half b) {
  ^

src/compatibility.cuh(19): error: function "__hmax_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmax_nan(__half a, __half b) {
  ^

src/compatibility.cuh(22): error: function "__hmin_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmin_nan(__half a, __half b) {
  ^

src/compatibility.cuh(19): error: function "__hmax_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmax_nan(__half a, __half b) {
  ^

src/compatibility.cuh(22): error: function "__hmin_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmin_nan(__half a, __half b) {
  ^

src/compatibility.cuh(19): error: function "__hmax_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmax_nan(__half a, __half b) {
  ^

src/compatibility.cuh(22): error: function "__hmin_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmin_nan(__half a, __half b) {
  ^

src/compatibility.cuh(19): error: function "__hmax_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmax_nan(__half a, __half b) {
  ^

src/compatibility.cuh(22): error: function "__hmin_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmin_nan(__half a, __half b) {
  ^

2 errors detected in the compilation of "src/indexing.cu".
src/compatibility.cuh(19): error: function "__hmax_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmax_nan(__half a, __half b) {
  ^

src/compatibility.cuh(22): error: function "__hmin_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmin_nan(__half a, __half b) {
  ^

2 errors detected in the compilation of "src/affine.cu".
src/compatibility.cuh(19): error: function "__hmax_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmax_nan(__half a, __half b) {
  ^

src/compatibility.cuh(22): error: function "__hmin_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmin_nan(__half a, __half b) {
  ^

2 errors detected in the compilation of "src/cast.cu".
2 errors detected in the compilation of "src/reduce.cu".
2 errors detected in the compilation of "src/conv.cu".
src/compatibility.cuh(19): error: function "__hmax_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmax_nan(__half a, __half b) {
  ^

src/compatibility.cuh(22): error: function "__hmin_nan(__half, __half)" has already been defined
  __attribute__((device)) inline __attribute__((always_inline)) __half __hmin_nan(__half a, __half b) {
  ^

2 errors detected in the compilation of "src/ternary.cu".
2 errors detected in the compilation of "src/unary.cu".
2 errors detected in the compilation of "src/binary.cu".
thread 'main' panicked at 'nvcc error while compiling "src/affine.cu":

# stdout

# stderr

', candle-kernels/build.rs:207:13
stack backtrace:
   0: 0x557f8498d0b1 - std::backtrace_rs::backtrace::libunwind::trace::hb01a67340c9cfb71 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1: 0x557f8498d0b1 - std::backtrace_rs::backtrace::trace_unsynchronized::h896aca561948c930 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2: 0x557f8498d0b1 - std::sys_common::backtrace::_print_fmt::h8627be5b68fbde29 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:65:5
   3: 0x557f8498d0b1 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1b7758da45f4cd22 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:44:22
   4: 0x557f849b282c - core::fmt::rt::Argument::fmt::h0eb38586043a01ca at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/fmt/rt.rs:138:9
   5: 0x557f849b282c - core::fmt::write::h68b52f8aa598961e at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/fmt/mod.rs:1094:21
   6: 0x557f8498949e - std::io::Write::write_fmt::hc5568929b662da92 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/io/mod.rs:1714:15
   7: 0x557f8498cec5 - std::sys_common::backtrace::_print::h65aecbff12ca83c8 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:47:5
   8: 0x557f8498cec5 - std::sys_common::backtrace::print::hf75ac9d60598d247 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:34:9
   9: 0x557f8498e483 - std::panicking::default_hook::{{closure}}::hc2cb8da3be7476b0 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:269:22
  10: 0x557f8498e19d - std::panicking::default_hook::hefa49c86da66275b at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:288:9
  11: 0x557f8498ea09 - std::panicking::rust_panic_with_hook::hd4c3b0056ba96951 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:705:13
  12: 0x557f8498e907 - std::panicking::begin_panic_handler::{{closure}}::he487675683e9a525 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:597:13
  13: 0x557f8498d516 - std::sys_common::backtrace::__rust_end_short_backtrace::hcff58b9b81620321 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:151:18
  14: 0x557f8498e652 - rust_begin_unwind at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:593:5
  15: 0x557f848b9333 - core::panicking::panic_fmt::h1b81548733a03bd5 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/panicking.rs:67:14
  16: 0x557f848c3323 - build_script_build::cuda::build_ptx::ha488acce3cd701b3 at /mnt/source1/djbGR/ruststuffs/candle/candle-kernels/build.rs:207:13
  17: 0x557f848c0878 - build_script_build::main::h2523e6c20b65fa04 at /mnt/source1/djbGR/ruststuffs/candle/candle-kernels/build.rs:6:33
  18: 0x557f848d40cb - core::ops::function::FnOnce::call_once::h385ddf31127d3e12 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/ops/function.rs:250:5
  19: 0x557f848ccbae - std::sys_common::backtrace::__rust_begin_short_backtrace::h1cfd550c72c3e194 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/sys_common/backtrace.rs:135:18
  20: 0x557f848e0130 - std::rt::lang_start::{{closure}}::h70dc5fa7783a03f7 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/rt.rs:166:18
  21: 0x557f8498541b - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h9eccf02cf11756f6 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/core/src/ops/function.rs:284:13
  22: 0x557f8498541b - std::panicking::try::do_call::hc95b838862bbb45a at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:500:40
  23: 0x557f8498541b - std::panicking::try::h82935254d12a76fc at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:464:19
  24: 0x557f8498541b - std::panic::catch_unwind::h7fd9d11cd70fc350 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panic.rs:142:14
  25: 0x557f8498541b - std::rt::lang_start_internal::{{closure}}::h0ddb191e68b650a4 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/rt.rs:148:48
  26: 0x557f8498541b - std::panicking::try::do_call::h17d4693c7a6e120c at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:500:40
  27: 0x557f8498541b - std::panicking::try::h684fc020e1305912 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panicking.rs:464:19
  28: 0x557f8498541b - std::panic::catch_unwind::h757da538db515116 at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/panic.rs:142:14
  29: 0x557f8498541b - std::rt::lang_start_internal::ha6b1625a1e9a4f5b at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/rt.rs:148:20
  30: 0x557f848e010a - std::rt::lang_start::h0d1360f20fc735dd at /rustc/39f42ad9e8430a8abb06c262346e89593278c515/library/std/src/rt.rs:165:17
  31: 0x557f848c43fe - main
  32: 0x7fd8be429d90 - __libc_start_call_main at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  33: 0x7fd8be429e40 - __libc_start_main_impl at ./csu/../csu/libc-start.c:392:3
  34: 0x557f848b9a15 - _start
  35: 0x0 -

dbrowne avatar Aug 08 '23 21:08 dbrowne

What OS and CUDA version are you using? I seem to be having a similar issue (with loads of C/CU/C++ errors) with both the crates.io and the github versions on Windows.

This is my Cargo.toml:

[package]
name = "candle_test"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
# candle-core = {git = "https://github.com/huggingface/candle.git", branch = "main", features = ["cuda"]}
candle-core = {features = ["cuda"], version = "0.1.0"}
# candle-nn = {git = "https://github.com/huggingface/candle.git", branch = "main"}

Rust version is 1.71.0. I'm running CUDA 11.7, which could be the problem, but I can't see anything about what version I should use.

Dominically avatar Aug 09 '23 00:08 Dominically

Linux version 6.2.0-26-generic (buildd@bos03-amd64-042) (x86_64-linux-gnu-gcc-11 (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2

rustc 1.73.0-nightly (39f42ad9e 2023-07-19)

Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

dbrowne avatar Aug 09 '23 02:08 dbrowne

Also seeing thread 'main' panicked at 'nvcc error while compiling "src/affine.cu"' on Arch Linux.

$ rustc --version
rustc 1.71.0 (8ede3aae2 2023-07-12)
$ pacman -Qi cuda
Name            : cuda
Version         : 12.2.0-1
Description     : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL             : https://developer.nvidia.com/cuda-zone
Licenses        : custom:NVIDIA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk  libcudart.so=12-64  libcublas.so=12-64
                  libcusolver.so=11-64  libcusparse.so=12-64
Depends On      : opencl-nvidia  nvidia-utils  python  gcc12
Optional Deps   : gdb: for cuda-gdb [installed]
                  glu: required for some profiling tools in CUPTI [installed]
Required By     : cudnn  magma-cuda  python-pycuda  python-pytorch-cuda  python-tensorflow-opt-cuda
                  tensorflow-opt-cuda
Optional For    : meshroom-bin  openmpi
Conflicts With  : None
Replaces        : cuda-toolkit  cuda-sdk  cuda-static
Installed Size  : 4.36 GiB
Packager        : Sven-Hendrik Haase <[email protected]>
Build Date      : Sun 02 Jul 2023 01:59:36 PM MDT
Install Date    : Sun 16 Jul 2023 07:35:25 AM MDT
Install Reason  : Installed as a dependency for another package
Install Script  : Yes
Validated By    : Signature

n8henrie avatar Aug 09 '23 03:08 n8henrie

What cards are you guys running?

We need compute_cap > 7.0 for it to work. I know compute_cap 5.2 triggers similar failures.

The core kernels we have use f16 and bf16, and those old cards cannot compile them properly. I added some flags for some options, but I didn't check all potential caps yet.

Narsil avatar Aug 09 '23 07:08 Narsil
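
The f16 dependency is easy to reproduce outside candle. Below is a hypothetical minimal kernel (hmax_f16 is an invented name, not one of the candle kernels): half-precision intrinsics such as __hmax are only provided for certain architecture/toolkit combinations, so compiling this with nvcc --ptx --gpu-architecture=sm_61 fails on toolkits that lack the emulation fallbacks, while newer toolkits or sm_80 targets compile it cleanly.

#include <cuda_fp16.h>

// Hypothetical repro, not taken from the candle sources: the kind of f16
// arithmetic the candle kernels depend on.
__global__ void hmax_f16(const __half* a, const __half* b, __half* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __hmax is absent on older arch/toolkit pairs; this is the gap
        // that compatibility.cuh tries to paper over.
        out[i] = __hmax(a[i], b[i]);
    }
}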

I could get my own project to compile by specifying the version on the command line. ~~To see what your system supports use nvcc --list-gpu-code~~

CUDA_COMPUTE_CAP=90 cargo build --release --features cuda

However, during runtime I get DriverError(CUDA_ERROR_NOT_SUPPORTED, "operation not supported"). So GPU acceleration does not work for me. It looks like that's something else though, as this also doesn't work for me: https://askubuntu.com/a/1215237. That's not related to candle at all, so no need to fix that in this thread.

krolinventions avatar Aug 09 '23 08:08 krolinventions

OK, did some more investigation. It turns out that my device only supports up to 50. So I can get pure CUDA code to run if I compile with nvcc -arch=sm_50. I can also get my application that uses candle to compile with that, but it still gives me the driver error, so I guess it's not supported.

My GPU (Quadro M620) runs torch fine, so it would be great if candle could add support for it! Mainly because it's a nice laptop to develop on.

krolinventions avatar Aug 09 '23 08:08 krolinventions

@krolinventions I perfectly understand. My own GTX 970 is too old to run candle atm.

However, in order to deliver fast we had to cut corners in that department. Currently I'd like to focus on giving a good error message before actually writing kernels that work on old hardware. I may do it in my spare time to be able to use my old GPU, but I know the time it takes.

If you want to take a stab at it, you're more than welcome!

Narsil avatar Aug 09 '23 08:08 Narsil

@Narsil Actually, I think just using the CPU for development is fine. It's great to not have to deal with installing all that extra stuff, like with torch, or the CUDA libraries.

As for looking at the kernels: I have never used CUDA before, but from the few examples I've seen it looks rather nice. I think I may need a little more experience with it before tackling these, but maybe!

krolinventions avatar Aug 09 '23 09:08 krolinventions

Do try, it's not as daunting as it looks (it's daunting when you want the best possible performance). Feel free to join the HF Discord, in the candle channel, to continue the discussion.

I'll keep the issue open to track giving a better error message.

Narsil avatar Aug 09 '23 10:08 Narsil

We need compute_cap > 7.0 for it to work.

I'm using a 1080 Ti

$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
6.1

😢

n8henrie avatar Aug 09 '23 12:08 n8henrie

What cards are you guys running?

Just tried compiling with an RTX 3060 (compute cap 8.6) and CUDA 12.2 on Windows, and I'm still getting a massive log of C errors.

e.g.:

error: asm operand type size(8) does not match type/size implied by constraint 'r'
    static __declspec(__device__) __inline longlong2 __ldg(const longlong2 *ptr) { longlong2 ret; asm volatile ("ld.global.nc.v2.s64 {%0,%1}, [%2];"  : "=l"(ret.x), "=l"(ret.y) : "r" (ptr)); return ret; }

along with other errors that are repeated many times.

Dominically avatar Aug 09 '23 13:08 Dominically
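
For context on that particular message: in CUDA inline PTX, the constraint letter fixes the operand register width ("r" is a 32-bit register, "l" a 64-bit one), so binding a 64-bit pointer to "r" produces exactly this size mismatch. A minimal illustration, unrelated to candle's own code:

#include <cstdint>

// Reads a 64-bit value through the non-coherent cache, analogous to the
// __ldg overload in the error above. Pointer and result are both 64-bit and
// need the "l" constraint; substituting "r" for the pointer reproduces the
// "asm operand type size(8) does not match ... constraint 'r'" error.
__device__ uint64_t ld_global_u64(const uint64_t* ptr) {
    uint64_t ret;
    asm volatile("ld.global.nc.u64 %0, [%1];" : "=l"(ret) : "l"(ptr));
    return ret;
}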

Yes, Windows seems to be having issues. I've been told on Discord that WSL is OK.

Narsil avatar Aug 09 '23 17:08 Narsil

What cards are you guys running?

We need compute_cap > 7.0 for it to work. I know compute_cap 5.2 triggers similar failures.

The core kernels we have use f16 and bf16, and those old cards cannot compile them properly. I added some flags for some options, but I didn't check all potential caps yet.

My compute cap is 6.1. It is a stretch, but I'm willing to work on this if you point me in the right direction.

dbrowne avatar Aug 09 '23 19:08 dbrowne

@dbrowne

Go to candle/candle-kernels/src/

And try to make the .cu compile:

nvcc --ptx --gpu-architecture=sm_61 affine.cu -I.

Most of the logic should be in compatibility.cuh. 61 should be easier than 52 and earlier.

It seems I have a fix for 61.

Narsil avatar Aug 10 '23 07:08 Narsil

Can you take my PR out for a spin?

https://github.com/huggingface/candle/pull/386

It fixes compilation, but it still doesn't work on my 52 because the ops are still not there. However, once you have the PTX, you can test it outside of candle and debug by more classical means to try to understand why it compiles but fails to run.

Narsil avatar Aug 10 '23 07:08 Narsil

Looks like that PR has been merged!

On current master, my 1080TI now works like a charm, thank you!

On my Threadripper it takes 13-15s for the example:

$ time cargo run --example whisper --release 
    Finished release [optimized] target(s) in 0.28s
     Running `target/release/examples/whisper`
Running on CPU, to run on GPU, build this example with `--features cuda`
No audio file submitted: Downloading https://huggingface.co/datasets/Narsil/candle_demo/blob/main/samples_jfk.wav
loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
pcm data loaded 176000
loaded mel: [1, 80, 3000]
audio features: [1, 1500, 384]
3000: Segment { start: 0.0, duration: 30.0, dr: DecodingResult { tokens: [50257, 50363, 843, 523, 616, 5891, 3399, 1265, 407, 644, 534, 1499, 460, 466, 329, 345, 1265, 644, 345, 460, 466, 329, 534, 1499, 13, 50903, 50256], text: " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.", avg_logprob: -0.3303277552190798, no_speech_prob: 0.017772182822227478, temperature: 0.0, compression_ratio: NaN } }, in 15.577960389s

real	0m17.062s
user	0m12.536s
sys	0m3.221s

Enabling the cuda feature takes the decoding itself well below a second (~2 seconds total runtime). Wow!

$ time cargo run --example whisper --release --features cuda
    Finished release [optimized] target(s) in 0.31s
     Running `target/release/examples/whisper`
No audio file submitted: Downloading https://huggingface.co/datasets/Narsil/candle_demo/blob/main/samples_jfk.wav
loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
pcm data loaded 176000
loaded mel: [1, 80, 3000]
audio features: [1, 1500, 384]
3000: Segment { start: 0.0, duration: 30.0, dr: DecodingResult { tokens: [50257, 50363, 843, 523, 616, 5891, 3399, 1265, 407, 644, 534, 1499, 460, 466, 329, 345, 1265, 644, 345, 460, 466, 329, 534, 1499, 13, 50903, 50256], text: " And so my fellow Americans ask not what your country can do for you ask what you can do for your country.", avg_logprob: -0.3305633301574319, no_speech_prob: 0.017772099003195763, temperature: 0.0, compression_ratio: NaN } }, in 300.960168ms

real	0m2.166s
user	0m1.582s
sys	0m0.433s

CUDA-backed NNs in Rust!? This is really exciting :)

n8henrie avatar Aug 11 '23 15:08 n8henrie

@n8henrie This is far from optimized yet ;).

We ran a few passes, but there's still a lot more that can be done.

Narsil avatar Aug 11 '23 15:08 Narsil

@dbrowne

Go to candle/candle-kernels/src/

And try to make the .cu compile:

nvcc --ptx --gpu-architecture=sm_61 affine.cu -I.

Most of the logic should be in compatibility.cuh. 61 should be easier than 52 and earlier.

It seems I have a fix for 61.

I'm awaiting delivery of an RTX A4500. If it does not work in my workstation, I will begin in earnest to pursue this.

dbrowne avatar Aug 12 '23 05:08 dbrowne

Does it work now on main? I made fixes for older cards (still far from universal support, but it should be much better).

Narsil avatar Aug 12 '23 06:08 Narsil

Does it work now on main? I made fixes for older cards (still far from universal support, but it should be much better).

Yes

dbrowne avatar Aug 13 '23 22:08 dbrowne

@Narsil Also works for me (Quadro M620). Both the examples and my own code. Thanks!

krolinventions avatar Aug 14 '23 07:08 krolinventions

If people in the future have similar CUDA compilation errors (functions already being defined, etc.), it's because the compatibility header compatibility.cuh is not perfect. Not all graphics cards and driver versions have the same functions, which is why compatibility.cuh attempts to emulate them. Unfortunately, which GPUs and which drivers introduce each function is not well documented by NVIDIA, so several guesses were made regarding when to emulate things.

If you get such an error, please report it here (or as an issue on candle or dfdx). You can use these commands to give helpful context:

> nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA T500, 7.5, 536.25

> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

For more information:

  • Compute capability explanation: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
  • Tables showing compute capability for each GPU: https://developer.nvidia.com/cuda-gpus

ViliamVadocz avatar Aug 16 '23 10:08 ViliamVadocz
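
To make that concrete, here is an illustrative sketch of the guard pattern such a compatibility header relies on. It is not the actual contents of compatibility.cuh, and the arch/version cutoffs are placeholder guesses of exactly the kind described above:

#include <cuda_fp16.h>

// Emulate the NaN-propagating min/max only where the toolkit does not
// already ship them (this sketch assumes __hmax/__hmin themselves exist).
// If the condition is too loose, the definition clashes with the toolkit's
// own ("function ... has already been defined", this issue); if it is too
// strict, kernels hit "identifier ... is undefined" instead.
#if __CUDA_ARCH__ < 800 && \
    (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 2))
__device__ __forceinline__ __half __hmax_nan(__half a, __half b) {
    return __hisnan(a) ? a : (__hisnan(b) ? b : __hmax(a, b));
}
__device__ __forceinline__ __half __hmin_nan(__half a, __half b) {
    return __hisnan(a) ? a : (__hisnan(b) ? b : __hmin(a, b));
}
#endif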

Getting a similar error using a 2080 Ti with CUDA version 12.2, driver 535.86.05, on Pop!_OS 22.04.

compatibility.cuh(11): error: identifier "__hmax" is undefined

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: identifier "__hmin" is undefined

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

This might be a result of the CUDA toolkit being an older version. Pop!_OS does not have a newer driver compatible with the latest CUDA toolkit yet. Hopefully driver differences won't matter much in the future.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

bayedieng avatar Aug 18 '23 21:08 bayedieng

Can't compile with the cuda feature. Does the following info help (is there anything a clueless CUDA newbie could do)?

$ nvcc --ptx --gpu-architecture=sm_61 affine.cu -I.
compatibility.cuh(11): error: identifier "__hmax" is undefined

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: identifier "__hmin" is undefined

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

6 errors detected in the compilation of "affine.cu".
$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA GeForce GTX 1080, 6.1, 535.86.05
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
$ uname -ar
Linux visi2 5.19.0-41-generic #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

theHausdorffMetric avatar Aug 19 '23 11:08 theHausdorffMetric

@bayedieng @theHausdorffMetric

compatibility.cuh(11): error: identifier "__hmax" is undefined

Yes, this means CUDA 11.5 doesn't have this function, therefore the compat layer doesn't work.

Upgrading CUDA should help; use at least 11.8.

Narsil avatar Aug 21 '23 08:08 Narsil
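
If upgrading the toolkit is not an option, one possible workaround (an assumption on my part, not something candle ships) is to emulate the missing intrinsics through float, guarded so the definitions only appear where cuda_fp16.h leaves them out:

#include <cuda_fp16.h>

// Placeholder guard: pre-11.8 toolkits targeting pre-sm_80 cards lack
// __hmax/__hmin. Round-tripping through float is exact for min/max, and
// fmaxf/fminf already prefer the non-NaN operand.
#if __CUDA_ARCH__ < 800 && \
    (__CUDACC_VER_MAJOR__ < 11 || (__CUDACC_VER_MAJOR__ == 11 && __CUDACC_VER_MINOR__ < 8))
__device__ __forceinline__ __half __hmax(__half a, __half b) {
    return __float2half(fmaxf(__half2float(a), __half2float(b)));
}
__device__ __forceinline__ __half __hmin(__half a, __half b) {
    return __float2half(fminf(__half2float(a), __half2float(b)));
}
#endif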

OK, not sure if this is an edge case: I was trying out candle on an NVIDIA Jetson Nano. candle failed to detect CUDA because it uses nvidia-smi, while Jetson uses tegrastats.

GeauxEric avatar Aug 28 '23 06:08 GeauxEric

Still doesn't work on main (4abc1ea34dbc834e561f442737faf2c735f0a6ce); here are yet more error messages:

$ nvcc --ptx --gpu-architecture=sm_61 affine.cu -I.

compatibility.cuh(11): error: identifier "__hmax" is undefined

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(11): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: identifier "__hmin" is undefined

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

compatibility.cuh(14): error: ambiguous "?" operation: second operand of type "__half" can be converted to third operand type "<error-type>", and vice versa

6 errors detected in the compilation of "affine.cu".
$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA GeForce RTX 2060, 7.5, 535.104.05

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

$ uname -a
Linux chorusfruit 6.2.6-76060206-generic #202303130630~1689015125~22.04~ab2190e SMP PREEMPT_DYNAMIC Mon J x86_64 x86_64 x86_64 GNU/Linux

tezlm avatar Sep 26 '23 11:09 tezlm

Try with CUDA >= 12.

Narsil avatar Sep 26 '23 15:09 Narsil

I also have an issue getting candle to utilize the GPU.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

$ nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
name, compute_cap, driver_version
NVIDIA GeForce RTX 2070, 7.5, 535.104.12

$ uname -a
Linux xxx 5.4.0-164-generic #181-Ubuntu SMP Fri Sep 1 13:41:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I get this error when I run the Mistral example:

$ cargo run --example mistral --features cuda --release -- --prompt "Here is a sample quick sort implementation in rust " --quantized -n 400
avx: true, neon: false, simd128: false, f16c: true
temp: 0.00 repeat-penalty: 1.10 repeat-last-n: 64
retrieved the files in 128.695µs
loaded the model in 2.726856172s
Here is a sample quick sort implementation in rust Illegal instruction (core dumped)

When I edit ~/candle/.cargo/config.toml to build with the flags rustflags = ["-C", "target-cpu=native", "-C", "target-feature=-avx,-avx2"] (basically I copy in the build flags from [target.x86_64-apple-darwin], inspired by your suggestion in Issue #622), it runs fine on the CPU.

dashdeckers avatar Oct 13 '23 14:10 dashdeckers

I doubt that it's CUDA related, as the quantized models are supposed to always run on the CPU at the moment. Could you try to run with the exact same setup as when it crashes, but remove --features cuda? Also try keeping the features flag and adding --cpu. Finally, if you could launch this in gdb and send back the backtrace, that would be very useful (in this case it's better to compile with debug symbols, i.e. replace --release with --profile=release-with-debug). Thanks

LaurentMazare avatar Oct 13 '23 14:10 LaurentMazare