
cuda-memcheck: "Address ... is out of bounds"

Open hweom opened this issue 2 years ago • 1 comment

Describe the bug

cuda-memcheck reports a continuous stream of errors on example-mnist-classification like this:

========= Invalid __global__ write of size 4
=========     at 0x00001780 in void copy_kernel<float>(cublasCopyParams<float>)
=========     by thread (191,0,0) in block (0,0,0)
=========     Address 0x7fd319043efc is out of bounds

To Reproduce

Steps to reproduce the behavior:

  1. cargo build
  2. cuda-memcheck target/debug/example-mnist-classification mnist linear

Expected behavior

No errors.

Please complete the following information:

  • System: Manjaro Linux
  • Version: Git commit 854043d50bd
  • Rust: rustc 1.62.1 (e092d0b6b 2022-07-16)
  • Environment:
  • Backends (if relevant):
    • cuda:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P8     5W /  N/A |      4MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1936      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+

Additional context

Note that running example-mnist-classification without cuda-memcheck works just fine and is able to converge. I only discovered this while working on #159, where training with CUDA crashes with CUDA_ERROR_ILLEGAL_ADDRESS when copying from GPU to host. I'm not sure it's the same issue, but it seems related.

hweom · Jul 27 '22 22:07

It looks like the CUDA implementation of the copy() function doesn't check that the destination tensor is large enough (here).

By adding a check like this:

macro_rules! iblas_copy_for_cuda {
    ($t:ident) => {
        fn copy(
            &self,
            x: &SharedTensor<$t>,
            y: &mut SharedTensor<$t>,
        ) -> Result<(), ::coaster::error::Error> {
            // Fail fast if the source and destination tensor sizes differ.
            assert_eq!(x.desc().size(), y.desc().size());
            // ... rest of the existing implementation unchanged ...
We now get a panic:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `300`,
 right: `10`', coaster-blas/src/frameworks/cuda/mod.rs:23:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: coaster_blas::frameworks::cuda::<impl coaster_blas::plugin::Copy<f32> for coaster::backend::Backend<coaster::frameworks::cuda::Cuda>>::copy
             at ./coaster-blas/src/frameworks/cuda/helper.rs:109:13
   5: <juice::layers::common::linear::Linear as juice::layer::ComputeParametersGradient<f32,B>>::compute_parameters_gradient
             at ./juice/src/layers/common/linear.rs:220:9

So we're trying to copy 300 floats into a tensor of size 10, and it happens in the Linear layer here.
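
For illustration, here is a minimal host-side sketch of that mismatch. The 30x10 layer dimensions below are hypothetical, chosen only to reproduce the 300-vs-10 numbers from the panic; the real sizes come from the network definition in the example:

// Minimal sketch of the mismatch, assuming a hypothetical linear layer
// with 30 inputs and 10 outputs (numbers picked to match the 300 vs. 10
// values in the panic above, not taken from the actual example).
fn main() {
    let (inputs, outputs) = (30usize, 10usize);

    // The weight gradient of a 30x10 linear layer holds 300 floats...
    let weight_gradient = vec![0.0f32; inputs * outputs];

    // ...but the destination tensor only holds 10.
    let mut bias_gradient = vec![0.0f32; outputs];

    // An unchecked copy of 300 elements here would write 290 floats past
    // the end of the destination -- the same out-of-bounds write that
    // cuda-memcheck flags on the GPU. The assert from the patch above
    // turns it into an immediate panic instead:
    assert_eq!(weight_gradient.len(), bias_gradient.len()); // panics: 300 != 10
    bias_gradient.copy_from_slice(&weight_gradient);
}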

hweom · Jul 27 '22 23:07