Segmentation fault (core dumped) in UnitTestCuVectorAddRowSumMat for large matrix dimensions

Open nayakajay opened this issue 4 years ago • 9 comments

Tested on: Titan RTX, CUDA 11.0, driver 455.51.05. The system has abundant RAM (>100G) and an Intel Xeon processor.

I have been trying to run the tests provided in cu-matrix-test.cc. I am interested in one particular test, UnitTestCuVectorAddRowSumMat. To run only that test, I commented out all the other tests in the "CudaMatrixUnitTest" function and modified the "main" function in the test file as follows:

int main() {
  SetVerboseLevel(1);
  int32 loop = 0;

#if HAVE_CUDA == 1
  for (loop = 1; loop < 2; loop++) {
    CuDevice::Instantiate().SetDebugStrideMode(true);
    if (loop == 0)
      CuDevice::Instantiate().SelectGpuId("no");
    else
      CuDevice::Instantiate().SelectGpuId("yes");
#endif

    kaldi::CudaMatrixUnitTest<double>();

    if (loop == 0)
      KALDI_LOG << "Tests without GPU use succeeded.";
    else
      KALDI_LOG << "Tests with GPU use (if available) succeeded.";

#if HAVE_CUDA == 1
  } // No for loop if 'HAVE_CUDA != 1',
  CuDevice::Instantiate().PrintProfile();
#endif
  return 0;
}

As can be seen, I run the test only for double. In the test "UnitTestCuVectorAddRowSumMat", I give X=65000, Y=64360 (well within limits of int32). I am observing segmentation faults in that case. For X=45000, Y=44550, the test runs successfully. Am I doing something wrong?

Sample output

$ ./cu-matrix-test
LOG ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:172) Manually selected to compute on CPU.
Segmentation fault (core dumped)

The GPU code is running fine, I think, the relevant output is (by setting loop=1 in the main shown earlier)

$ ./cu-matrix-test
WARNING ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:247) Not in compute-exclusive mode.  Suggestion: use 'nvidia-smi -c 3' to set compute exclusive mode
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:446) Selecting from 1 GPUs
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:461) cudaSetDevice(0): TITAN RTX   free:24048M, used:172M, total:24220M, free/total:0.992899
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:509) Device: 0, mem_ratio: 0.992899
LOG ([5.5.854~1-403d]:SelectGpuId():cu-device.cc:390) Trying to select device: 0
LOG ([5.5.854~1-403d]:SelectGpuIdAuto():cu-device.cc:519) Success selecting device 0 free mem ratio: 0.992899
LOG ([5.5.854~1-403d]:FinalizeActiveGpu():cu-device.cc:346) The active GPU is [0]: TITAN RTX    free:23566M, used:654M, total:24220M, free/total:0.972998 version 7.5
Segmentation fault (core dumped)

nayakajay avatar Feb 18 '21 08:02 nayakajay

Likely the problem is that the product of the x dim and y dim is outside the range of int32. The CPU code might need to be modified to handle that. I would merge a PR if you could make one.
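To make the arithmetic behind that guess concrete, here is a minimal sketch that checks whether the element count of a matrix fits in a signed 32-bit integer. The helper name ElementCountOverflowsInt32 is hypothetical and is not part of Kaldi; it only illustrates why the two reported cases behave differently.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not a Kaldi function): returns true when
// rows * cols does not fit in a signed 32-bit integer.
// Assumes both dimensions are non-negative.
bool ElementCountOverflowsInt32(std::int32_t rows, std::int32_t cols) {
  // Widen one operand first so the multiplication is done in 64 bits.
  const std::int64_t count = static_cast<std::int64_t>(rows) * cols;
  return count > std::numeric_limits<std::int32_t>::max();
}
```

For the failing case, 65000 x 64360 = 4,183,400,000, which exceeds the int32 maximum of 2,147,483,647. For the passing case, 45000 x 44550 = 2,004,750,000, which still fits.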

danpovey avatar Feb 18 '21 08:02 danpovey

Would a solution be to change MatrixDimT (matrix/matrix-common.h) and MatrixDimT_cuda (cudamatrix/cu-matrixdim.h) from int32 and int32_t to int64 and int64_t?

Edited: Never mind, it can cause problems.

nayakajay avatar Feb 18 '21 10:02 nayakajay

Yes

danpovey avatar Feb 18 '21 11:02 danpovey

Wait, no... just change that one function, that is failing, to use larger size types where necessary.

danpovey avatar Feb 18 '21 11:02 danpovey

The only place I can think of where the product of x_dim and y_dim is computed is during memory allocation (could that also cause the segfault?). But that already seems to be handled with a static_cast; see the matrix allocation code: https://github.com/kaldi-asr/kaldi/blob/master/src/matrix/kaldi-matrix.cc#L804

The functions where actual operations happen use cblas_* functions.

nayakajay avatar Feb 19 '21 06:02 nayakajay

Run it in gdb and get a stack:

gdb matrix-lib-test
(gdb) r
.. crash ..
(gdb) bt

danpovey avatar Feb 19 '21 06:02 danpovey

Solution: In all the trouble-causing places, I added a static_cast<size_t>. The test seems to pass (no ASSERT failure).

Problems: For the double type, the first problem occurred in kaldi-matrix.h. The cast there seemed to get rid of the error. I then tried to fit larger data on the GPU by changing the data type to float. With the float type, another issue showed up in kaldi-matrix.cc.

The trouble also occurred in the CUDA kernel in cu-kernels.cu, which reported an illegal memory access. I used cuda-memcheck to get that information.

I wanted to know if there are any assumptions w.r.t the _strided_reduction_fused kernel, regarding the dimensions of the matrix passed to it. Can it be rectangular (number of rows <<< number of columns or vice-versa)?

I wanted to get information regarding any real-world applications using this specific kernel. Any hints or suggestions in looking for such applications will be really great.

nayakajay avatar Feb 25 '21 09:02 nayakajay

Can you please make a pull request with the errors you fixed? Sorry, you'll have to look at that kernel, and where it's called, yourself; I didn't write it and am not familiar with it. This line concerns me: int idx = colStart + j * d.stride; I'm not sure how wide int is there; it could be 32 bits, and that could overflow. You could cast to size_t and make sure idx is also of type size_t.
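A minimal sketch of the cast suggested above, written as plain C++ so it can be checked on the host. The helper name ElementOffset and its parameter names are hypothetical; they only mirror the shape of colStart + j * d.stride and are not the actual kernel code. It assumes non-negative inputs and a platform where size_t is 64 bits wide.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not Kaldi code): computes col_start + j * stride
// in size_t, so the result cannot wrap around a 32-bit int.
// Assumes all three inputs are non-negative and size_t is 64-bit.
std::size_t ElementOffset(std::int32_t col_start, std::int32_t j,
                          std::int32_t stride) {
  // Cast each operand before the arithmetic; casting only the final
  // result would be too late, because the overflow happens in j * stride.
  return static_cast<std::size_t>(col_start) +
         static_cast<std::size_t>(j) * static_cast<std::size_t>(stride);
}
```

With j = 64999 and a stride of 64360, the offset is 4,183,335,640, which is above the int32 maximum, so an int idx could not hold it. The same casts should carry over to device code, provided the variable that receives the result is also declared as size_t.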

danpovey avatar Feb 25 '21 11:02 danpovey

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

stale[bot] avatar May 13 '21 11:05 stale[bot]