Enable gesvda and fix geqrf

Open xinyazhang opened this issue 1 year ago • 6 comments

Fixes SWDEV-407984 and SWDEV-392430

According to the Math Library team, returning an error when batch_count == 0 is expected behavior, so I'm making the temporary workaround permanent.

`PYTORCH_TEST_WITH_ROCM=1 python test/run_test.py --verbose --use-pytest -i test_linalg.py` shows

```
FAILED [0.0052s] test_linalg.py::TestLinalgCUDA::test_linalg_lstsq_batch_broadcasting_cuda_complex128
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
======== 1 failed, 233 passed, 39 skipped, 2 rerun in 96.84s (0:01:36) =========
```

xinyazhang avatar Aug 07 '23 20:08 xinyazhang

@xinyazhang Could you please add a comment to this PR showing test_linalg.py TestLinalgCUDA.test_svd_* passes for each data type? Make sure to force the gesvda path to verify the new gesvda drivers are indeed being called. Thank you!
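A per-dtype check along these lines could look like the sketch below. It is not the `test_svd_*` suite itself, only a minimal stand-in: the helper name `svd_reconstruction_errors`, the matrix shape, and the tolerances are assumptions for illustration. It relies on the `driver` keyword of `torch.linalg.svd`, which is only accepted for CUDA/ROCm tensors, so the keyword is dropped on CPU. Requesting `driver="gesvda"` does not by itself prove the backend dispatched to gesvda; that still has to be confirmed from the solver logs.

```python
import torch

DTYPES = (torch.float32, torch.float64, torch.complex64, torch.complex128)


def svd_reconstruction_errors(device="cpu", driver=None, shape=(4, 6, 5)):
    """Return {dtype: max |A - U diag(S) Vh|} for a random batch per dtype."""
    errors = {}
    for dtype in DTYPES:
        torch.manual_seed(0)
        A = torch.randn(shape, dtype=dtype).to(device)
        # `driver` is only valid for GPU inputs; omit it on CPU.
        kwargs = {"driver": driver} if (A.is_cuda and driver) else {}
        U, S, Vh = torch.linalg.svd(A, full_matrices=False, **kwargs)
        # S is real-valued; cast to A's dtype before scaling the columns of U.
        recon = (U * S.to(dtype).unsqueeze(-2)) @ Vh
        errors[dtype] = (A - recon).abs().max().item()
    return errors


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for dtype, err in svd_reconstruction_errors(device, driver="gesvda").items():
        print(f"{dtype}: max reconstruction error {err:.3e}")
```

On a ROCm build the device string is still `"cuda"`, so the same call covers both backends.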

alugorey avatar Aug 08 '23 15:08 alugorey

Hi @alugorey do you know how to enable the logging of rocSOLVER? I've tried ROCSOLVER_LAYER=7 ROCSOLVER_LEVELS=99 ROCSOLVER_LOG_TRACE_PATH=t ROCSOLVER_LOG_BENCH_PATH=b ROCSOLVER_LOG_PROFILE_PATH=p but it doesn't work

xinyazhang avatar Aug 08 '23 18:08 xinyazhang

Okay, I've fixed how gesvda is enabled, but the U and Vh results are apparently incorrect, even though the singular values are right.

I'll hand this over to the math library team after confirming with rocSOLVER.
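To separate "singular values are right" from "U and Vh are wrong", the factors can be checked one at a time. The sketch below is an illustration only: `diagnose_svd` is a hypothetical helper and it assumes reduced factors (`full_matrices=False`). Because U and Vh are only unique up to sign or phase, it does not compare them element-wise against a reference; it checks orthonormality and the reconstruction instead.

```python
import torch


def diagnose_svd(A, U, S, Vh):
    """Report separate error measures for a reduced SVD A ~= U diag(S) Vh."""
    A, U, S, Vh = A.cpu(), U.cpu(), S.cpu(), Vh.cpu()
    ref_S = torch.linalg.svdvals(A)  # CPU reference singular values
    eye = torch.eye(S.shape[-1], dtype=A.dtype)
    recon = (U * S.to(A.dtype).unsqueeze(-2)) @ Vh
    return {
        "sigma": (S - ref_S).abs().max().item(),         # singular values
        "U_orth": (U.mH @ U - eye).abs().max().item(),   # columns of U orthonormal
        "Vh_orth": (Vh @ Vh.mH - eye).abs().max().item(),  # rows of Vh orthonormal
        "recon": (A - recon).abs().max().item(),         # A == U diag(S) Vh
    }


if __name__ == "__main__":
    torch.manual_seed(0)
    A = torch.randn(5, 4, dtype=torch.float64)
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    print(diagnose_svd(A, U, S, Vh))
    # Reordering the columns of U keeps "sigma" and "U_orth" small
    # but breaks "recon", which mirrors the symptom described above.
    print(diagnose_svd(A, U.flip(-1), S, Vh))
```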

xinyazhang avatar Aug 08 '23 20:08 xinyazhang

> Hi @alugorey do you know how to enable the logging of rocSOLVER? I've tried ROCSOLVER_LAYER=7 ROCSOLVER_LEVELS=99 ROCSOLVER_LOG_TRACE_PATH=t ROCSOLVER_LOG_BENCH_PATH=b ROCSOLVER_LOG_PROFILE_PATH=p but it doesn't work

Hi @xinyazhang, sorry, I'm just catching up on email. I've only worked with ROCBLAS_LAYER=N. I'd assume this is what you need, as rocSOLVER is just a thin wrapper around rocBLAS. You can find more details here: https://confluence.amd.com/pages/viewpage.action?spaceKey=~pensun&title=Collect+unique+rocBLAS+and+MIOpen+configs+from+application

alugorey avatar Aug 11 '23 15:08 alugorey

Just to document: enabling the rocSOLVER logging mechanism requires a call to rocsolver_log_begin. The following Python code makes it possible to enable rocSOLVER logging without recompiling torch:

```python
#!/usr/bin/env python

from cffi import FFI
ffi = FFI()

ffi.cdef("""
typedef enum rocblas_status_
{
    rocblas_status_success         = 0, /**< Success */
    rocblas_status_invalid_handle  = 1, /**< Handle not initialized, invalid or null */
    rocblas_status_not_implemented = 2, /**< Function is not implemented */
    rocblas_status_invalid_pointer = 3, /**< Invalid pointer argument */
    rocblas_status_invalid_size    = 4, /**< Invalid size argument */
    rocblas_status_memory_error    = 5, /**< Failed internal memory allocation, copy or dealloc */
    rocblas_status_internal_error  = 6, /**< Other internal library failure */
    rocblas_status_perf_degraded   = 7, /**< Performance degraded due to low device memory */
    rocblas_status_size_query_mismatch = 8, /**< Unmatched start/stop size query */
    rocblas_status_size_increased      = 9, /**< Queried device memory size increased */
    rocblas_status_size_unchanged      = 10, /**< Queried device memory size unchanged */
    rocblas_status_invalid_value       = 11, /**< Passed argument not valid */
    rocblas_status_continue            = 12, /**< Nothing preventing function to proceed */
    rocblas_status_check_numerics_fail
    = 13, /**< Will be set if the vector/matrix has a NaN/Infinity/denormal value */
    rocblas_status_excluded_from_build
    = 14, /**< Function is not available in build, likely a function requiring Tensile built without Tensile */
    rocblas_status_arch_mismatch
    = 15, /**< The function requires a feature absent from the device architecture */
} rocblas_status;
rocblas_status rocsolver_log_begin();
""")
C = ffi.dlopen('/opt/rocm/lib/librocsolver.so.0.1.60000')
C.rocsolver_log_begin()
```
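For setups where the library version differs or cffi is not installed, the same idea can be sketched with only the standard library. This is a hedged variant, not a tested replacement: the helper names, the `/opt/rocm` default, and the log file names are assumptions, the environment variable names are the ones quoted earlier in this thread, and it assumes those variables are read at the time `rocsolver_log_begin` is called.

```python
import ctypes
import glob
import os


def rocsolver_log_env(layer=7, levels=99, prefix="rocsolver"):
    """Build the logging environment (variable names taken from this thread)."""
    return {
        "ROCSOLVER_LAYER": str(layer),
        "ROCSOLVER_LEVELS": str(levels),
        "ROCSOLVER_LOG_TRACE_PATH": f"{prefix}.trace",
        "ROCSOLVER_LOG_BENCH_PATH": f"{prefix}.bench",
        "ROCSOLVER_LOG_PROFILE_PATH": f"{prefix}.profile",
    }


def find_rocsolver(rocm_root="/opt/rocm"):
    """Return a path to librocsolver under rocm_root, or None if absent."""
    pattern = os.path.join(rocm_root, "lib", "librocsolver.so*")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def enable_rocsolver_logging(rocm_root="/opt/rocm", **env_kwargs):
    """Set the logging variables, then call rocsolver_log_begin().

    Returns True on rocblas_status_success (0), False if the library
    was not found or the call reported an error.
    """
    path = find_rocsolver(rocm_root)
    if path is None:
        return False
    # Assigning to os.environ also updates the C-level environment.
    os.environ.update(rocsolver_log_env(**env_kwargs))
    lib = ctypes.CDLL(path)
    return lib.rocsolver_log_begin() == 0


if __name__ == "__main__":
    print("rocSOLVER logging enabled:", enable_rocsolver_logging())
```

Globbing for the library avoids hard-coding a versioned file name such as `librocsolver.so.0.1.60000`, at the cost of picking whichever match sorts first.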

xinyazhang avatar Aug 14 '23 16:08 xinyazhang

This is a hipSOLVER problem, tracked by https://ontrack-internal.amd.com/browse/SWDEV-421983. I will re-test after hipSOLVER is fixed.

xinyazhang avatar Sep 14 '23 17:09 xinyazhang