Build CI/CD for GPU.

Open hunghaoti opened this issue 1 year ago • 17 comments

Currently, we only test the CPU version on every commit. We may need to add tests for the GPU version (maybe by running the tests on a local machine).

hunghaoti avatar Dec 14 '24 06:12 hunghaoti

There is a way to do GPU testing in CI, but I don't think we have the bandwidth to do that.

kaihsin avatar Dec 14 '24 07:12 kaihsin

We are bringing in some help.

yingjerkao avatar Dec 14 '24 14:12 yingjerkao

Hi, I would like to contribute to this part.

The related PR will use the nvidia/cuda Docker image. Which CUDA version do we intend to use in CI?

jysh1214 avatar Dec 16 '24 13:12 jysh1214

Let's start with 11.8? Ideally, we want to upgrade fully to 12.

kaihsin avatar Dec 16 '24 13:12 kaihsin

I think we are using Ubuntu 20 or 22, right?

jysh1214 avatar Dec 16 '24 13:12 jysh1214

22 would be nice. Also, some DevOps help with setting up GitHub Actions CI for that would be great.

kaihsin avatar Dec 16 '24 14:12 kaihsin

@kaihsin I tried to build and run the tests in the NVIDIA container (nvidia/cuda:11.8.0-devel-ubuntu22.04).

Here is my config:

cmake -B $CYTNX_BUILD -G Ninja \
  -D CMAKE_BUILD_TYPE=Debug \
  -D Python_EXECUTABLE="$(which python3.11)" \
  -D Python_INTERPRETER="$(which python3.11)" \
  -D Python3_EXECUTABLE="$(which python3.11)" \
  -D Python3_INTERPRETER="$(which python3.11)" \
  -D BLAS_LIBRARIES="$OPENBLAS_BLAS_LIB" \
  -D LAPACK_LIBRARIES="$OPENBLAS_LAPACKE_LIB" \
  -D USE_MKL=OFF \
  -D USE_CUDA=ON \
  -D USE_CUTENSOR=OFF \
  -D USE_CUQUANTUM=OFF \
  -D USE_CUTT=OFF \
  -D USE_MAGMA=OFF \
  -D RUN_TESTS=ON

43 tests failed. Is this expected? Can the tests pass with only USE_CUDA enabled?

95% tests passed, 43 tests failed out of 891

Total Test time (real) = 413.69 sec

The following tests FAILED:
        640 - NetworkTest.gpu_Network_dense_no_order (Failed)
        641 - NetworkTest.gpu_Network_dense_find_optimal (Failed)
        642 - NetworkTest.gpu_Network_dense_order_line (Failed)
        643 - NetworkTest.gpu_Network_dense_specified_order (Failed)
        644 - NetworkTest.gpu_Network_dense_reuse (Failed)
        646 - NconTest.gpu_ncon_default_order (Failed)
        647 - NconTest.gpu_ncon_specified_order (Failed)
        648 - NconTest.gpu_ncon_optimal_order (Failed)
        649 - ContractTest.gpu_Contract_denseUt_optimal_order (Failed)
        650 - ContractTest.gpu_Contract_denseUt_default_order (Failed)
        651 - ContractTest.gpu_Contract_denseUt_specified_order (Failed)
        652 - ContractTest.gpu_Contract_denseUt_optimal_specified_order (Failed)
        653 - ContractTest.gpu_Contracts_denseUt_optimal_order (Failed)
        654 - ContractTest.gpu_Contracts_denseUt_default_order (Failed)
        655 - ContractTest.gpu_Contracts_denseUt_specified_order (Failed)
        656 - ContractTest.gpu_Contracts_denseUt_optimal_specified_order (Failed)
        657 - BlockUniTensorTest.gpu_Trace (Failed)
        706 - BlockUniTensorTest.gpu_group_basis (SEGFAULT)
        710 - DenseUniTensorTest.gpu_Trace (Failed)
        775 - Directsum.gpu_one_elem_tens (SEGFAULT)
        779 - Directsum.gpu_shared_axis_contains_all_tens_one_elem (SEGFAULT)
        789 - ExpM.gpu_ExpM_test (Failed)
        791 - Lanczos_Gnd.gpu_Bk_Lanczos_Gnd_test (Failed)
        792 - Arnoldi.gpu_which_LM_test (Failed)
        793 - Arnoldi.gpu_which_LR_test (Failed)
        794 - Arnoldi.gpu_which_LI_test (Failed)
        795 - Arnoldi.gpu_which_SM_test (Failed)
        796 - Arnoldi.gpu_which_SR_test (Failed)
        797 - Arnoldi.gpu_which_SI_test (Failed)
        798 - Arnoldi.gpu_mat_type_real_test (Failed)
        799 - Arnoldi.gpu_k1_test (Failed)
        800 - Arnoldi.gpu_k_max (Failed)
        801 - Arnoldi.gpu_smallest_dim (Failed)
        807 - Svd.gpu_dense_one_elem (Failed)
        808 - Svd.gpu_dense_nondiag_test (Failed)
        813 - Svd.gpu_U1_zeros_test (Failed)
        820 - Gesvd.gpu_dense_one_elem (Failed)
        821 - Gesvd.gpu_dense_nondiag_test (Failed)
        835 - linalg_Test.gpu_BkUt_Svd_truncate3 (Failed)
        836 - linalg_Test.gpu_BkUt_Qr1 (Failed)
        838 - linalg_Test.gpu_BkUt_expM (Failed)
        839 - linalg_Test.gpu_DenseUt_Gesvd_truncate (Failed)
        840 - linalg_Test.gpu_DenseUt_Svd_truncate (Failed)

BTW, all tests passed when USE_CUDA=OFF.
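
For anyone reproducing this run, a minimal sketch of starting a command inside the same container image might look like the following. It assumes Docker with the NVIDIA Container Toolkit installed (which provides the --gpus option) on a machine with an NVIDIA GPU. The function name run_in_cuda_container and the /workspace mount point are hypothetical, not part of Cytnx.

```shell
#!/bin/sh
# Run a command inside a CUDA development container with the current
# directory mounted at /workspace (a hypothetical mount point).
# Assumes Docker plus the NVIDIA Container Toolkit, which provides --gpus.
run_in_cuda_container() {
  image=$1
  shift
  docker run --rm --gpus all \
    -v "$PWD:/workspace" -w /workspace \
    "$image" "$@"
}

# Example (not executed here):
#   run_in_cuda_container nvidia/cuda:11.8.0-devel-ubuntu22.04 nvidia-smi
```

Passing the image as an argument makes it easy to switch between the CUDA 11.8 and CUDA 12 images discussed above without editing the function.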

jysh1214 avatar Dec 24 '24 02:12 jysh1214

If you built on the master branch, that's expected. If you build on the dev-master branch, there are fewer failing tests, but still a few.

IvanaGyro avatar Dec 24 '24 02:12 IvanaGyro

I got it. I think we should list all the failed tests and plan to fix them in the future. The list will be included in the PR. Does that make sense?
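
One possible way to produce such a list automatically is to extract the test names from the ctest summary. The sketch below assumes the summary lines have the same shape as the output pasted earlier in this thread ("NUMBER - NAME (STATUS)"). The function name failed_tests is hypothetical.

```shell
#!/bin/sh
# Print the names of failed tests, one per line, from a ctest summary.
# Reads the files given as arguments, or standard input if none are given.
# Matches lines such as "        706 - Suite.test_name (SEGFAULT)".
failed_tests() {
  sed -n -E 's/^[[:space:]]*[0-9]+ - ([^[:space:]]+) \([^)]*\).*$/\1/p' "$@"
}

# Example (not executed here):
#   ctest --test-dir build 2>&1 | failed_tests > failed.txt
```

Lines that do not match the pattern, such as the "95% tests passed" summary, are dropped, so the output can be committed or attached to a PR as-is.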

jysh1214 avatar Dec 24 '24 11:12 jysh1214

Which branch should I submit my PR to, master or dev-master?

BTW, I cannot build dev-master successfully. Is that expected? We can discuss this issue in another thread.

jysh1214 avatar Dec 25 '24 05:12 jysh1214

I think dev-master is fine. Could you try to use cuda:12.2 to build?

hunghaoti avatar Dec 25 '24 05:12 hunghaoti

These tests are expected to fail on the develop branch.

linalg_Test.gpu_BkUt_Svd_truncate3
linalg_Test.gpu_BkUt_expM
Svd.gpu_U1_zeros_test
Lanczos_Gnd.gpu_Bk_Lanczos_Gnd_test
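
A CI job could treat this list as an allow-list and report only failures that are not on it. The following is a rough sketch under that assumption. The function name unexpected_failures and both file names are hypothetical, and each file is assumed to hold one test name per line.

```shell
#!/bin/sh
# Print the tests listed in ACTUAL that are not listed in EXPECTED.
# Both arguments are paths to plain-text files with one test name per line.
unexpected_failures() {
  expected_file=$1
  actual_file=$2
  # -F: fixed strings, -x: match whole lines, -v: invert the match,
  # -f: read the patterns from a file.
  # grep exits with status 1 when it prints nothing, so mask that here.
  grep -F -x -v -f "$expected_file" "$actual_file" || true
}

# Example (not executed here):
#   unexpected_failures expected_failures.txt failed.txt
```

An empty result would mean every failure was already known. Note that "|| true" also hides real grep errors such as a missing file, so a stricter script would check that both files exist first.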

IvanaGyro avatar Dec 25 '24 06:12 IvanaGyro

> I think dev-master is fine. Could you try to use cuda:12.2 to build?

I built Cytnx with cuda:12.6.3 successfully.

> These tests are expected to fail on the develop branch.

3 extra tests failed:

The following tests FAILED:
        972 - Directsum.gpu_one_elem_tens (SEGFAULT)
        976 - Directsum.gpu_shared_axis_contains_all_tens_one_elem (SEGFAULT)
        1100 - BlockUniTensorTest.gpu_group_basis (SEGFAULT)

My commit: https://github.com/jysh1214/Cytnx/commit/50aa8fe2e18738e3e972deca921b4a7c71a2cb54

jysh1214 avatar Dec 25 '24 08:12 jysh1214

I can't copy the whole environment. However, you may try these changes to suppress the segmentation fault error.

To build in debug mode, the flag -D USE_DEBUG=ON should be set even if CMAKE_BUILD_TYPE is set to Debug. The environment variable ASAN_OPTIONS="protect_shadow_gap=0:replace_intrin=0:detect_leaks=0" should be set while running the tests.

Refer to:
https://github.com/google/sanitizers/issues/629#issuecomment-161755902
https://stackoverflow.com/a/68027496
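
Put together, the two settings might be applied roughly as follows. This is only a sketch: it assumes USE_DEBUG is passed to cmake like the other -D options in the configuration posted earlier, it omits the Python, BLAS, and LAPACK options from that configuration for brevity, and the function name run_gpu_tests_debug and the default build directory are hypothetical.

```shell
#!/bin/sh
# Configure a debug build with CUDA enabled, build it, and run the tests
# with the sanitizer options quoted above.
# ctest --test-dir requires CMake 3.20 or newer.
run_gpu_tests_debug() {
  build_dir=${1:-build}
  cmake -B "$build_dir" -G Ninja \
    -D CMAKE_BUILD_TYPE=Debug \
    -D USE_DEBUG=ON \
    -D USE_CUDA=ON \
    -D RUN_TESTS=ON &&
  cmake --build "$build_dir" &&
  ASAN_OPTIONS="protect_shadow_gap=0:replace_intrin=0:detect_leaks=0" \
    ctest --test-dir "$build_dir" --output-on-failure
}

# Example (not executed here):
#   run_gpu_tests_debug build
```

Setting ASAN_OPTIONS only on the ctest invocation keeps it out of the configure and build steps, where it is not needed.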

IvanaGyro avatar Dec 25 '24 09:12 IvanaGyro

I still encountered some errors:

The following tests FAILED:
        174 - SearchTreeTest.BasicSearchOrder2 (Failed)
        972 - Directsum.gpu_one_elem_tens (Failed)
        976 - Directsum.gpu_shared_axis_contains_all_tens_one_elem (Failed)
        1100 - BlockUniTensorTest.gpu_group_basis (Failed)

I will try to fix them later.

My commit: https://github.com/jysh1214/Cytnx/commit/58d1473060c8587d6866f0cfca755d35faf029a2

jysh1214 avatar Dec 25 '24 12:12 jysh1214

Could you provide the log file build/Testing/Temporary/LastTest.log from a run of these failed tests?
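
One way to capture such a log is to rerun only the previously failed tests and then copy the file. The sketch below assumes a build directory laid out as in the path above and a CMake version that supports --test-dir (3.20 or newer). The function name collect_failed_log and the output file name are hypothetical.

```shell
#!/bin/sh
# Rerun only the tests that failed in the previous ctest run, then copy the
# log that ctest writes for that run to OUT_FILE.
collect_failed_log() {
  build_dir=$1
  out_file=$2
  # ctest exits non-zero when a test fails; continue so the log is copied.
  ctest --test-dir "$build_dir" --rerun-failed --output-on-failure || true
  cp "$build_dir/Testing/Temporary/LastTest.log" "$out_file"
}

# Example (not executed here):
#   collect_failed_log build failed_tests.log
```

Because ctest rewrites LastTest.log on each run, the copied file should contain output for the rerun tests only, which keeps the attachment small.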

hunghaoti avatar Dec 27 '24 01:12 hunghaoti

> Could you provide the log file in build/Testing/Temporary/LastTest.log when you run these Failed test?

Sure. Please check this: https://drive.google.com/drive/folders/1FenrFwgLBsqv25icR_U7alEkVcpRe3Mt

jysh1214 avatar Dec 28 '24 13:12 jysh1214