Build CI/CD for GPU.
Currently, we only test the CPU version on every commit. We may need to add tests for the GPU version (maybe by running the tests on a local machine).
There is a way to do GPU testing in CI, but I don't think we have the bandwidth to do that.
We are bringing in some help.
Hi, I would like to contribute this part.
The related PR will use nvidia/cuda docker image.
What's the CUDA version we intend to use in CI?
Let's start with 11.8? Ideally we want to upgrade to 12 fully.
I think we are using Ubuntu 20 or 22, right?
22 would be nice. Also, some DevOps help with setting up GitHub Actions CI for that would be great.
@kaihsin
I tried to build and test in the NVIDIA container (nvidia/cuda:11.8.0-devel-ubuntu22.04).
Here is my config:
cmake -B $CYTNX_BUILD -G Ninja \
-D CMAKE_BUILD_TYPE=Debug \
-D Python_EXECUTABLE="$(which python3.11)" \
-D Python_INTERPRETER="$(which python3.11)" \
-D Python3_EXECUTABLE="$(which python3.11)" \
-D Python3_INTERPRETER="$(which python3.11)" \
-D BLAS_LIBRARIES="$OPENBLAS_BLAS_LIB" \
-D LAPACK_LIBRARIES="$OPENBLAS_LAPACKE_LIB" \
-D USE_MKL=OFF \
-D USE_CUDA=ON \
-D USE_CUTENSOR=OFF \
-D USE_CUQUANTUM=OFF \
-D USE_CUTT=OFF \
-D USE_MAGMA=OFF \
-D RUN_TESTS=ON
43 tests failed. Is that expected?
Can the tests pass with only USE_CUDA enabled?
95% tests passed, 43 tests failed out of 891
Total Test time (real) = 413.69 sec
The following tests FAILED:
640 - NetworkTest.gpu_Network_dense_no_order (Failed)
641 - NetworkTest.gpu_Network_dense_find_optimal (Failed)
642 - NetworkTest.gpu_Network_dense_order_line (Failed)
643 - NetworkTest.gpu_Network_dense_specified_order (Failed)
644 - NetworkTest.gpu_Network_dense_reuse (Failed)
646 - NconTest.gpu_ncon_default_order (Failed)
647 - NconTest.gpu_ncon_specified_order (Failed)
648 - NconTest.gpu_ncon_optimal_order (Failed)
649 - ContractTest.gpu_Contract_denseUt_optimal_order (Failed)
650 - ContractTest.gpu_Contract_denseUt_default_order (Failed)
651 - ContractTest.gpu_Contract_denseUt_specified_order (Failed)
652 - ContractTest.gpu_Contract_denseUt_optimal_specified_order (Failed)
653 - ContractTest.gpu_Contracts_denseUt_optimal_order (Failed)
654 - ContractTest.gpu_Contracts_denseUt_default_order (Failed)
655 - ContractTest.gpu_Contracts_denseUt_specified_order (Failed)
656 - ContractTest.gpu_Contracts_denseUt_optimal_specified_order (Failed)
657 - BlockUniTensorTest.gpu_Trace (Failed)
706 - BlockUniTensorTest.gpu_group_basis (SEGFAULT)
710 - DenseUniTensorTest.gpu_Trace (Failed)
775 - Directsum.gpu_one_elem_tens (SEGFAULT)
779 - Directsum.gpu_shared_axis_contains_all_tens_one_elem (SEGFAULT)
789 - ExpM.gpu_ExpM_test (Failed)
791 - Lanczos_Gnd.gpu_Bk_Lanczos_Gnd_test (Failed)
792 - Arnoldi.gpu_which_LM_test (Failed)
793 - Arnoldi.gpu_which_LR_test (Failed)
794 - Arnoldi.gpu_which_LI_test (Failed)
795 - Arnoldi.gpu_which_SM_test (Failed)
796 - Arnoldi.gpu_which_SR_test (Failed)
797 - Arnoldi.gpu_which_SI_test (Failed)
798 - Arnoldi.gpu_mat_type_real_test (Failed)
799 - Arnoldi.gpu_k1_test (Failed)
800 - Arnoldi.gpu_k_max (Failed)
801 - Arnoldi.gpu_smallest_dim (Failed)
807 - Svd.gpu_dense_one_elem (Failed)
808 - Svd.gpu_dense_nondiag_test (Failed)
813 - Svd.gpu_U1_zeros_test (Failed)
820 - Gesvd.gpu_dense_one_elem (Failed)
821 - Gesvd.gpu_dense_nondiag_test (Failed)
835 - linalg_Test.gpu_BkUt_Svd_truncate3 (Failed)
836 - linalg_Test.gpu_BkUt_Qr1 (Failed)
838 - linalg_Test.gpu_BkUt_expM (Failed)
839 - linalg_Test.gpu_DenseUt_Gesvd_truncate (Failed)
840 - linalg_Test.gpu_DenseUt_Svd_truncate (Failed)
BTW, all tests passed when USE_CUDA=OFF.
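To turn a summary like the one above into a plain list of test names (for example, to paste into a PR description), something along these lines might work. It is a minimal sketch: the function name `failed_test_names` is hypothetical, and it only recognizes the two statuses that appear in this thread, `Failed` and `SEGFAULT`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of failed tests found in ctest summary text.
# Reads stdin and matches lines such as "640 - Suite.name (Failed)" or "... (SEGFAULT)".
failed_test_names() {
  sed -nE 's/^[[:space:]]*[0-9]+ - (.+) \((Failed|SEGFAULT)\)[[:space:]]*$/\1/p'
}

# Example input copied from the summary above.
sample='95% tests passed, 43 tests failed out of 891
The following tests FAILED:
640 - NetworkTest.gpu_Network_dense_no_order (Failed)
706 - BlockUniTensorTest.gpu_group_basis (SEGFAULT)'

printf '%s\n' "$sample" | failed_test_names
```

For the sample input this prints the two test names, one per line. Other ctest statuses (timeouts, for instance) would need to be added to the pattern.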
If you built on the master branch, it's expected. If you build on the dev-master branch, there are fewer failing tests, but still a few.
I got it. I think we should list all the failed tests and plan to fix them in the future. The list will be included in the PR. Does that make sense?
Which branch should I submit my PR to? master or dev-master?
BTW, I cannot build dev-master successfully. Is that expected?
We can discuss this issue in another thread.
I think dev-master is fine. Could you try to use cuda:12.2 to build?
These tests are expected to fail on the develop branch:
linalg_Test.gpu_BkUt_Svd_truncate3
linalg_Test.gpu_BkUt_expM
Svd.gpu_U1_zeros_test
Lanczos_Gnd.gpu_Bk_Lanczos_Gnd_test
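If CI should stay green while these known failures are being fixed, one option is to exclude them by name with `ctest -E`, which skips tests whose names match a regular expression. A minimal sketch, assuming the build directory is called `build` and that ctest is from CMake 3.20 or newer (needed for `--test-dir`); the helper name `build_exclude_regex` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn test names into one anchored regex for `ctest -E`.
# Steps: escape the dots in Suite.test names, join the names with '|',
# then anchor the alternation so that only exact names match.
build_exclude_regex() {
  printf '%s\n' "$@" |
    sed 's/\./\\./g' |
    paste -sd'|' - |
    sed 's/.*/^(&)$/'
}

regex="$(build_exclude_regex \
  linalg_Test.gpu_BkUt_Svd_truncate3 \
  linalg_Test.gpu_BkUt_expM \
  Svd.gpu_U1_zeros_test \
  Lanczos_Gnd.gpu_Bk_Lanczos_Gnd_test)"
echo "$regex"

# Run everything except the known failures, but only where a build tree exists.
if command -v ctest >/dev/null 2>&1 && [ -f build/CTestTestfile.cmake ]; then
  ctest --test-dir build -E "$regex" --output-on-failure
fi
```

Keeping the names in one place makes it easy to shrink the list as tests get fixed; an excluded test that starts passing will not be noticed automatically, so the list needs occasional review.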
I built Cytnx with cuda:12.6.3 successfully.
Besides the tests that are expected to fail on the develop branch, 3 extra tests failed:
The following tests FAILED:
972 - Directsum.gpu_one_elem_tens (SEGFAULT)
976 - Directsum.gpu_shared_axis_contains_all_tens_one_elem (SEGFAULT)
1100 - BlockUniTensorTest.gpu_group_basis (SEGFAULT)
My commit: https://github.com/jysh1214/Cytnx/commit/50aa8fe2e18738e3e972deca921b4a7c71a2cb54
I can't copy the whole environment. However, you may try these changes to suppress the segmentation fault error.
To build in debug mode, the CMake option -D USE_DEBUG=ON should be set even if CMAKE_BUILD_TYPE is set to Debug. The environment variable ASAN_OPTIONS="protect_shadow_gap=0:replace_intrin=0:detect_leaks=0" should be set when running the tests. Refer to:
https://github.com/google/sanitizers/issues/629#issuecomment-161755902
https://stackoverflow.com/a/68027496
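Put together, the two settings above could look roughly like this in a shell session. This is a sketch under assumptions: the default build directory name `build` is hypothetical, the remaining configure options from the earlier config are omitted for brevity, and the commands are guarded so they only run where cmake and a source checkout are present.

```shell
#!/usr/bin/env bash
# Sanitizer options quoted in the comment above; exported so that the test
# processes started by ctest inherit them.
export ASAN_OPTIONS="protect_shadow_gap=0:replace_intrin=0:detect_leaks=0"

# $CYTNX_BUILD as in the earlier config; "build" is an assumed default.
BUILD_DIR="${CYTNX_BUILD:-build}"

# Configure with USE_DEBUG=ON in addition to CMAKE_BUILD_TYPE=Debug, then build and test.
if command -v cmake >/dev/null 2>&1 && [ -f CMakeLists.txt ]; then
  cmake -B "$BUILD_DIR" -D CMAKE_BUILD_TYPE=Debug -D USE_DEBUG=ON -D USE_CUDA=ON -D RUN_TESTS=ON
  cmake --build "$BUILD_DIR"
  ctest --test-dir "$BUILD_DIR" --output-on-failure
fi
```

Exporting the variable once, rather than prefixing each command, keeps the setting in effect for any later ctest invocation in the same session.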
I still encountered some errors:
The following tests FAILED:
174 - SearchTreeTest.BasicSearchOrder2 (Failed)
972 - Directsum.gpu_one_elem_tens (Failed)
976 - Directsum.gpu_shared_axis_contains_all_tens_one_elem (Failed)
1100 - BlockUniTensorTest.gpu_group_basis (Failed)
I will try to fix them later.
My commit: https://github.com/jysh1214/Cytnx/commit/58d1473060c8587d6866f0cfca755d35faf029a2
Could you provide the log file build/Testing/Temporary/LastTest.log from running these failed tests?
Sure. Please check this: https://drive.google.com/drive/folders/1FenrFwgLBsqv25icR_U7alEkVcpRe3Mt
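For anyone reproducing this, the requested log can probably be regenerated by rerunning only the tests that failed in the previous run, since ctest writes `Testing/Temporary/LastTest.log` inside the build directory for each run. A minimal sketch, assuming the build directory is called `build`; the helper name `last_test_log` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: path of the log that ctest writes for its most recent run.
last_test_log() {
  printf '%s/Testing/Temporary/LastTest.log\n' "${1:-build}"
}

log="$(last_test_log build)"

# Rerun only the previously failed tests, but only where a build tree exists.
if command -v ctest >/dev/null 2>&1 && [ -f build/CTestTestfile.cmake ]; then
  ctest --test-dir build --rerun-failed --output-on-failure
fi

# Report whether the log is there to be attached.
if [ -f "$log" ]; then
  echo "log written to $log"
else
  echo "no log at $log yet"
fi
```

Because the log is overwritten by each run, it is worth copying it somewhere else before running ctest again.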