Docker error when running tests: ninja is required
🐛 Bug
Hi,
I'm trying to use PyTorch with ROCm. However, after pulling the Docker image, the tests fail with this message:
Executing ['/usr/bin/python3.6', 'test_cpp_api_parity.py'] ... [2020-09-10 11:45:29.112015]
Traceback (most recent call last):
File "test_cpp_api_parity.py", line 56, in <module>
module_impl_check.build_cpp_tests(TestCppApiParity, print_cpp_source=PRINT_CPP_SOURCE)
File "/root/pytorch/test/cpp_api_parity/module_impl_check.py", line 297, in build_cpp_tests
functions=functions)
File "/root/pytorch/test/cpp_api_parity/utils.py", line 148, in compile_cpp_code_inline
verbose=False,
File "/root/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1130, in load_inline
keep_intermediates=keep_intermediates)
File "/root/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1185, in _jit_compile
with_cuda=with_cuda)
File "/root/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1252, in _write_ninja_file_and_build_library
verify_ninja_availability()
File "/root/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1308, in verify_ninja_availability
raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
Traceback (most recent call last):
File "test/run_test.py", line 716, in <module>
main()
File "test/run_test.py", line 705, in main
raise RuntimeError(err)
RuntimeError: test_cpp_api_parity failed!
To Reproduce
Steps to reproduce the behavior:
- sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
- (inside the docker) PYTORCH_TEST_WITH_ROCM=1 python3.6 test/run_test.py
Expected behavior
All the tests should run without errors.
Environment
(out of the docker)
PyTorch version: 1.7.0a0+8acce55
Is debug build: False
CUDA used to build PyTorch: Could not collect
OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-10ubuntu2) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.7.0a0+8acce55
[conda] Could not collect
Thank you @apalazzi for bringing this to our attention. A dependency is indeed missing from the Docker image you are using. Please install ninja inside the container with the command pip3.6 install ninja and then rerun the test; it should work.
We will add the missing dependency to the Docker image.
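To confirm that the install took effect before rerunning the whole suite, a quick check along these lines may help. This is a minimal sketch using only the Python standard library; the helper name `ninja_available` is hypothetical and not part of PyTorch. It assumes that being able to run `ninja --version` from PATH is what the extension loader needs.

```python
import shutil
import subprocess


def ninja_available(executable="ninja"):
    """Return True if `executable --version` can be run from PATH.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    # shutil.which returns None when the binary is not on PATH.
    if shutil.which(executable) is None:
        return False
    try:
        subprocess.check_output([executable, "--version"], stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        return False
    return True


if __name__ == "__main__":
    if ninja_available():
        print("ninja found on PATH")
    else:
        print("ninja not found; try: pip3.6 install ninja")
```

If this reports that ninja is not found right after installing it, the binary may have been placed in a directory that is not on PATH inside the container.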
Hi,
- ninja is still missing from the rocm/pytorch:latest Docker image at this moment.
- After installing ninja, the test reports this error:
FAIL: test_torch_nn_MSELoss_prec_cuda (__main__.TestCppApiParity)
Traceback (most recent call last):
File "/var/lib/jenkins/pytorch/test/cpp_api_parity/module_impl_check.py", line 251, in test_fn
unit_test_class=self, test_params=unit_test_class.module_test_params_map[self._testMethodName])
File "/var/lib/jenkins/pytorch/test/cpp_api_parity/module_impl_check.py", line 181, in test_forward_backward
run_cpp_test_fn_and_check_output()
File "/var/lib/jenkins/pytorch/test/cpp_api_parity/module_impl_check.py", line 155, in run_cpp_test_fn_and_check_output
msg=generate_error_msg("forward output", cpp_output, python_output))
File "/root/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1144, in assertEqual
self.assertTrue(result, msg=msg)
AssertionError: False is not true : Parity test failed: forward output in C++ has value: 0.00752826314419508, which does not match the corresponding value in Python: 0.0076470584608614445.
- I used ROCm 4.0 and Ubuntu 18.04.5.
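To put the size of the mismatch in perspective, the two values from the AssertionError can be compared directly. The sketch below uses only the standard library. The two tolerance constants are illustrative assumptions and are not the thresholds that the parity test itself applies.

```python
import math

# Values copied from the AssertionError message in the report above.
cpp_output = 0.00752826314419508
python_output = 0.0076470584608614445

abs_diff = abs(cpp_output - python_output)
rel_diff = abs_diff / abs(python_output)

# Illustrative tolerances only (assumptions); the real test may use others.
RTOL_TIGHT = 1e-5
RTOL_LOOSE = 5e-2

print(f"absolute difference: {abs_diff:.3e}")
print(f"relative difference: {rel_diff:.3%}")
print("within tight tolerance:",
      math.isclose(cpp_output, python_output, rel_tol=RTOL_TIGHT))
print("within loose tolerance:",
      math.isclose(cpp_output, python_output, rel_tol=RTOL_LOOSE))
```

The relative difference comes out at roughly 1.5%, which is far larger than typical float32 rounding error, so this looks like more than accumulated rounding noise.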
Hi @wenwu-glagle, which GPU have you been using to run the unit test?
Hi sunway513, I used a Vega 56.
@ROCmSupport, can you help reproduce the reported issue locally? Thanks.