pytorch Docker error when running tests: ninja is required

🐛 Bug

Hi,

I'm trying to use PyTorch with ROCm. However, after installing the Docker image, the tests fail with the following message:

Executing ['/usr/bin/python3.6', 'test_cpp_api_parity.py'] ... [2020-09-10 11:45:29.112015]
Traceback (most recent call last):
  File "test_cpp_api_parity.py", line 56, in <module>
    module_impl_check.build_cpp_tests(TestCppApiParity, print_cpp_source=PRINT_CPP_SOURCE)
  File "/root/pytorch/test/cpp_api_parity/module_impl_check.py", line 297, in build_cpp_tests
    functions=functions)
  File "/root/pytorch/test/cpp_api_parity/utils.py", line 148, in compile_cpp_code_inline
    verbose=False,
  File "/root/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1130, in load_inline
    keep_intermediates=keep_intermediates)
  File "/root/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1185, in _jit_compile
    with_cuda=with_cuda)
  File "/root/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1252, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/root/.local/lib/python3.6/site-packages/torch/utils/cpp_extension.py", line 1308, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
Traceback (most recent call last):
  File "test/run_test.py", line 716, in <module>
    main()
  File "test/run_test.py", line 705, in main
    raise RuntimeError(err)
RuntimeError: test_cpp_api_parity failed!

To Reproduce

Steps to reproduce the behavior:

sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 8G rocm/pytorch:latest
(inside the docker) PYTORCH_TEST_WITH_ROCM=1 python3.6 test/run_test.py

Expected behavior

All the tests should run without errors.

Environment

(out of the docker)
PyTorch version: 1.7.0a0+8acce55
Is debug build: False
CUDA used to build PyTorch: Could not collect

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-10ubuntu2) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.19.1
[pip3] torch==1.7.0a0+8acce55
[conda] Could not collect

Sep 10 '20 11:09 apalazzi

Thank you, @apalazzi, for bringing this to our attention. There is indeed a missing dependency in the Docker image you are using. Please install ninja inside the container with the command pip3.6 install ninja and then rerun the test; it should work. We will add the missing dependency to the Docker image.
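
For anyone who wants to confirm whether ninja is visible inside the container before rerunning the suite, here is a minimal sketch using only the Python standard library. The helper name ninja_available is hypothetical. It is not the check PyTorch itself performs (the traceback above shows that living in verify_ninja_availability in torch/utils/cpp_extension.py); it only tests that a ninja executable is on PATH and answers --version.

```python
import shutil
import subprocess


def ninja_available() -> bool:
    """Return True if a `ninja` executable is on PATH and runs successfully.

    Hypothetical helper for illustration; not part of PyTorch.
    """
    if shutil.which("ninja") is None:
        return False
    try:
        # `ninja --version` exits with status 0 when the binary works.
        subprocess.run(
            ["ninja", "--version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return False
    return True


if ninja_available():
    print("ninja found")
else:
    print("ninja not found; try: pip3.6 install ninja")
```

If this reports that ninja is missing even after the pip install, one possible cause (an assumption, not something confirmed in this thread) is that pip placed the executable in a directory that is not on PATH.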

Sep 10 '20 21:09 ashishfarmer

Hi,

ninja is still missing from the rocm/pytorch:latest Docker image at this time.
After installing ninja, the test reports this error:

FAIL: test_torch_nn_MSELoss_prec_cuda (main.TestCppApiParity)

Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/cpp_api_parity/module_impl_check.py", line 251, in test_fn
    unit_test_class=self, test_params=unit_test_class.module_test_params_map[self._testMethodName])
  File "/var/lib/jenkins/pytorch/test/cpp_api_parity/module_impl_check.py", line 181, in test_forward_backward
    run_cpp_test_fn_and_check_output()
  File "/var/lib/jenkins/pytorch/test/cpp_api_parity/module_impl_check.py", line 155, in run_cpp_test_fn_and_check_output
    msg=generate_error_msg("forward output", cpp_output, python_output))
  File "/root/.local/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1144, in assertEqual
    self.assertTrue(result, msg=msg)
AssertionError: False is not true : Parity test failed: forward output in C++ has value: 0.00752826314419508, which does not match the corresponding value in Python: 0.0076470584608614445.

I used ROCm 4.0 on Ubuntu 18.04.5.
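
To put the size of the mismatch in perspective, the two values from the assertion message can be compared directly. This is only a rough sketch: the tolerance below is an illustrative assumption, not the one the parity test actually applies.

```python
import math

# Values copied from the AssertionError message above.
cpp_value = 0.00752826314419508
python_value = 0.0076470584608614445

abs_diff = abs(cpp_value - python_value)
rel_diff = abs_diff / abs(python_value)

print(f"absolute difference: {abs_diff:.3e}")
print(f"relative difference: {rel_diff:.3%}")

# Illustrative tolerance only (assumption); the real test may use another.
ILLUSTRATIVE_REL_TOL = 1e-5
print("within tolerance:", math.isclose(cpp_value, python_value, rel_tol=ILLUSTRATIVE_REL_TOL))
```

The relative difference comes out at roughly 1.5%, which is well beyond what ordinary floating-point rounding would explain.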

Jan 21 '21 10:01 wenwu-glagle

Hi @wenwu-glagle, which GPU did you use to run the unit test?

Mar 01 '21 16:03 sunway513

Hi sunway513, I used a Vega 56.

Mar 18 '21 02:03 wenwu-glagle

@ROCmSupport, can you help reproduce the reported issue locally? Thanks.

May 22 '21 18:05 sunway513