Enrico Minack
You already did: https://github.com/apache/incubator-mxnet/issues/20936
I'd recommend renaming the issue. Simplifying the context and specifying the problem helps to get someone to look into it, for example: Building MXNet 1.9 from sources breaks `mxnet.libinfo.find_include_path()`
@shirosankaku @mshiryaev do you want to look into adding OneCCL test coverage to our CI?
@i-kosarev glad to hear that. I managed to get the OneCCL test image to build on GitHub, so we can put it back into the CI. But some tests fail (5)...
As @francares [pointed out](https://github.com/horovod/horovod/pull/3492#issuecomment-1082020956), OneCCL should use a fixed version (e.g. https://github.com/oneapi-src/oneCCL/tree/2021.5.2), not master.
@richardliaw @tgaddair the `check_resources` method seems to be very flaky (see 1. above). I am going to remove it entirely in #3430: https://github.com/horovod/horovod/pull/3430/files#diff-142c833c54b6f513791b91a64842d417cb4025f79afcd0eca791eefdd2d2847fL71-L79 If you feel like this test is...
Looks like `RayExecutor` produces a broken `CUDA_VISIBLE_DEVICES`: https://buildkite.com/horovod/horovod/builds/7308#461d92d2-110c-4539-ab04-703f49478c52/231-323

```
>       assert len(all_envs[0]["CUDA_VISIBLE_DEVICES"].split(",")) == 4, all_envs[0]["CUDA_VISIBLE_DEVICES"]
E       AssertionError: 0,3,1,2,1,2,0,3
E       assert 8 == 4
E         +8
E         -4
```
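To make the failure easier to read, here is a minimal sketch that breaks down the reported value. It is not Horovod or Ray code, and `summarize_visible_devices` is a hypothetical helper introduced only for illustration. It suggests the value holds 8 entries but only 4 distinct device IDs, each listed twice.

```python
from collections import Counter


def summarize_visible_devices(value: str) -> dict:
    """Hypothetical helper: count entries in a CUDA_VISIBLE_DEVICES-style string."""
    # Split on commas and drop empty fragments (e.g. from a trailing comma).
    ids = [part.strip() for part in value.split(",") if part.strip()]
    counts = Counter(ids)
    return {
        "total": len(ids),                 # number of entries as written
        "unique": sorted(counts),          # distinct device IDs
        "duplicated": sorted(d for d, n in counts.items() if n > 1),
    }


# Value taken from the assertion message above.
summary = summarize_visible_devices("0,3,1,2,1,2,0,3")
print(summary)
```

If that reading is right, the worker sees the expected four devices, but the list is assembled twice somewhere along the way.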
@ashahab @amogkam @richardliaw any idea why this:

```python
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
address_info = ray.init(num_cpus=4, num_gpus=4)
setting = RayExecutor.create_settings(timeout_s=30)
hjob = RayExecutor(
    setting, num_hosts=1, num_workers_per_host=4, use_gpu=True)
hjob.start()
all_envs = hjob.execute(lambda _:...
```
@ashahab can you take a look please?
Will this slow down tests?