Edgar Gabriel

Results 137 comments of Edgar Gabriel
@simonbyrne can you please try with the UCX 1.13.0-rc1 release? We fixed an issue with IPC creation (although we didn't test on ROCm 3.x, we did on ROCm 4.5...

@bosilca, I like the idea, but I am wondering whether we might need both a regular memcpy and a vector_copy operation. I am thinking about integration with e.g. UCX,...

@bwbarrett I am more worried about the fact that the vector_copy will have a different signature than the memcpy function, and if e.g. UCX has to be adjusted to use this...

Actually, I take it back; maybe it's not an issue. They are invoking pml_ucx_generic_datatype_pack/unpack, which then call opal_convertor_pack/unpack, and that interface would probably not change.

Is there anything I can do to help with this task?

The compilation problems seen on the ROCm workers are due to https://github.com/openucx/ucx/pull/8321; once that is merged, it should compile. To pass the gtests, PR #8275 will also be required.

I had a look at the errors in the ROCm workers; it is not clear to me whether all of them are ROCm-related issues. For ROCm worker 0 the...

Can you also try changing two settings for the parallel I/O part? 1. Force using a different fcoll component, e.g. export OMPI_MCA_fcoll=dynamic 2. Force using a...
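
For illustration, the first suggestion could be applied as in the sketch below. The environment variable is taken from the comment itself; the `mpirun --mca` form is the usual command-line equivalent for an MCA parameter, and the application name and process count in it are placeholders that do not come from the original thread. The second suggestion is truncated in the excerpt, so it is not covered here.

```shell
# Option 1 (from the comment): select the fcoll component via the environment.
export OMPI_MCA_fcoll=dynamic

# Option 2: pass the same MCA parameter on the command line instead.
# ./my_io_app and -np 4 are placeholders, not values from the thread.
# mpirun --mca fcoll dynamic -np 4 ./my_io_app

# Show the value that Open MPI will pick up from the environment.
echo "fcoll=${OMPI_MCA_fcoll}"
```

The environment-variable form is convenient when the launch command is buried in a batch script; unsetting the variable restores the default component selection.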

OK, thank you. Maybe you can use this last flag as a workaround for now. Is there a way to reproduce this issue with a smaller process count as well?...

@greole Unfortunately, I cannot reproduce the bug; all gtests for ROCm pass on my setup with ROCm 5.1.1. The bug that you were pointing at as a related issue...