ucx
AZP: Add amd rocm hosts to test
Add AMD ROCM testing
@avildema
CCLD test_memhooks
/bin/ld: cannot find -ludev
collect2: error: ld returned 1 exit status
This looks like a missing dependency on the host; I guess it comes from the hpcx-gcc module. We need to install libudev-devel or a similar package.
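One way to confirm this on the host is to try linking a trivial program against the library, which exercises the same `-ludev` lookup that failed above. This is a minimal sketch, assuming a C compiler is reachable as `cc` (or via `$CC`); the `can_link` helper is a hypothetical name for illustration and is not part of UCX or its build scripts.

```shell
# Hypothetical helper: return 0 if the linker can resolve -l<name>, 1 otherwise.
# Assumes a C compiler is available as $CC or cc.
can_link() {
    lib="$1"
    tmp="$(mktemp -d)" || return 2
    # Trivial program: only the link step matters here.
    printf 'int main(void) { return 0; }\n' > "$tmp/t.c"
    if ${CC:-cc} "$tmp/t.c" -o "$tmp/t" "-l$lib" 2>/dev/null; then
        rc=0
    else
        rc=1
    fi
    rm -rf "$tmp"
    return $rc
}

if can_link udev; then
    echo "libudev: linkable"
else
    echo "libudev: NOT linkable (development package likely missing)"
fi
```

A runtime-only install (for example just `libudev.so.1`) is not enough for `-ludev`; the linker needs the unversioned `libudev.so` that the development package normally provides, which is why the check goes through the compiler rather than listing loaded libraries.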
Is there anything I can do to help with this task?
No, it just needed a re-run, which I have done.
@edgargabriel can you pls check the rocm build failures?
The compilation problems seen on the rocm workers are addressed in https://github.com/openucx/ucx/pull/8321; once that is merged, it should compile. To pass the gtests, PR #8275 will also be required.
/azp run
Azure Pipelines successfully started running 3 pipeline(s).
I had a look at the errors on the rocm workers, and it is not clear to me whether all of them are rocm-related issues. For rocm worker 0, the error is:
[swx-rdmz-instinct02:2015177:0:2015177] wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]
Worker 1 might be rocm related; I will look into that later this week. The stack trace is:
stdbuf -e0 -o0 /scrap/azure/agent-04/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest --gtest_filter=malloc_hook_cplusplus.mallopt
[swx-rdmz-instinct01:1789375:0:1789375] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
For worker 2, the error is:
[ RUN ] rc/test_ucp_tag_match_rndv.bidir_multi_exp_post/3 <rc_v/put_zcopy,proto>
[swx-rdmz-instinct02:2012276:0:2012276] rc_verbs_iface.c:120 send completion with error: Work Request Flushed Error [qpn 0x1bea6 wrid 0xa1vendor_err 0xf4]
[swx-rdmz-instinct02:2012276:0:2012276] rc_verbs_iface.c:120 [rqpn 0x1bea5 dlid=0 sl=0 port=1 src_path_bits=0 dgid=fe80::e42:a1ff:fe75:1a57 sgid_index=0 traffic_class=0]
/scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/uct/ib/rc/verbs/rc_verbs_iface.c: [ uct_rc_verbs_handle_failure() ]