AZP: Add amd rocm hosts to test

Open avildema opened this issue 3 years ago • 8 comments

Add AMD ROCM testing

avildema avatar Apr 29 '22 10:04 avildema

@avildema

  CCLD     test_memhooks
/bin/ld: cannot find -ludev
collect2: error: ld returned 1 exit status

This looks like a missing dependency on the host; I guess it comes from the hpcx-gcc module. We need to install libudev-devel or similar.
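A quick way to confirm this kind of failure on a host is to try linking an empty program against the library in question. The sketch below is only an illustration, not part of the UCX build: the helper name `can_link` and the default compiler `cc` are assumptions, and the package that provides the development files depends on the distribution.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def can_link(lib: str, compiler: str = "cc") -> bool:
    """Return True if `compiler` can link an empty C program with -l<lib>.

    Hypothetical helper for illustration; it mirrors what the failing
    link step needs (the linker must be able to resolve -l<lib>).
    """
    if shutil.which(compiler) is None:
        raise RuntimeError(f"compiler {compiler!r} not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "probe.c"
        src.write_text("int main(void) { return 0; }\n")
        result = subprocess.run(
            [compiler, str(src), f"-l{lib}", "-o", str(Path(tmp) / "probe")],
            capture_output=True,
            text=True,
        )
    return result.returncode == 0


if __name__ == "__main__":
    # "udev" is the library named in the linker error above.
    print("libudev linkable:", can_link("udev"))
```

If this reports `False` on the worker, the development files for libudev are most likely not installed, which matches the `cannot find -ludev` error.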

yosefe avatar Apr 29 '22 12:04 yosefe

Is there anything I can do to help with this task?

edgargabriel avatar Jun 15 '22 16:06 edgargabriel

> Is there anything I can do to help with this task?

No, it just needed a re-run, which I have done.

avildema avatar Jun 15 '22 17:06 avildema

@edgargabriel can you pls check the rocm build failures?

yosefe avatar Jun 16 '22 07:06 yosefe

The compilation problems seen on the rocm workers are due to https://github.com/openucx/ucx/pull/8321; once that is merged, it should compile. To pass the gtests, PR #8275 will also be required.

edgargabriel avatar Jun 16 '22 12:06 edgargabriel

/azp run

yosefe avatar Jun 18 '22 11:06 yosefe

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines[bot] avatar Jun 18 '22 11:06 azure-pipelines[bot]

I had a look at the errors in the rocm workers. It is not clear to me whether all of them are rocm-related issues. For rocm worker 0, the error comes from:

[swx-rdmz-instinct02:2015177:0:2015177]      wireup.c:1388 Fatal: endpoint reconfiguration not supported yet
/scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../src/ucp/wireup/wireup.c: [ ucp_wireup_init_lanes() ]

Worker 1 might be rocm related; I will look into that later this week. The stack trace is:

stdbuf -e0 -o0 /scrap/azure/agent-04/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest --gtest_filter=malloc_hook_cplusplus.mallopt 
[swx-rdmz-instinct01:1789375:0:1789375] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))

For worker 2, the error is:

[ RUN      ] rc/test_ucp_tag_match_rndv.bidir_multi_exp_post/3 <rc_v/put_zcopy,proto>
[swx-rdmz-instinct02:2012276:0:2012276] rc_verbs_iface.c:120  send completion with error: Work Request Flushed Error [qpn 0x1bea6 wrid 0xa1vendor_err 0xf4]
[swx-rdmz-instinct02:2012276:0:2012276] rc_verbs_iface.c:120  [rqpn 0x1bea5 dlid=0 sl=0 port=1 src_path_bits=0 dgid=fe80::e42:a1ff:fe75:1a57 sgid_index=0 traffic_class=0]

/scrap/azure/agent-05/AZP_WORKSPACE/2/s/contrib/../src/uct/ib/rc/verbs/rc_verbs_iface.c: [ uct_rc_verbs_handle_failure() ]
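For sorting failures like these by host and source location, a small parser over the bracketed log prefix can help. This is a rough sketch based only on the line shapes pasted above: the regular expression, the field names, and the helpers `parse_line` and `group_by_host` are assumptions for illustration, not a UCX tool, and the two numeric fields after the PID are skipped because their meaning is not stated here.

```python
import re
from collections import defaultdict

# Assumed shape: [host:pid:N:N] optional "file.c:line" then the message.
LOG_RE = re.compile(
    r"^\[(?P<host>[^:\]\s]+):(?P<pid>\d+):\d+:\d+\]\s+"
    r"(?:(?P<file>[\w.]+):(?P<line>\d+)\s+)?"
    r"(?P<msg>.*)$"
)


def parse_line(text):
    """Parse one log line; return a dict, or None if the prefix is absent."""
    m = LOG_RE.match(text.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["pid"] = int(rec["pid"])
    rec["line"] = int(rec["line"]) if rec["line"] else None
    return rec


def group_by_host(lines):
    """Group parsed records by host name, skipping lines that do not parse."""
    grouped = defaultdict(list)
    for text in lines:
        rec = parse_line(text)
        if rec is not None:
            grouped[rec["host"]].append(rec)
    return dict(grouped)
```

Lines without the host prefix, such as the gtest `[ RUN      ]` banner, are skipped, so a full test log can be fed in unfiltered.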

edgargabriel avatar Jul 11 '22 15:07 edgargabriel