Add support for Neuron and HabanaLabs devices

Open mrgolin opened this pull request 2 years ago • 11 comments

This PR adds an abstraction layer for supporting various AI HW accelerators.

Add support for using Neuron accelerator device memory. To enable Neuron support, configure with --enable-neuron --with-neuron=<path/to/neuron/base>. Run bandwidth tests with Neuron direct by specifying the --use_neuron option.

Add support for using Habana Labs accelerator device memory. To enable Habana Labs support, configure with --enable-hl --with-hl=<habanalabs/installation/prefix>. Run bandwidth tests with Habana Labs direct by specifying --use_hl <pci_bus_id>.
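A minimal sketch of what such a memory-provider abstraction could look like (hypothetical field names; the actual interface introduced by this PR may differ):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical provider interface: each memory type (host, CUDA, Neuron,
 * Habana Labs, ...) fills in its own callbacks, and the benchmarks only
 * ever call through these pointers instead of accelerator-specific APIs. */
struct memory_ctx {
	int (*init)(struct memory_ctx *ctx);
	int (*destroy)(struct memory_ctx *ctx);
	int (*allocate_buffer)(struct memory_ctx *ctx, int alignment,
			       uint64_t size, void **addr);
	int (*free_buffer)(struct memory_ctx *ctx, void *addr, uint64_t size);
	void *(*copy_host_to_buffer)(void *dest, const void *src, size_t size);
	void *(*copy_buffer_to_host)(void *dest, const void *src, size_t size);
};

/* Each accelerator would then expose a constructor, e.g. a hypothetical
 * neuron_memory_create() or hl_memory_create(pci_bus_id), selected by the
 * corresponding --use_* option. */
```

With an interface like this, adding a new accelerator only means providing another implementation of the callbacks plus its configure and runtime flags.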

mrgolin avatar Jul 07 '22 13:07 mrgolin

Hi @mrgolin, thanks for the pull request! Can you please add an informative commit message?

HassanKhadour avatar Jul 12 '22 07:07 HassanKhadour

@HassanKhadour Thanks for taking a look! I've updated the commit message to better explain the change and its use case.

mrgolin avatar Jul 12 '22 09:07 mrgolin

Copy-pasting the same concepts every time a new accelerator is introduced isn't very scalable, I would argue it's time to abstract these and have specific "hardware providers" implementations.

gal-pressman avatar Jul 31 '22 11:07 gal-pressman

> Copy-pasting the same concepts every time a new accelerator is introduced isn't very scalable, I would argue it's time to abstract these and have specific "hardware providers" implementations.

+1, abstracting out the accelerator alloc/copy routines would make things maintainable for the individual benchmarks going forward.

rajachan avatar Aug 02 '22 17:08 rajachan

@galpress @rajachan Agreed, it's about time to refactor the HW accelerators' memory usage code. I'll take this on.

mrgolin avatar Aug 10 '22 17:08 mrgolin

I've added an abstraction for various memory types as well as support for Habana Labs AI accelerators. @galpress @rajachan @HassanKhadour I would appreciate your review.

mrgolin avatar Sep 18 '22 10:09 mrgolin

@HassanKhadour Can you please share failing check results?

mrgolin avatar Sep 18 '22 11:09 mrgolin

Thanks Michael for your commit, will try to review ASAP. Following are the cppcheck failures:

Perftest_memory

HassanKhadour avatar Sep 18 '22 15:09 HassanKhadour

Hi Michael, following is the failure:

src/host_memory.c:68:26: style: Unused variable: host_ctx [unusedVariable]
struct host_memory_ctx *host_ctx;

HassanKhadour avatar Sep 19 '22 09:09 HassanKhadour

The check actually passed; it reported the wrong status, which has been fixed.

HassanKhadour avatar Sep 19 '22 13:09 HassanKhadour

Thanks!

mrgolin avatar Sep 19 '22 13:09 mrgolin

@galpress @rajachan @HassanKhadour any comments would be appreciated.

mrgolin avatar Oct 23 '22 09:10 mrgolin

How is it that this PR is still waiting for review, after we were asked to refactor perftest's accelerator memory support, while at the same time commits are being pushed directly to master that add a new memory type (CUDA dma-buf) the old way, bloating the code, making it unmaintainable (exactly as @gal-pressman and @rajachan mentioned) and making this PR unmergeable?

https://github.com/linux-rdma/perftest/commit/8b666479b6b6a18c1e9b9d6b8b0d3e82e0f677a1
https://github.com/linux-rdma/perftest/commit/e589f4cc658611796ef491d326ab1faa70d14315
https://github.com/linux-rdma/perftest/commit/b0823b7000c9130396a07749d692b730120dcb82
https://github.com/linux-rdma/perftest/commit/31159151cf90737ddd31b49fd1c412016a427978

mrgolin avatar Nov 24 '22 10:11 mrgolin

Hi @mrgolin,

We are sorry we weren't in the loop of the review from the beginning, and we apologize for the inconvenience this has caused. I would like to ask @gal-pressman and @rajachan, who were in the loop, whether they would be okay re-reviewing. I don't have any problem with merging this feature beyond running a few basic tests to check its performance, although I don't see any code that could cause a performance difference. Once we have the needed review we can merge it, and the dma_buf will be handled by me and @sshaulnv.

HassanKhadour avatar Nov 24 '22 11:11 HassanKhadour

I merely suggested it doesn't make sense that the same code is copy-pasted again and again. Since you're already pushing other code that does it, might as well merge this one as well..

gal-pressman avatar Nov 24 '22 11:11 gal-pressman

> I merely suggested it doesn't make sense that the same code is copy-pasted again and again. Since you're already pushing other code that does it, might as well merge this one as well..

Gal, I actually did what you suggested (see https://github.com/linux-rdma/perftest/pull/155/commits/457fc467ec70a3c4680a1444f71fac48b0968aab) and would appreciate it if you could review this again.

mrgolin avatar Nov 24 '22 12:11 mrgolin

> Thanks for your contribution. The rdma_cm is damaged, getting a segmentation fault after running with it. Can you please fix it and address the comments?

Thanks for your review. We are currently not using rdma_cm; can you please share dmesg and any additional info that might help find the issue?

mrgolin avatar Dec 21 '22 12:12 mrgolin

> Thanks for your contribution. The rdma_cm is damaged, getting a segmentation fault after running with it. Can you please fix it and address the comments?
>
> Thanks for your review. We are currently not using rdma_cm; can you please share dmesg and any additional info that might help find the issue?

The segfault occurs on the client side. I'll explain the flow that causes it, e.g. for ib_write_bw: establish_connection (line 120) --> rdma_client_connect --> ctx_init --> ctx->memory->init(ctx->memory).

But ctx->memory still isn't initialized at this stage.

HassanKhadour avatar Dec 27 '22 08:12 HassanKhadour

> Thanks for your contribution. The rdma_cm is damaged, getting a segmentation fault after running with it. Can you please fix it and address the comments?
>
> Thanks for your review. We are currently not using rdma_cm; can you please share dmesg and any additional info that might help find the issue?
>
> The segfault occurs on the client side. I'll explain the flow that causes it, e.g. for ib_write_bw: establish_connection (line 120) --> rdma_client_connect --> ctx_init --> ctx->memory->init(ctx->memory).
>
> But ctx->memory still isn't initialized at this stage.

Thanks for specifying the error flow. Fixed.
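To illustrate the ordering issue described above, here is a simplified, hypothetical sketch (names follow the quoted flow rather than the actual perftest sources; create_memory_ctx() is a made-up stand-in for whatever constructor selects the provider):

```c
#include <stddef.h>

struct memory_ctx {
	int (*init)(struct memory_ctx *ctx);
	/* allocate/free/copy callbacks omitted for brevity */
};

struct pingpong_context {
	struct memory_ctx *memory;
	/* ... other fields ... */
};

/* Hypothetical constructor that picks the provider (host, CUDA, Neuron, ...). */
struct memory_ctx *create_memory_ctx(void);

static int ctx_init(struct pingpong_context *ctx)
{
	/* This is the dereference that segfaulted: on the rdma_cm client path,
	 * ctx->memory had not been assigned yet when ctx_init() ran. */
	return ctx->memory->init(ctx->memory);
}

static int rdma_client_connect(struct pingpong_context *ctx)
{
	/* Conceptual fix: make sure the memory provider exists before ctx_init()
	 * is reached on this path. */
	if (ctx->memory == NULL)
		ctx->memory = create_memory_ctx();
	return ctx_init(ctx);
}
```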

mrgolin avatar Jan 09 '23 16:01 mrgolin

@HassanKhadour we've rebased this PR on the CUDA dma-buf support changes and adapted them to our refactoring.

mrgolin avatar Jan 09 '23 16:01 mrgolin

@HassanKhadour Can you please share failing checks?

mrgolin avatar Jan 19 '23 13:01 mrgolin

[screenshot of the failing check results]

HassanKhadour avatar Jan 19 '23 16:01 HassanKhadour

All checks have failed

@HassanKhadour please elaborate.

mrgolin avatar Jan 22 '23 11:01 mrgolin

[screenshot of the check results]

HassanKhadour avatar Jan 22 '23 13:01 HassanKhadour

use_cuda with rdma_cm is broken; we are getting a segmentation fault.

sshaulnv avatar Jan 25 '23 13:01 sshaulnv

> use_cuda with rdma_cm is broken; we are getting a segmentation fault.

Thanks! I fixed a bug in perftest_communication; please report if there are any other related issues.

mrgolin avatar Jan 29 '23 18:01 mrgolin

@HassanKhadour If there are no additional comments, can you please merge this PR?

mrgolin avatar Feb 06 '23 16:02 mrgolin

Hi @mrgolin, I had some additional comments on the code. Sure, we are working on merging the change and have set a plan for it, but as you know it's a huge change that touches all of the perftest code, and we are trying to test everything in it. Thanks

HassanKhadour avatar Feb 06 '23 16:02 HassanKhadour

> Hi @mrgolin, I had some additional comments on the code. Sure, we are working on merging the change and have set a plan for it, but as you know it's a huge change that touches all of the perftest code, and we are trying to test everything in it. Thanks

Thanks! Please keep me in the loop and let me know if I can help in any way.

mrgolin avatar Feb 06 '23 16:02 mrgolin

Hi @mrgolin, we would appreciate your help in testing the ROCm changes on devices that support ROCm and on other vendors' devices that support RDMA; from our side, I'm testing on Nvidia devices. Also, can you please answer the comments above?

Thanks

HassanKhadour avatar Feb 08 '23 12:02 HassanKhadour