perftest
perftest copied to clipboard
Add support for Neuron and HabanaLabs devices
This PR adds abstraction layer for supporting various AI HW accelerators.
Add support for Neuron accelerator device memory usage.
To support Neuron, configure using --enable-neuron --with-neuron=<path/to/neuron/base>.
Run bandwidth tests with Neuron direct by specifying --use_neuron
Add support for Habana Labs accelerators device memory usage. To support Habana Labs, configure using --enable-hl --with-hl=<habanalabs/installation/prefix> Run bandwidth tests with Habana Labs direct by specifying --use_hl <pci_bus_id>.
Hi @mrgolin , thanks for the pull request! can you please add informative commit message ?
@HassanKhadour Thanks for taking a look! I've updated the commit message to better explain the change and its use case.
Copy-pasting the same concepts every time a new accelerator is introduced isn't very scalable, I would argue it's time to abstract these and have specific "hardware providers" implementations.
Copy-pasting the same concepts every time a new accelerator is introduced isn't very scalable, I would argue it's time to abstract these and have specific "hardware providers" implementations.
+1, abstracting out the accelerator alloc/copy routines would make things maintainable for the individual benchmarks going forward.
@galpress @rajachan Agree it's about time for refactoring HW accelerators memory usage code. I'll take this on.
I've added abstraction for various memory types as well as support for Habana Labs AI accelerators. @galpress @rajachan @HassanKhadour I would appreciate your review.
@HassanKhadour Can you please share failing check results?
Thanks Michael for your commit, Will try to review ASAP. following the cppcheck failures:
Hi Michael, following the failure:
src/host_memory.c:68:26: style: Unused variable: host_ctx [unusedVariable] 12:14:56 struct host_memory_ctx *host_ctx;
The check has been passed, reported wrong status, fixed it
Thanks!
@galpress @rajachan @HassanKhadour any comments will be appreciated.
How is that this PR is still waiting for review after we've been asked to refactor perftest accelerators memory support and at the same time commits are pushed directly to master adding a new memory type (Cuda dma-buf) in an old way, blowing the code, making it unmaintainable (exactly as @gal-pressman and @rajachan mentioned) and making this PR unmergeable?
https://github.com/linux-rdma/perftest/commit/8b666479b6b6a18c1e9b9d6b8b0d3e82e0f677a1 https://github.com/linux-rdma/perftest/commit/e589f4cc658611796ef491d326ab1faa70d14315 https://github.com/linux-rdma/perftest/commit/b0823b7000c9130396a07749d692b730120dcb82 https://github.com/linux-rdma/perftest/commit/31159151cf90737ddd31b49fd1c412016a427978
Hi mrgolin,
We are sorry we weren't in the loop of the review from the beginning, we apologize about the inconvenience that is caused. I want to ask @gal-pressman @rajachan who were in the loop if it would be okay if they rereview, I don't have any problem with merging this feature more than running few basic test in order to test its performance although I don't see any code that can possibly cause performance difference. once having the needed review we can merge it and the dma_buf will be handled by me and @sshaulnv
I merely suggested it doesn't make sense that the same code is copy-pasted again and again. Since you're already pushing other code that does it, might as well merge this one as well..
I merely suggested it doesn't make sense that the same code is copy-pasted again and again. Since you're already pushing other code that does it, might as well merge this one as well..
Gal, I actually did what you suggested (See https://github.com/linux-rdma/perftest/pull/155/commits/457fc467ec70a3c4680a1444f71fac48b0968aab) and will appreciate if you can review this again.
Thanks for your contribution. The rdma_cm is damaged, getting segmentation fault after running with it. Can you please fix it and address the comments ?
Thanks for your review. We are currently not using rdma_cm, can you please share dmesg and any additional info that might help find the issue.
Thanks for your contribution. The rdma_cm is damaged, getting segmentation fault after running with it. Can you please fix it and address the comments ?
Thanks for your review. We are currently not using rdma_cm, can you please share dmesg and any additional info that might help find the issue.
The segfault occurs in Client side. I'll explain the flow that causes the segfault - e.g. for ib_write_bw: establish_connection (line 120) --> rdma_client_connect --> ctx_init --> ctx->memory->init(ctx->memory)
But the ctx->memory still isn't initiated in this stage.
Thanks for your contribution. The rdma_cm is damaged, getting segmentation fault after running with it. Can you please fix it and address the comments ?
Thanks for your review. We are currently not using rdma_cm, can you please share dmesg and any additional info that might help find the issue.
The segfault occurs in Client side. I'll explain the flow that causes the segfault - e.g. for ib_write_bw: establish_connection (line 120) --> rdma_client_connect --> ctx_init --> ctx->memory->init(ctx->memory)
But the ctx->memory still isn't initiated in this stage.
Thanks for specifying the error flow. Fixed.
@HassanKhadour we've rebased this PR on Cuda dmabuf support changes and adopted it to our refactoring.
@HassanKhadour Can you please share failing checks?
All checks have failed
@HassanKhadour please elaborate.
the use_cuda with rdma_cm is damaged, getting segmentation fault.
the use_cuda with rdma_cm is damaged, getting segmentation fault.
Thanks! I fixed a bug in perftest_communication, please report if there are other related issues.
@HassanKhadour If there are no additional comments can you please merge this PR?
Hi @mrgolin , I had some additional comments on the code. Sure we are working on merging the change and we have set a plan for it, but as you know its a huge change that touch all perftest code and we are trying to test everything in it. Thanks
Hi @mrgolin , I had some additional comments on the code. Sure we are working on merging the change and we have set a plan for it, but as you know its a huge change that touch all perftest code and we are trying to test everything in it. Thanks
Thanks! Please keep me looped and let me know if I can help anyhow.
Hi @mrgolin, Appreciate your help in testing the ROCm changes on devices that support ROCm & other vendor devices that supports RDMA, from our side I'm testing on Nvidia devices. Also can u please answer the comments above?
Thanks