Add support for Neuron and HabanaLabs devices

Open mrgolin opened this pull request 2 years ago • 11 comments

This PR adds an abstraction layer for supporting various AI HW accelerators.

Add support for using Neuron accelerator device memory. To enable Neuron support, configure with --enable-neuron --with-neuron=<path/to/neuron/base>. Run bandwidth tests with Neuron direct by specifying the --use_neuron option.

Add support for using Habana Labs accelerator device memory. To enable Habana Labs support, configure with --enable-hl --with-hl=<habanalabs/installation/prefix>. Run bandwidth tests with Habana Labs direct by specifying --use_hl <pci_bus_id>.
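A minimal sketch of what such a memory-provider abstraction could look like (hypothetical field names; the actual interface introduced by this PR may differ):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical provider interface: each memory type (host, CUDA, Neuron,
 * Habana Labs, ...) fills in its own callbacks, and the benchmarks only
 * ever call through these pointers instead of accelerator-specific APIs. */
struct memory_ctx {
	int (*init)(struct memory_ctx *ctx);
	int (*destroy)(struct memory_ctx *ctx);
	int (*allocate_buffer)(struct memory_ctx *ctx, int alignment,
			       uint64_t size, void **addr);
	int (*free_buffer)(struct memory_ctx *ctx, void *addr, uint64_t size);
	void *(*copy_host_to_buffer)(void *dest, const void *src, size_t size);
	void *(*copy_buffer_to_host)(void *dest, const void *src, size_t size);
};

/* Each accelerator would then expose a constructor, e.g. a hypothetical
 * neuron_memory_create() or hl_memory_create(pci_bus_id), selected by the
 * corresponding --use_* option. */
```

With an interface like this, adding a new accelerator only means providing another implementation of the callbacks plus its configure and runtime flags.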

mrgolin avatar Jul 07 '22 13:07 mrgolin

Hi @mrgolin, thanks for the pull request! Can you please add an informative commit message?

HassanKhadour avatar Jul 12 '22 07:07 HassanKhadour

@HassanKhadour Thanks for taking a look! I've updated the commit message to better explain the change and its use case.

mrgolin avatar Jul 12 '22 09:07 mrgolin

Copy-pasting the same concepts every time a new accelerator is introduced isn't very scalable, I would argue it's time to abstract these and have specific "hardware providers" implementations.

gal-pressman avatar Jul 31 '22 11:07 gal-pressman

> Copy-pasting the same concepts every time a new accelerator is introduced isn't very scalable, I would argue it's time to abstract these and have specific "hardware providers" implementations.

+1, abstracting out the accelerator alloc/copy routines would make things maintainable for the individual benchmarks going forward.

rajachan avatar Aug 02 '22 17:08 rajachan

@galpress @rajachan Agreed, it's about time to refactor the HW accelerators' memory usage code. I'll take this on.

mrgolin avatar Aug 10 '22 17:08 mrgolin

I've added an abstraction for various memory types as well as support for Habana Labs AI accelerators. @galpress @rajachan @HassanKhadour I would appreciate your review.

mrgolin avatar Sep 18 '22 10:09 mrgolin

@HassanKhadour Can you please share failing check results?

mrgolin avatar Sep 18 '22 11:09 mrgolin

Thanks Michael for your commit, will try to review ASAP. Following are the cppcheck failures:

Perftest_memory

HassanKhadour avatar Sep 18 '22 15:09 HassanKhadour

Hi Michael, following is the failure:

src/host_memory.c:68:26: style: Unused variable: host_ctx [unusedVariable]
struct host_memory_ctx *host_ctx;

HassanKhadour avatar Sep 19 '22 09:09 HassanKhadour

The check actually passed; it reported the wrong status, which has been fixed.

HassanKhadour avatar Sep 19 '22 13:09 HassanKhadour

Thanks!

mrgolin avatar Sep 19 '22 13:09 mrgolin

@galpress @rajachan @HassanKhadour any comments would be appreciated.

mrgolin avatar Oct 23 '22 09:10 mrgolin

How is it that this PR is still waiting for review, after we were asked to refactor perftest's accelerator memory support, while at the same time commits are being pushed directly to master that add a new memory type (CUDA dma-buf) the old way, bloating the code, making it unmaintainable (exactly as @gal-pressman and @rajachan mentioned) and making this PR unmergeable?

https://github.com/linux-rdma/perftest/commit/8b666479b6b6a18c1e9b9d6b8b0d3e82e0f677a1
https://github.com/linux-rdma/perftest/commit/e589f4cc658611796ef491d326ab1faa70d14315
https://github.com/linux-rdma/perftest/commit/b0823b7000c9130396a07749d692b730120dcb82
https://github.com/linux-rdma/perftest/commit/31159151cf90737ddd31b49fd1c412016a427978

mrgolin avatar Nov 24 '22 10:11 mrgolin

Hi @mrgolin,

We are sorry we weren't in the loop of the review from the beginning, and we apologize for the inconvenience this has caused. I would like to ask @gal-pressman and @rajachan, who were in the loop, whether they would be okay re-reviewing. I don't have any problem with merging this feature beyond running a few basic tests to check its performance, although I don't see any code that could cause a performance difference. Once we have the needed review we can merge it, and the dma_buf will be handled by me and @sshaulnv.

HassanKhadour avatar Nov 24 '22 11:11 HassanKhadour

I merely suggested it doesn't make sense that the same code is copy-pasted again and again. Since you're already pushing other code that does it, might as well merge this one as well..

gal-pressman avatar Nov 24 '22 11:11 gal-pressman

> I merely suggested it doesn't make sense that the same code is copy-pasted again and again. Since you're already pushing other code that does it, might as well merge this one as well..

Gal, I actually did what you suggested (see https://github.com/linux-rdma/perftest/pull/155/commits/457fc467ec70a3c4680a1444f71fac48b0968aab) and would appreciate it if you could review this again.

mrgolin avatar Nov 24 '22 12:11 mrgolin

> Thanks for your contribution. The rdma_cm is damaged, getting a segmentation fault after running with it. Can you please fix it and address the comments?

Thanks for your review. We are currently not using rdma_cm; can you please share dmesg and any additional info that might help find the issue?

mrgolin avatar Dec 21 '22 12:12 mrgolin

> Thanks for your contribution. The rdma_cm is damaged, getting a segmentation fault after running with it. Can you please fix it and address the comments?
>
> Thanks for your review. We are currently not using rdma_cm; can you please share dmesg and any additional info that might help find the issue?

The segfault occurs on the client side. I'll explain the flow that causes it, e.g. for ib_write_bw: establish_connection (line 120) --> rdma_client_connect --> ctx_init --> ctx->memory->init(ctx->memory).

But ctx->memory still isn't initialized at this stage.

HassanKhadour avatar Dec 27 '22 08:12 HassanKhadour

> Thanks for your contribution. The rdma_cm is damaged, getting a segmentation fault after running with it. Can you please fix it and address the comments?
>
> Thanks for your review. We are currently not using rdma_cm; can you please share dmesg and any additional info that might help find the issue?
>
> The segfault occurs on the client side. I'll explain the flow that causes it, e.g. for ib_write_bw: establish_connection (line 120) --> rdma_client_connect --> ctx_init --> ctx->memory->init(ctx->memory).
>
> But ctx->memory still isn't initialized at this stage.

Thanks for specifying the error flow. Fixed.
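To illustrate the ordering issue described above, here is a simplified, hypothetical sketch (names follow the quoted flow rather than the actual perftest sources; create_memory_ctx() is a made-up stand-in for whatever constructor selects the provider):

```c
#include <stddef.h>

struct memory_ctx {
	int (*init)(struct memory_ctx *ctx);
	/* allocate/free/copy callbacks omitted for brevity */
};

struct pingpong_context {
	struct memory_ctx *memory;
	/* ... other fields ... */
};

/* Hypothetical constructor that picks the provider (host, CUDA, Neuron, ...). */
struct memory_ctx *create_memory_ctx(void);

static int ctx_init(struct pingpong_context *ctx)
{
	/* This is the dereference that segfaulted: on the rdma_cm client path,
	 * ctx->memory had not been assigned yet when ctx_init() ran. */
	return ctx->memory->init(ctx->memory);
}

static int rdma_client_connect(struct pingpong_context *ctx)
{
	/* Conceptual fix: make sure the memory provider exists before ctx_init()
	 * is reached on this path. */
	if (ctx->memory == NULL)
		ctx->memory = create_memory_ctx();
	return ctx_init(ctx);
}
```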

mrgolin avatar Jan 09 '23 16:01 mrgolin

@HassanKhadour we've rebased this PR on the CUDA dma-buf support changes and adapted them to our refactoring.

mrgolin avatar Jan 09 '23 16:01 mrgolin

@HassanKhadour Can you please share failing checks?

mrgolin avatar Jan 19 '23 13:01 mrgolin

[screenshot of the failing check results]

HassanKhadour avatar Jan 19 '23 16:01 HassanKhadour

All checks have failed

@HassanKhadour please elaborate.

mrgolin avatar Jan 22 '23 11:01 mrgolin

[screenshot of the check results]

HassanKhadour avatar Jan 22 '23 13:01 HassanKhadour

use_cuda with rdma_cm is broken; we are getting a segmentation fault.

sshaulnv avatar Jan 25 '23 13:01 sshaulnv

> use_cuda with rdma_cm is broken; we are getting a segmentation fault.

Thanks! I fixed a bug in perftest_communication; please report if there are any other related issues.

mrgolin avatar Jan 29 '23 18:01 mrgolin

@HassanKhadour If there are no additional comments, can you please merge this PR?

mrgolin avatar Feb 06 '23 16:02 mrgolin

Hi @mrgolin, I had some additional comments on the code. Sure, we are working on merging the change and have set a plan for it, but as you know it's a huge change that touches all of the perftest code, and we are trying to test everything in it. Thanks

HassanKhadour avatar Feb 06 '23 16:02 HassanKhadour

> Hi @mrgolin, I had some additional comments on the code. Sure, we are working on merging the change and have set a plan for it, but as you know it's a huge change that touches all of the perftest code, and we are trying to test everything in it. Thanks

Thanks! Please keep me in the loop and let me know if I can help in any way.

mrgolin avatar Feb 06 '23 16:02 mrgolin

Hi @mrgolin, we would appreciate your help in testing the ROCm changes on devices that support ROCm and on other vendors' devices that support RDMA; from our side, I'm testing on Nvidia devices. Also, can you please answer the comments above?

Thanks

HassanKhadour avatar Feb 08 '23 12:02 HassanKhadour