ROCm Dockerfile for AMD GPU support

Open sgschwindAMD opened this issue 3 years ago • 4 comments

What does this PR do?

Adds a Dockerfile to build Lightning for ROCm

Closes #13609.

Does your PR introduce any breaking changes?

None

Before submitting

  • [ ] Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
  • [x] Did you make sure to update the documentation with your changes? (if necessary)
  • [x] Did you write any new necessary tests? (not for typos and docs)
  • [x] Did you verify new and existing tests pass locally with your changes?
  • [x] Did you list all the breaking changes introduced by this pull request?
  • [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

  • [ ] Is this pull request ready for review? (if not, please submit in draft mode)
  • [ ] Check that all items from Before submitting are resolved
  • [ ] Make sure the title is self-explanatory and the description concisely explains the PR
  • [ ] Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

cc @pruthvistony

sgschwindAMD avatar Jul 12 '22 00:07 sgschwindAMD

Thanks for the feedback. I'll work on these changes, and I'll also change the base image in the Dockerfile's FROM line from rocm/pytorch:latest to rocm/pytorch:latest-release.

sgschwindAMD avatar Jul 12 '22 21:07 sgschwindAMD

Please review the restructured files in the base-rocm folder. We made significant changes to the structure of the code so that it mirrors the base-cuda Dockerfile more closely.

sgschwindAMD avatar Aug 04 '22 00:08 sgschwindAMD

@sgschwindAMD @pruthvistony Would you mind including 7566923 in this branch as commented in #13610 (comment)? Without this commit, we can't verify the change made in this PR in CI.

Also, following comments in #13609, let's not merge this PR for now. Even if we don't merge this, developers are still able to build docker images from the Dockerfile and work on ROCm integration.

I have cherry-picked 7566923 into the PR. Thanks for the change.

pruthvistony avatar Aug 05 '22 06:08 pruthvistony

@akihironitta, thanks for the GitHub Actions changes. I see the build-rocm job executing.

pruthvistony avatar Aug 06 '22 00:08 pruthvistony

@akihironitta @Borda, please let us know whether this PR can be merged, and whether I need to rebase it.

pruthvistony avatar Aug 11 '22 07:08 pruthvistony

@pruthvistony It cannot be merged before we are able to verify PL with AMD GPUs (see https://github.com/Lightning-AI/lightning/issues/13609#issuecomment-1204851165 for reference).

justusschock avatar Aug 11 '22 21:08 justusschock

@pruthvistony sorry for the delay, let me check it this week :otter:

Borda avatar Sep 12 '22 19:09 Borda

Closing as we don't have access to AMD hardware at the moment.

carmocca avatar Nov 02 '22 20:11 carmocca

@sgschwindAMD Hi, can you please tell me if ROCmSoftwarePlatform:ROCm_support is working properly at this moment?

IncubatorShokuhou avatar Jul 06 '23 13:07 IncubatorShokuhou

@IncubatorShokuhou, can you please let me know what you mean by working properly on ROCm? When the PR was raised, it was working properly. By now the fork ROCmSoftwarePlatform:ROCm_support is quite old and hasn't been updated.

We want to add support for ROCm (AMD GPUs) in Lightning-AI, so please let me know if you have access to any AMD hardware. We can work on this, update the PR, and provide support.

cc @carmocca

pruthvistony avatar Jul 06 '23 17:07 pruthvistony

@pruthvistony We would need to set up a CI job that runs on AMD hardware to be confident about its support. That's what we do for all other accelerators. Would you be able to help with that?

cc @Borda

carmocca avatar Jul 09 '23 18:07 carmocca
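
Until such a CI job exists, a quick local check can at least confirm that the installed PyTorch is a ROCm build and that it sees a GPU. The following is a minimal sketch and not part of this PR: the helper name `describe_torch_backend` is hypothetical, and it relies only on `torch.version.hip` (a version string on ROCm builds, `None` otherwise) and `torch.cuda.is_available()`, which ROCm builds of PyTorch also use for AMD GPUs. It says nothing about whether Lightning itself works correctly on that hardware.

```python
import importlib.util


def describe_torch_backend():
    """Report whether the installed PyTorch is a ROCm (HIP) build and sees a GPU.

    Hypothetical helper for illustration; it does not come from the PR.
    """
    info = {
        "torch_installed": False,
        "rocm_build": False,
        "hip_version": None,
        "gpu_visible": False,
    }

    # Return the defaults instead of failing when PyTorch is not installed.
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch_installed"] = True
    # torch.version.hip is None on CPU-only and CUDA builds.
    hip_version = getattr(torch.version, "hip", None)
    info["hip_version"] = hip_version
    info["rocm_build"] = hip_version is not None
    # ROCm builds expose AMD GPUs through the torch.cuda namespace.
    info["gpu_visible"] = bool(torch.cuda.is_available())
    return info


if __name__ == "__main__":
    print(describe_torch_backend())
```

Running this inside a container built from the Dockerfile would show whether the image carries a ROCm build of PyTorch; a passing check is only a precondition for the hardware CI discussed above, not a substitute for it.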

I am currently getting ready to utilize lightning on a ROCm machine, but I haven't gained access to the machine yet. I sincerely appreciate the opportunity to receive your valuable assistance, and I am truly grateful for your help.

IncubatorShokuhou avatar Jul 20 '23 03:07 IncubatorShokuhou

@carmocca @IncubatorShokuhou, regarding getting CI machines, I need to check internally with the AMD infrastructure team/budget and will get back to you soon.

pruthvistony avatar Jul 27 '23 06:07 pruthvistony

Regarding getting CI machines, I need to check internally with AMD infrastructure team/budget and will get back soon.

That would be great! Let me know if I can help...

Borda avatar Jul 27 '23 17:07 Borda