pytorch-lightning
ROCm Dockerfile for AMD GPU support
What does this PR do?
Adds a Dockerfile to build Lightning for ROCm
Closes #13609.
Does your PR introduce any breaking changes?
None
Before submitting
- [ ] Was this discussed/approved via a GitHub issue? (not for typos and docs)
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Did you make sure your PR does only one thing, instead of bundling different changes together?
- [x] Did you make sure to update the documentation with your changes? (if necessary)
- [x] Did you write any new necessary tests? (not for typos and docs)
- [x] Did you verify new and existing tests pass locally with your changes?
- [x] Did you list all the breaking changes introduced by this pull request?
- [ ] Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)
PR review
Anyone in the community is welcome to review the PR. Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
- [ ] Is this pull request ready for review? (if not, please submit in draft mode)
- [ ] Check that all items from Before submitting are resolved
- [ ] Make sure the title is self-explanatory and the description concisely explains the PR
- [ ] Add labels and milestones (and optionally projects) to the PR so it can be classified
Did you have fun?
Make sure you had fun coding 🙃
cc @pruthvistony
Thanks for the feedback. I'll work on these changes, and I'll also change the image the Dockerfile builds FROM, from rocm/pytorch:latest to rocm/pytorch:latest-release.
Please review the restructured files in the base-rocm folder. We made significant changes to the structure of the code to mirror the base-cuda Dockerfile more closely.
@sgschwindAMD @pruthvistony Would you mind including 7566923 in this branch as commented in #13610 (comment)? Without this commit, we can't verify the change made in this PR in CI.
Also, following comments in #13609, let's not merge this PR for now. Even if we don't merge this, developers are still able to build docker images from the Dockerfile and work on ROCm integration.
I have cherry-picked 7566923 into the PR. Thanks for the change.
@akihironitta, thanks for the GitHub Actions changes. I see the build-rocm job executing.
@akihironitta @Borda, please let us know whether the PR can be merged, and whether I need to rebase it.
@pruthvistony It cannot be merged before we are able to verify PL with AMD GPUs (see https://github.com/Lightning-AI/lightning/issues/13609#issuecomment-1204851165 for reference).
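For context, one small piece of such a verification is confirming that the installed PyTorch is a ROCm build and that it can see a GPU. The following is only a minimal sketch, not the project's test suite: the helper names `describe_gpu_backend` and `gpu_is_usable` are hypothetical, and the `fake_rocm_torch` object is a stand-in with an assumed version string so the snippet runs without AMD hardware. It relies on `torch.version.hip` being set on ROCm builds and on ROCm builds exposing devices through the `torch.cuda` namespace.

```python
from types import SimpleNamespace


def describe_gpu_backend(torch_module):
    """Return 'rocm', 'cuda', or 'cpu' for a torch-like module (hypothetical helper)."""
    version = getattr(torch_module, "version", None)
    # ROCm builds of PyTorch set torch.version.hip to a version string;
    # on other builds it is None.
    if getattr(version, "hip", None):
        return "rocm"
    # CUDA builds set torch.version.cuda instead.
    if getattr(version, "cuda", None):
        return "cuda"
    return "cpu"


def gpu_is_usable(torch_module):
    """Return True if the torch-like module reports an available GPU (hypothetical helper)."""
    # ROCm builds reuse the torch.cuda namespace for AMD GPUs.
    return bool(torch_module.cuda.is_available())


# Stand-in for `import torch` on a ROCm machine; the version string is an assumption.
fake_rocm_torch = SimpleNamespace(
    version=SimpleNamespace(hip="5.2.0", cuda=None),
    cuda=SimpleNamespace(is_available=lambda: True),
)

print(describe_gpu_backend(fake_rocm_torch), gpu_is_usable(fake_rocm_torch))
```

Inside a container built from the Dockerfile, the same helpers could be called with the real `torch` module in place of the stand-in. A check like this only shows that the build and device are detected; it does not replace running the test suite on AMD hardware.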
@pruthvistony sorry for the delay, let me check it this week :otter:
Closing as we don't have access to AMD hardware at the moment.
@sgschwindAMD Hi, can you please tell me if ROCmSoftwarePlatform:ROCm_support is working properly at this moment?
@IncubatorShokuhou, can you please let me know what you mean by working properly on ROCm? When the PR was raised, it was working properly. By now the fork ROCmSoftwarePlatform:ROCm_support is quite old and hasn't been updated.
We want to add support for ROCm (AMD GPUs) in Lightning-AI, so please let me know if you have access to any AMD hardware. We can work on this, update the PR, and provide support.
cc @carmocca
@pruthvistony We would need to set up a CI job that runs on AMD hardware to be confident about its support. That's what we do for all other accelerators. Would you be able to help with that?
cc @Borda
I am currently getting ready to utilize lightning on a ROCm machine, but I haven't gained access to the machine yet. I sincerely appreciate the opportunity to receive your valuable assistance, and I am truly grateful for your help.
@carmocca @IncubatorShokuhou , Regarding getting CI machines, I need to check internally with AMD infrastructure team/budget and will get back soon.
That would be great! Let me know if I can help...