[WIP] Training on AMD / ROCm
Currently, there are the following issues on LUMI / AMD MI250x for PyTorch:
- ROCm flash attention is not supported natively in PyTorch and must be used via the external package https://github.com/ROCm/flash-attention/
  - I think native support depends on this PR: https://github.com/ROCm/pytorch/pull/1353
- When flash attention is enabled, multi-GPU training does not work (it crashes with HIP memory allocation errors)
  - I think this is caused by PyTorch using ROCm 5.6, while the LUMI system driver officially only supports ROCm up to 5.4. According to the LUMI people, a driver upgrade is planned for ~April; let's see.
  - I think a related error is appearing for rocm/tensorflow: https://github.com/ROCm/tensorflow-upstream/issues/2289#issuecomment-1909583933
With the custom flash attention package, single-GPU training works OK though.
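To check the suspected version mismatch (PyTorch built against a newer ROCm than the driver supports), something like the following rough sketch might help. It compares the ROCm/HIP version reported by `torch.version.hip` with the highest version the system driver supports. The helper names and the example version string are made up for illustration, and the `5.4` limit is just the value mentioned above, not something queried from the system.

```python
import importlib.util
import re


def major_minor(version):
    """Extract (major, minor) from a string such as '5.6.31061-8c743ae38'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def build_exceeds_driver(build_version, max_supported):
    """True if the ROCm version PyTorch was built with is newer than
    the newest ROCm version the driver officially supports."""
    return major_minor(build_version) > major_minor(max_supported)


def installed_hip_version():
    """Return torch.version.hip, or None if torch is missing or not a ROCm build."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return getattr(torch.version, "hip", None)
```

For example, calling `build_exceeds_driver(installed_hip_version(), "5.4")` on a ROCm build would flag the situation described above; on a machine without a ROCm build of PyTorch, `installed_hip_version()` returns `None` and the comparison should be skipped.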
I followed the LUMI PyTorch setup from here: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/#example-for-distributed-learning
Do they only need to update to ROCm 6.0.5? Instead of waiting for a new Docker image, you could try, as others did, to use an up-to-date ROCm repository.
PyTorch 2.3.0 was released, which seems to have improved built-in FlashAttention support on ROCm.
https://github.com/pytorch/pytorch/releases/tag/v2.3.0
Need to wait for the new rocm/pytorch tag: https://hub.docker.com/r/rocm/pytorch/tags
Also I haven't seen any update to the LUMI driver yet: https://lumi-supercomputer.github.io/LUMI-training-materials/User-Updates/
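Once a suitable ROCm build is available, a quick probe along these lines could tell whether the built-in flash kernel actually runs, without needing the external package. This is only a sketch: the function name and tensor shapes are arbitrary, it assumes a PyTorch version that ships `torch.nn.attention.sdpa_kernel`, and I have not run it on LUMI.

```python
import importlib.util


def probe_builtin_flash_attention():
    """Try to run scaled_dot_product_attention restricted to the flash backend.

    Returns a short status string instead of raising, so it can be used
    in a job script on any node.
    """
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    # ROCm builds of PyTorch expose GPUs through the torch.cuda namespace.
    if not torch.cuda.is_available():
        return "no GPU visible"
    try:
        from torch.nn.attention import SDPBackend, sdpa_kernel
    except ImportError:
        return "torch.nn.attention.sdpa_kernel not available in this PyTorch"

    # Small half-precision tensors: (batch, heads, sequence, head_dim).
    q = torch.randn(1, 4, 128, 64, device="cuda", dtype=torch.float16)
    try:
        # Only allow the flash backend, so a fallback kernel cannot hide a failure.
        with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
            torch.nn.functional.scaled_dot_product_attention(q, q, q)
    except RuntimeError as exc:
        return f"flash kernel unavailable: {exc}"
    return "flash kernel ran"


if __name__ == "__main__":
    print(probe_builtin_flash_attention())
```

Restricting the backend is deliberate: without it, `scaled_dot_product_attention` may silently fall back to another implementation and the check would say nothing about flash attention.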
This is the current information from the LUMI team:
"Thank you for sending this improvement request. We are very aware of issues associated with ROCm environment not being updated frequently. The next system upgrade that would happen in summer will include GPU driver and ROCm update. Details will be announced whenever precise schedule is decided."
Things are moving at LUMI but at a glacial pace:
"We will be taking the system offline for maintenance starting on Monday, 19 August, 2024. LUMI won't be accessible as this will affect all the partitions. Significant parts of the system software will be updated, and more particularly the system software stack, in order to get a more stable and up-to-date system after the break. We expect the system to be back in production on Monday, 9 September, 2024."
I was able to resolve the memory errors, but multi-GPU training still does not work.