[QUESTION] Is pretraining possible in Megatron using this method?
I've tried several approaches, but due to compatibility issues between the Transformer Engine (TE) version and the PyTorch version, I had difficulty getting Flux to work properly.
You have to use TE, and TE requires a specific torch version; that torch version is not compatible with FLUX, right?
You can compile FLUX from source with the torch version you want.
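Something along these lines should work; the repo URL and build-script invocation below are from memory, so treat it as a sketch and check the Flux README for the exact flags:

```bash
# Pick the torch version you want to target first, then build Flux against it.
pip install torch==2.5.0          # or 2.4.0 / 2.6.0, whichever your stack needs

# Build Flux from source (repo URL and script name assumed; verify in the README).
git clone https://github.com/bytedance/flux.git
cd flux
git submodule update --init --recursive   # third-party dependencies are vendored as submodules
bash ./build.sh                           # see the README / script help for GPU arch and NVSHMEM options
```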
@houqi Thank you for your response. I'm currently trying to run pretraining using the repository below. Would it be possible for you to share a Dockerfile that works with Megatron-LM? https://github.com/ZSL98/Megatron-LM/
Additionally, based on what I’ve found, Flux works under the following conditions:
torch (2.4.0, 2.5.0, 2.6.0), python (3.10, 3.11), and cuda (12.4).
I’m currently building a Dockerfile and attempting pretraining using the nvcr.io/nvidia/pytorch:24.05-py3 image, which meets these requirements.
Do you happen to know if there is a version of Transformer Engine (TE) that is compatible with these versions?
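For reference, this is the quick check I run inside the container to confirm the image actually matches those conditions (as far as I can tell, TE ships preinstalled in the NGC PyTorch images, so this also shows which TE version I'd be pairing Flux with):

```bash
# Run inside nvcr.io/nvidia/pytorch:24.05-py3 to report the versions that matter
# for Flux: python, torch, the CUDA version torch was built with, and TE.
python -c "import sys, torch; print('python', sys.version.split()[0]); print('torch', torch.__version__, '| cuda', torch.version.cuda)"
python -c "import transformer_engine as te; print('transformer_engine', te.__version__)"
nvcc --version | tail -n 1   # CUDA toolkit installed in the image
```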
Sorry, I'm not so familiar with TE; you'll have to find that out yourself.
I will try to make sure that some nvcr.io PyTorch image versions are supported.