[QUESTION] Is pretraining possible in Megatron using this method?

Open yeontaek opened this issue 7 months ago • 5 comments

I've tried several approaches, but due to compatibility issues between the Transformer Engine (TE) version and the PyTorch version, I had difficulty getting Flux to work properly.

yeontaek avatar May 20 '25 10:05 yeontaek

So you need TE, TE requires a specific torch version, and that torch version is not compatible with FLUX, right?

You can compile FLUX from source with the torch version you want.

houqi avatar May 29 '25 03:05 houqi

@houqi Thank you for your response. I'm currently trying to run pretraining using the repository below. Would it be possible for you to share a Dockerfile that works with Megatron-LM? https://github.com/ZSL98/Megatron-LM/

yeontaek avatar Jul 10 '25 06:07 yeontaek

Additionally, based on what I’ve found, Flux works under the following conditions: torch (2.4.0, 2.5.0, 2.6.0), python (3.10, 3.11), and cuda (12.4). I’m currently building a Dockerfile and attempting pretraining using the nvcr.io/nvidia/pytorch:24.05-py3 image, which meets these requirements. Do you happen to know if there is a version of Transformer Engine (TE) that is compatible with these versions?
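To make the version constraints above easier to check inside a candidate image, here is a minimal sketch that compares the installed torch, Python, and CUDA versions against the combinations listed in this post. The helper names `major_minor` and `check_versions` are hypothetical, the version sets are only what is reported in this thread (not an official Flux support matrix), and matching on major.minor rather than the exact patch release is an assumption.

```python
import sys

# Versions reported in this thread as working with Flux.
# Assumption: only major.minor matters, so 2.4.x is treated like 2.4.0.
SUPPORTED_TORCH = {"2.4", "2.5", "2.6"}
SUPPORTED_PYTHON = {"3.10", "3.11"}
SUPPORTED_CUDA = {"12.4"}


def major_minor(version):
    """Reduce a version string such as '2.4.0+cu124' to '2.4'."""
    parts = str(version).split("+")[0].split(".")
    return ".".join(parts[:2])


def check_versions(torch_version, python_version, cuda_version):
    """Return a list of human-readable mismatches (empty if all match)."""
    problems = []
    if major_minor(torch_version) not in SUPPORTED_TORCH:
        problems.append(f"torch {torch_version} is not in {sorted(SUPPORTED_TORCH)}")
    if major_minor(python_version) not in SUPPORTED_PYTHON:
        problems.append(f"python {python_version} is not in {sorted(SUPPORTED_PYTHON)}")
    if major_minor(cuda_version) not in SUPPORTED_CUDA:
        problems.append(f"cuda {cuda_version} is not in {sorted(SUPPORTED_CUDA)}")
    return problems


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
        # torch.version.cuda is None for CPU-only builds.
        cuda_version = torch.version.cuda or "none"
        issues = check_versions(torch.__version__, python_version, cuda_version)
        print("\n".join(issues) if issues else "environment matches the listed versions")
```

Running this inside the container before installing TE should show whether the base image itself already falls outside the listed combinations. It says nothing about which TE release matches, which is still the open question.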

yeontaek avatar Jul 10 '25 06:07 yeontaek

> Additionally, based on what I’ve found, Flux works under the following conditions: torch (2.4.0, 2.5.0, 2.6.0), python (3.10, 3.11), and cuda (12.4). I’m currently building a Dockerfile and attempting pretraining using the nvcr.io/nvidia/pytorch:24.05-py3 image, which meets these requirements. Do you happen to know if there is a version of Transformer Engine (TE) that is compatible with these versions?

Sorry, I'm not very familiar with TE. You will have to find that out yourself.

houqi avatar Jul 24 '25 08:07 houqi

> @houqi Thank you for your response. I'm currently trying to run pretraining using the repository below. Would it be possible for you to share a Dockerfile that works with Megatron-LM? https://github.com/ZSL98/Megatron-LM/

I will try to make sure that some nvcr.io pytorch versions are supported.

houqi avatar Jul 24 '25 08:07 houqi