Do you compile with `--arch 90`?
@ZSL98 can you verify this?
There is no Python stack trace, so I guess there is a core dump. Can you set `ulimit -c unlimited` and then run again?
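If adjusting the shell limit is awkward (for example when the job is launched through a wrapper script), a rough equivalent is to raise the core-file limit from inside the Python entry point before the crash. This is just a sketch using the standard `resource` module, not anything FLUX-specific:

```python
import resource

# Roughly equivalent to `ulimit -c unlimited` for this process and its children:
# raise the soft core-file size limit to the current hard limit so a native
# crash leaves a core dump behind.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
```

Once the core file exists, loading it with `gdb <python-binary> <core-file>` and running `bt` gives the native backtrace that the Python traceback is missing.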
V2 and V3 are nearly the same; refer to https://github.com/bytedance/flux/blob/main/python/flux/cpp_mod.pyi#L649.
It seems no torch 2.4.0 wheel package was compiled. @zheng-ningxin
You have to use TE, and TE requires a specific torch version. That torch version is not compatible with FLUX, right? You can compile FLUX from source with the torch version...
> Additionally, based on what I’ve found, Flux works under the following conditions: **torch (2.4.0, 2.5.0, 2.6.0), python (3.10, 3.11), and cuda (12.4).** I’m currently building a Dockerfile and attempting...
> [@houqi](https://github.com/houqi) Thank you for your response. I'm currently trying to run pretraining using the repository below. Would it be possible for you to share a Dockerfile that works with...
Yes, it's possible in theory, but it is not implemented. I'm not sure that split-k is faster than stream-k; can you provide some cases where split-k is faster than stream-k?
Thanks for your report. You can refer to https://github.com/bytedance/flux/blob/main/src/generator/gen_moe_gather_rs.cc#L90 and add some arguments there:

```C++
// The concrete template arguments (the tile Shape and the cute::Int values)
// are in the linked file; they are omitted in this excerpt.
static constexpr auto AllGemmHParams_FP16 = make_space_gemm_hparams(
    cute::make_tuple(make_gemm_v3_hparams(Shape{})),
    cute::make_tuple(
        make_gather_rs_hparams(cute::Int{}, cute::Int{}),
        make_gather_rs_hparams(cute::Int{}, cute::Int{}),
        make_gather_rs_hparams(cute::Int{}, cute::Int{}),
        make_gather_rs_hparams(cute::Int{}, cute::Int{}),
        ...
```
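For illustration only, a fully specified entry fills in those template parameters, e.g. `make_gather_rs_hparams(cute::Int<1024>{}, cute::Int<8>{})`; the values here are hypothetical, not taken from the repo, so pick values consistent with the neighbouring entries in gen_moe_gather_rs.cc.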