flux
[QUESTION] Hi, I ran ./launch.sh test/python/gemm_rs/test_gemm_rs.py 16384 4096 16384 --dtype=bfloat16 --iters=5 on an NVIDIA H20 and got this error. The same image and command work fine on an H800. Is there something different about the H20?
@ZSL98 can you verify this?
I don't think we hit this issue on H20. Can you double-check on your side? For example, comment out different parts and see which part triggers it. @jinchen89
I cannot reproduce this issue. Have you already resolved it? @jinchen89
I'm using torch 2.3. Is that version too old?
There is no Python stack trace, so I guess there is a core dump. Can you set ulimit -c unlimited and then run it again?
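If changing the shell limit is awkward (for example inside a container entrypoint), roughly the same thing can be done from Python. This is only a sketch, not something taken from the flux repo: `enable_crash_diagnostics` is a hypothetical helper name, and it assumes a Linux host where the standard `resource` module is available. It raises the soft core-dump limit to the hard limit (the in-process equivalent of `ulimit -c`) and turns on `faulthandler` so a native crash at least prints the Python frames.

```python
import faulthandler
import resource
import sys


def enable_crash_diagnostics():
    """Hypothetical helper: call once at the top of the test script."""
    # Ask Python to dump tracebacks of all threads on fatal signals
    # (SIGSEGV, SIGABRT, ...). Skipped quietly if stderr has no real
    # file descriptor, e.g. when output is captured.
    try:
        if not faulthandler.is_enabled():
            faulthandler.enable(file=sys.__stderr__, all_threads=True)
    except (ValueError, OSError, AttributeError):
        pass

    # Raise the soft core-file size limit up to the hard limit.
    # An unprivileged process may always do this; it cannot exceed
    # the hard limit, so a hard limit of 0 still means no core file.
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    soft, hard = enable_crash_diagnostics()
    print("core limit (soft, hard):", soft, hard)
```

Resource limits are inherited by child processes, so setting `ulimit -c unlimited` in the shell before running ./launch.sh remains the simpler route. Setting the environment variable `PYTHONFAULTHANDLER=1` also enables `faulthandler` without touching the script. Where the core file ends up depends on the host's `kernel.core_pattern` setting.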
@jinchen89 Does the problem still occur?