[QUESTION] Hi, I ran `./launch.sh test/python/gemm_rs/test_gemm_rs.py 16384 4096 16384 --dtype=bfloat16 --iters=5` on an NVIDIA H20 and got this error. The same image and command work fine on H800. Is there something different about H20?

Open jinchen89 opened this issue 8 months ago • 6 comments

Image

Hardware/software information is as follows:
- torch = 2.3.1
- cuda = 12.3
- GPU: NVIDIA H20
- Driver Version: 535.161.08
- GPU memory: 97871MiB
- one node, 8 GPUs

Apr 06 '25 04:04 jinchen89

@ZSL98 can you verify this?

Apr 07 '25 23:04 houqi

I don't think we have hit this issue on H20. Can you double-check on your side? For example, comment out different parts of the test and see which part triggers it. @jinchen89

Apr 14 '25 18:04 wenlei-bao

I cannot reproduce this issue. Have you already resolved it? @jinchen89

Apr 15 '25 06:04 ZSL98

I'm using torch 2.3. Is that version too old?

Apr 16 '25 02:04 jinchen89

There is no Python stack trace, so I guess there is a core dump. Can you set `ulimit -c unlimited` and then run it again?
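For reference, a minimal sketch of doing the same thing from inside the Python process, assuming it is called at the very top of the test script in every rank. The helper name `enable_crash_diagnostics` is hypothetical and not part of flux. It raises the core-dump soft limit to the hard limit (the in-process counterpart of `ulimit -c`) and turns on the standard-library `faulthandler`, which prints a Python traceback when the process receives a fatal signal such as SIGSEGV.

```python
import faulthandler
import resource


def enable_crash_diagnostics():
    """Hypothetical helper: allow core dumps and print tracebacks on fatal signals."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit only up to the hard limit,
    # so this matches `ulimit -c unlimited` only when the hard limit is unlimited.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    # Dump the Python traceback of all threads to stderr on SIGSEGV, SIGFPE,
    # SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(all_threads=True)
    return resource.getrlimit(resource.RLIMIT_CORE)


if __name__ == "__main__":
    # Prints the (soft, hard) core-size limits now in effect.
    print(enable_crash_diagnostics())
```

Where the core file is written depends on the system's `kernel.core_pattern` setting, so that may need checking separately.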

Apr 28 '25 02:04 houqi

@jinchen89 Does the problem still occur?

Apr 28 '25 22:04 wenlei-bao