[BUG] Fp8 Runtime Error: "bad any_cast"
I tried to test FP8 GEMMs with the following command, but it failed.
python3 test/python/gemm_only/test_gemm_only.py 4096 12288 6144 --dtype=float8_e4m3fn
The error message is below:
/usr/local/lib/python3.10/dist-packages/torch/utils/_pytree.py:185: FutureWarning: optree is installed but the version is too old to support PyTorch Dynamo in C++ pytree. C++ pytree support is disabled. Please consider upgrading optree using python3 -m pip install --upgrade 'optree>=0.13.0'.
warnings.warn(
Traceback (most recent call last):
File "flux/test/python/gemm_only/test_gemm_only.py", line 239, in
Can you provide more information about your compile environment, such as the CUDA version and hardware info?
We support FP8, so I'm not sure why it fails.
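For gathering the runtime side of that information, here is a small plain-PyTorch sketch (nothing flux-specific; the printed fields are only suggestions):

import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # expect (9, 0) on H20/H200 (Hopper)
# FP8 dtypes are only present in recent PyTorch builds
print("has float8_e4m3fn:", hasattr(torch, "float8_e4m3fn"))
print("has float8_e5m2:", hasattr(torch, "float8_e5m2"))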
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
I ran this on a machine with 8 H20 GPUs. It seems that the FP8 GEMM-related tests don't work, while test_moe_ag.py and test_moe_gather_rs.py run successfully.
Did you compile with --arch 90?
@DXHPC what is your build command?
I used the following commands to build the package and install it:
./build.sh --arch "90" --nvshmem --package
pip install ...
Since I only need to test flux on Hopper GPUs, I dropped archs 80 and 89.
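As a quick sanity check, a sketch (the module name flux is assumed from the test scripts, and the byte-flux distribution name from the pip command later in this thread) to confirm which flux build the process actually imports and that the visible GPU matches the single compiled arch:

import importlib.metadata
import torch
import flux  # module name assumed from the test scripts

print("flux imported from:", flux.__file__)  # should point at the locally built install, not a stale wheel
try:
    # distribution name assumed from `pip install byte-flux`; a local build may register under a different name
    print("installed dist version:", importlib.metadata.version("byte-flux"))
except importlib.metadata.PackageNotFoundError:
    print("no 'byte-flux' distribution found")
print("compute capability:", torch.cuda.get_device_capability(0))  # must be (9, 0) if built with --arch 90 only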
Same error on H200:
conda create -n "flux" python=3.10
conda activate flux
pip install torch==2.4.0 numpy
pip install byte-flux
Same bug:
./launch.sh test/python/gemm_rs/test_gemm_rs.py 64 5120 7168 --dtype=float8_e5m2 --iters=10
torchrun --node_rank=0 --nproc_per_node=8 --nnodes=1 --rdzv_endpoint=127.0.0.1:23456 test/python/gemm_rs/test_gemm_rs.py 64 5120 7168 --dtype=float8_e5m2 --iters=10
W0414 12:01:55.362000 30737 torch/distributed/run.py:793]
W0414 12:01:55.362000 30737 torch/distributed/run.py:793] *****************************************
W0414 12:01:55.362000 30737 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0414 12:01:55.362000 30737 torch/distributed/run.py:793] *****************************************
WARNING:root:Failed to load NVSHMEM libs
WARNING:root:Failed to load NVSHMEM libs
WARNING:root:Failed to load NVSHMEM libs
WARNING:root:Failed to load NVSHMEM libs
WARNING:root:Failed to load NVSHMEM libs
WARNING:root:Failed to load NVSHMEM libs
WARNING:root:Failed to load NVSHMEM libs
WARNING:root:Failed to load NVSHMEM libs
[rank2]:[W414 12:02:24.401985987 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank0]:[W414 12:02:24.416766471 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank7]:[W414 12:02:24.423479359 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7] using GPU 7 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank6]:[W414 12:02:24.428786923 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6] using GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank4]:[W414 12:02:24.429948913 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4] using GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank5]:[W414 12:02:24.439084635 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5] using GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank1]:[W414 12:02:24.440322036 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank3]:[W414 12:02:24.443424526 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
[rank7]: Traceback (most recent call last):
[rank7]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 408, in <module>
[rank7]: perf_res_flux = perf_flux(
[rank7]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank7]: return func(*args, **kwargs)
[rank7]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 211, in perf_flux
[rank7]: _ = gemm_only_op.forward(
[rank7]: RuntimeError: bad any_cast
[rank0]: Traceback (most recent call last):
[rank0]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 408, in <module>
[rank0]: perf_res_flux = perf_flux(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 211, in perf_flux
[rank0]: _ = gemm_only_op.forward(
[rank0]: RuntimeError: bad any_cast
[rank2]: Traceback (most recent call last):
[rank2]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 408, in <module>
[rank2]: perf_res_flux = perf_flux(
[rank2]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank2]: return func(*args, **kwargs)
[rank2]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 211, in perf_flux
[rank2]: _ = gemm_only_op.forward(
[rank2]: RuntimeError: bad any_cast
[rank5]: Traceback (most recent call last):
[rank5]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 408, in <module>
[rank5]: perf_res_flux = perf_flux(
[rank5]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank5]: return func(*args, **kwargs)
[rank5]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 211, in perf_flux
[rank5]: _ = gemm_only_op.forward(
[rank5]: RuntimeError: bad any_cast
[rank1]: Traceback (most recent call last):
[rank1]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 408, in <module>
[rank1]: perf_res_flux = perf_flux(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank1]: return func(*args, **kwargs)
[rank1]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 211, in perf_flux
[rank1]: _ = gemm_only_op.forward(
[rank1]: RuntimeError: bad any_cast
[rank6]: Traceback (most recent call last):
[rank6]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 408, in <module>
[rank6]: perf_res_flux = perf_flux(
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank6]: return func(*args, **kwargs)
[rank6]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 211, in perf_flux
[rank6]: _ = gemm_only_op.forward(
[rank6]: RuntimeError: bad any_cast
[rank3]: Traceback (most recent call last):
[rank3]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 408, in <module>
[rank3]: perf_res_flux = perf_flux(
[rank3]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]: return func(*args, **kwargs)
[rank3]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 211, in perf_flux
[rank3]: _ = gemm_only_op.forward(
[rank3]: RuntimeError: bad any_cast
[rank4]: Traceback (most recent call last):
[rank4]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 408, in <module>
[rank4]: perf_res_flux = perf_flux(
[rank4]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]: return func(*args, **kwargs)
[rank4]: File "/sgl-workspace/flux/test/python/gemm_rs/test_gemm_rs.py", line 211, in perf_flux
[rank4]: _ = gemm_only_op.forward(
[rank4]: RuntimeError: bad any_cast
[rank0]:[W414 12:02:41.712569611 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0414 12:02:42.343000 30737 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 30811 closing signal SIGTERM
W0414 12:02:42.343000 30737 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 30812 closing signal SIGTERM
W0414 12:02:42.343000 30737 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 30813 closing signal SIGTERM
W0414 12:02:42.344000 30737 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 30814 closing signal SIGTERM
W0414 12:02:42.344000 30737 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 30815 closing signal SIGTERM
W0414 12:02:42.344000 30737 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 30816 closing signal SIGTERM
W0414 12:02:42.344000 30737 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 30817 closing signal SIGTERM
E0414 12:02:43.400000 30737 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 7 (pid: 30818) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test/python/gemm_rs/test_gemm_rs.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-04-14_12:02:42
host : ucpe-resource033041131100.na131
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 30818)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================