Document Supported Datatypes Per Compute Capability
I've found that the latest Docker images (and presumably this repo broadly) do not support RTX Pro 6000 (SM120) for MXFP8 (see the error below).
I've been unable to find any documentation on SM120, either on the docs page or in this repo, about which formats are or aren't supported in software. Similarly, I haven't been able to find any information on whether there's a roadmap to add support.
Would really appreciate any info on this! I really want to train with that FP8/FP4!
```
Traceback (most recent call last):
  File "/app/src/llm_train.py", line 373, in <module>
    main()
  File "/app/src/llm_train.py", line 313, in main
    with precision_context:
  File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/fp8.py", line 664, in fp8_autocast
    check_recipe_support(fp8_recipe)
  File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/fp8.py", line 77, in check_recipe_support
    assert recipe_supported, unsupported_reason
           ^^^^^^^^^^^^^^^^
AssertionError: MXFP8 (for all gemm layouts) is not supported on 12.0+ architectures yet.
```
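In the meantime, a workaround is to probe the compute capability up front and fall back to BF16 rather than hitting the assertion inside `fp8_autocast`. A minimal sketch; the capability-to-recipe mapping here is an assumption inferred from the error message above (MXFP8 accepted on 10.x, rejected on 12.0+), not documented behavior, and `pick_recipe` is a hypothetical helper:

```python
def pick_recipe(compute_cap: tuple[int, int]) -> str:
    """Map a CUDA compute capability to a training precision.

    Assumption (inferred from the AssertionError above, not from any
    official support table): MXFP8 is only accepted on 10.x parts,
    while 12.0+ (SM120, e.g. RTX Pro 6000) is currently rejected.
    """
    major, _minor = compute_cap
    if major == 10:
        return "mxfp8"
    # Safe fallback everywhere else, including SM120.
    return "bf16"


# In a real script the capability would come from
# torch.cuda.get_device_capability(); hard-coded here for illustration.
print(pick_recipe((10, 0)))  # mxfp8
print(pick_recipe((12, 0)))  # bf16 instead of the AssertionError
```

The string returned would then drive whether the script enters `fp8_autocast` at all or trains in plain BF16.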
**A good solution would look like**
- A section in the README and a page on the docs website with a table of supported data formats, with columns for hardware and software support.
- A section in the README and a page on the docs with a timeline showing software support for SM120 being planned for XYZ date.
**Alternatives I've considered**
Using different hardware seems to be the only option right now. It seems MS-AMP, and pretty much everything else across the board, does not support SM120 for FP8/FP4.
**Additional Context**
I'm working on a research project and have been developing on SM120; I wanted to test my code with FP8/FP4 to evaluate performance/speed tradeoffs before renting larger hardware for full training jobs.