lightning-thunder
[ci] : We should add a CI flow with TransformerEngine installed so that we can run the relevant tests.
We don't have a CI flow for testing the TransformerEngine (TE) executor.
It would be great to have a CI flow with TE installed so that we can run the relevant tests and catch breakages early. It could also be enabled only before merge, or triggered with a GitHub comment.
NOTE: TE needs to be built from source at the moment. (@xwang233 knows how to set this up in Docker.)
cc @borda
@kshitij12345 could you please share a reference for TE and, if possible, how it needs to be installed?
Here are the instructions: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html#installation-from-source
ok, will check it later this week
As highlighted by @carmocca (thanks!), our CI runs on RTX 3090 GPUs, which don't have FP8 support. We should pursue this once the CI has GPUs with compute capability 8.9 or higher.
GPU Compute Capability Ref: https://developer.nvidia.com/cuda-gpus
TE Check for FP8 support: https://github.com/NVIDIA/TransformerEngine/blob/f85553ea369da15fd726ab279818e415be48a228/transformer_engine/pytorch/fp8.py#L23-L34
Cc @t-vi @lantiga: as far as I know, there is a plan to change the GPUs used for CI.
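The hardware requirement above can be sketched as a small check. This is a minimal illustration, not TE's actual implementation: `supports_fp8` and `current_device_supports_fp8` are hypothetical helper names, and the sketch only compares the compute capability against 8.9, whereas the linked TE check may also take library versions into account.

```python
from typing import Tuple

# Minimum compute capability for FP8, as stated in the discussion above.
MIN_FP8_CAPABILITY = (8, 9)


def supports_fp8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is at least 8.9."""
    return tuple(capability) >= MIN_FP8_CAPABILITY


def current_device_supports_fp8() -> bool:
    """Check the current CUDA device; returns False without torch or a GPU."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    return supports_fp8(torch.cuda.get_device_capability())


if __name__ == "__main__":
    # RTX 3090 is compute capability 8.6; H100 is 9.0.
    for name, cap in [("RTX 3090", (8, 6)), ("Ada", (8, 9)), ("H100", (9, 0))]:
        print(f"{name}: FP8 supported = {supports_fp8(cap)}")
```

A CI job could call something like `current_device_supports_fp8()` up front and skip the TE tests on runners that do not qualify.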
#209 for reference
TransformerEngine is now available on PyPI (https://pypi.org/project/transformer-engine/):
pip install transformer_engine[pytorch]
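Once the package is installed, a CI step could confirm that it is discoverable before running the TE tests. The following is a minimal sketch; `te_installed` is a hypothetical helper name, and it assumes the importable module is called `transformer_engine`.

```python
import importlib.util


def te_installed(module_name: str = "transformer_engine") -> bool:
    """Return True if the top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("TransformerEngine installed:", te_installed())
```

Using `find_spec` avoids importing the package itself, so the check stays cheap and does not touch the GPU.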
Wanted to check if upgrading the GPUs in CI is still being planned (ref). This would help us run TE tests in CI to avoid silent regressions.
Example - https://github.com/Lightning-AI/lightning-thunder/issues/1624, https://github.com/Lightning-AI/lightning-thunder/pull/1626
Another example - https://github.com/Lightning-AI/lightning-thunder/pull/1690
Yes, this is absolutely desirable, and thank you for trying to keep up with the TE things manually. To my mind, the most likely scenario would be to run the TE tests on an Ada GPU via the Lightning SDK. (@lantiga)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Any update on this?
cc: @t-vi @lantiga
Here are the test timings on an H100 (about 2 minutes in total):
pytest thunder/tests/test_transformer_engine_executor.py -v
======================================= 8 passed, 2 skipped, 1 xfailed, 42 warnings in 4.16s =======================================
pytest thunder/tests/distributed/test_ddp.py -k transformer -v -rs
================================================ 2 passed, 21 deselected in 35.25s =================================================
pytest thunder/tests/distributed/test_fsdp.py -k transformer -v -rs
====================================== 4 passed, 1 skipped, 41 deselected in 75.30s (0:01:15) ======================================
pytest thunder/tests/test_transformer_engine_v2_executor.py -v -rs
======================================= 9 passed, 5 skipped, 1 xpassed, 44 warnings in 7.95s =======================================
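As a quick check of the "around 2 mins" figure, the wall-clock times reported by pytest above can be added up. The dictionary keys are just labels for the four runs.

```python
# Wall-clock times (seconds) reported by pytest in the four runs above.
timings = {
    "test_transformer_engine_executor.py": 4.16,
    "distributed/test_ddp.py -k transformer": 35.25,
    "distributed/test_fsdp.py -k transformer": 75.30,
    "test_transformer_engine_v2_executor.py": 7.95,
}

total_seconds = sum(timings.values())
print(f"total: {total_seconds:.2f}s ({total_seconds / 60:.1f} min)")
```

The four runs sum to 122.66 seconds, with the FSDP tests accounting for most of it.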