
[ci]: We should add a CI flow with TransformerEngine installed so that we can run the relevant tests.

Open · kshitij12345 opened this issue 1 year ago • 11 comments

We don't have a CI flow for testing the TransformerEngine (TE) executor.

It would be great to have a CI flow with TE installed so that we can run the relevant tests and catch breakages early. It could also be enabled only before merge, or triggered with a GitHub comment.

NOTE: TE needs to be built from source at the moment. (@xwang233 knows how to set this up in Docker).
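For illustration, the relevant tests could gate on TE being importable, so a CI image without TE simply skips them; a minimal pytest sketch (the import gate shown here is an assumption about how the tests might be guarded, not necessarily how the existing TE tests do it):

```python
import pytest

# Skip the whole module when TransformerEngine isn't installed in the CI image.
te = pytest.importorskip("transformer_engine")


def test_te_importable():
    # Placeholder check; real tests would exercise the TE executor.
    assert te is not None
```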

cc @borda

kshitij12345 · Apr 16 '24

@kshitij12345 could you please share a reference to TE and how it needs to be installed?

Borda · Apr 16 '24

Here are the instructions: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html#installation-from-source

IvanYashchuk · Apr 16 '24

ok, will check it later this week

Borda · Apr 16 '24

As highlighted by @carmocca (thanks!), our CI runs on 3090s, which don't have FP8 support. We should pursue this once the CI has GPUs with compute capability 8.9 or higher.

GPU compute capability reference: https://developer.nvidia.com/cuda-gpus
TE check for FP8 support: https://github.com/NVIDIA/TransformerEngine/blob/f85553ea369da15fd726ab279818e415be48a228/transformer_engine/pytorch/fp8.py#L23-L34
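For context, the core of the linked check is the device's compute capability; a rough equivalent gate in PyTorch could look like this (a sketch only; TE's actual check may consider more than this):

```python
import torch


def fp8_supported(device_index: int = 0) -> bool:
    """FP8 needs Ada (SM 8.9) or Hopper (SM 9.0) and newer GPUs."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) >= (8, 9)


# A 3090 reports SM 8.6, so this returns False on the current CI GPUs.
print(fp8_supported())
```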

kshitij12345 · Apr 25 '24

Cc @t-vi @lantiga. As far as I know, there is a plan to change the GPU used for CI.

Borda · Apr 25 '24

#209 for reference

lantiga · May 30 '24

TransformerEngine is now available on PyPI (https://pypi.org/project/transformer-engine/): pip install transformer_engine[pytorch]
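As a quick smoke test of the installed wheel on an FP8-capable GPU, something along these lines should run (a sketch using TE's PyTorch API; assumes a CUDA device with SM 8.9+):

```python
import torch
import transformer_engine.pytorch as te

# One FP8 forward pass through TE's Linear layer.
layer = te.Linear(768, 768)
x = torch.randn(16, 768, device="cuda")
with te.fp8_autocast(enabled=True):
    y = layer(x)
print(y.shape)  # torch.Size([16, 768])
```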

IvanYashchuk · Aug 16 '24

I wanted to check whether upgrading the GPUs in CI is still planned (ref). This would let us run the TE tests in CI and avoid silent regressions.

Examples: https://github.com/Lightning-AI/lightning-thunder/issues/1624 and https://github.com/Lightning-AI/lightning-thunder/pull/1626

kshitij12345 · Jan 09 '25

Another example: https://github.com/Lightning-AI/lightning-thunder/pull/1690

kshitij12345 · Jan 25 '25

Yes, this is absolutely desirable, and thank you for trying to keep up with the TE things manually. The most likely scenario, to my mind, would be to run the TE tests on an Ada GPU via the Lightning SDK. (@lantiga)

t-vi · Jan 26 '25

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Apr 16 '25

Any update on this?

cc: @t-vi @lantiga

kshitij12345 · Jun 11 '25

Here are the test timings on an H100 (around 2 minutes in total):

pytest thunder/tests/test_transformer_engine_executor.py -v

======================================= 8 passed, 2 skipped, 1 xfailed, 42 warnings in 4.16s =======================================


pytest thunder/tests/distributed/test_ddp.py -k transformer -v -rs

================================================ 2 passed, 21 deselected in 35.25s =================================================


pytest thunder/tests/distributed/test_fsdp.py -k transformer -v -rs
====================================== 4 passed, 1 skipped, 41 deselected in 75.30s (0:01:15) ======================================


pytest thunder/tests/test_transformer_engine_v2_executor.py -v -rs
======================================= 9 passed, 5 skipped, 1 xpassed, 44 warnings in 7.95s =======================================
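If these selections become a dedicated CI step, one simple option is a driver that runs them in sequence and fails if any selection fails; a sketch (the paths and flags are the ones above, the driver itself is hypothetical):

```python
import subprocess
import sys

# TE-related test selections from the timings above.
selections = [
    ["thunder/tests/test_transformer_engine_executor.py", "-v"],
    ["thunder/tests/distributed/test_ddp.py", "-k", "transformer", "-v", "-rs"],
    ["thunder/tests/distributed/test_fsdp.py", "-k", "transformer", "-v", "-rs"],
    ["thunder/tests/test_transformer_engine_v2_executor.py", "-v", "-rs"],
]

exit_code = 0
for args in selections:
    # Run each selection in its own pytest process and remember any failure.
    exit_code |= subprocess.call(["pytest", *args])
sys.exit(exit_code)
```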

kshitij12345 · Jun 16 '25