Vitaliy Chiley
This is awesome! When will it be merged into master / a release?
`activation_checkpointing_reentrant: false` act ckpt error without amp_fp8:
```
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/llm-foundry/scripts/train/train.py:254 in <module>                                          │
│                                                                                                  │
│   251 │   │   yaml_cfg = om.load(f)                                                              │
│   252 ...
```
act ckpt error with amp_fp8 (with `activation_checkpointing_reentrant: false` or `true`):
```
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /mnt/llm-foundry/scripts/train/train.py:260 in <module>                                          │
│                                                                                                  │
│   257 │   │   yaml_cfg = om.load(f)...
```
Note: [transformer_engine has its own ckpt util](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/pytorch.html?highlight=checkpoint#transformer_engine.pytorch.checkpoint); it's unclear whether that util needs to be integrated into Composer for fp8 to work with act ckpt.
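For reference, a minimal sketch of what calling that TE util directly could look like (untested; the positional signature, the `TransformerLayer` config, and the tensor shapes are assumptions based on the linked docs, not Composer's integration):

```python
# Hypothetical sketch: checkpointing a TE block with TE's own util instead of
# torch.utils.checkpoint. Signature assumed from the linked TE docs:
#   checkpoint(function, distribute_saved_activations, get_rng_state_tracker, tp_group, *args)
import torch
import transformer_engine.pytorch as te

block = te.TransformerLayer(
    hidden_size=512, ffn_hidden_size=2048, num_attention_heads=8
).cuda()
# Default TE layout is (seq, batch, hidden)
hidden_states = torch.randn(128, 2, 512, device="cuda", requires_grad=True)

out = te.checkpoint(
    block,          # module whose forward is recomputed during backward
    False,          # distribute_saved_activations: no sequence-parallel split
    None,           # get_rng_state_tracker: only needed for TP dropout determinism
    None,           # tp_group: no tensor parallelism in this sketch
    hidden_states,  # *args forwarded to the block's forward
)
out.sum().backward()
```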
[TE @ main requires flash-attn==1.0.6](https://github.com/NVIDIA/TransformerEngine/blob/main/setup.py#L286), and flash-attn==1.0.6 has [this build issue](https://github.com/HazyResearch/flash-attention/issues/246). Solution: add `--no-build-isolation` to the pip install, i.e. `pip install flash-attn==1.0.6 --no-build-isolation`.
Installing TE from main (as suggested by @abhi-mosaic) makes our integrated act ckpt work, so there's no need to integrate [TE act ckpt](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/pytorch.html?highlight=checkpoint#transformer_engine.pytorch.checkpoint). `activation_checkpointing_reentrant: false` is still broken, though.
[plots: CE per param, TFLOPS per param]
Note: the 3B model uses act ckpt, so its model TFLOPS...
Note:
```
Traceback (most recent call last):
  File "<string>", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.bfloat16, torch.bfloat16, torch.bfloat16, None, torch.bfloat16, torch.float32, torch.bfloat16, torch.bfloat16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', ...
```
Standard I1k (ImageNet-1k) training with a 224 img size input:
1. resizes the image to 256, then
2. crops out a 224-resolution img, where 224/256 is the crop ratio.
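As a concrete sketch, that resize-then-crop recipe maps onto torchvision like this (a minimal illustration, not any specific repo's pipeline):

```python
# Standard I1k 224-input preprocessing: resize so the short side is 256,
# then take a 224x224 crop (224/256 = 0.875 crop ratio).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),       # short side -> 256
    transforms.CenterCrop(224),   # 224x224 crop out of the 256 image
    transforms.ToTensor(),
])
```

(Training pipelines typically randomize the crop, e.g. `RandomResizedCrop(224)`; the deterministic center-crop form above is the eval-style version of the same ratio.)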
It looks like you didn't install all the requirements (`pip install .[gpu]` from the top-level dir).