Parth Mannan comments

Results: 22 comments by Parth Mannan

OOM errors for Gemma-7, pythia-12b, Llama-2-13b-hf and Nous-Hermes-13b with FSDP zero3 and 2x8 H100

I am not sure if that is what happened here, but I do see an nvFuser failure pop up in the OOM errors. It might not directly be an nvFuser issue, but...

OOM errors for Gemma-7, pythia-12b, Llama-2-13b-hf and Nous-Hermes-13b with FSDP zero3 and 2x8 H100

Yeah, it is likely the nvFuser messages were just printed because the OOM happened during execution. I have seen that before.

[Feature request] Optional debugging option to get trace with information on tensor strides along with tensor shapes

That sounds pretty useful and should satisfy the requirement. Would calling this transformation generate a full computation trace for every TensorProxy result with the required tensor information? And I am...

Hybrid Data x Context Parallelism Feature

> Hi @parthmannan, could you also start a main PR?

Added PR for main here - https://github.com/NVIDIA/Megatron-LM/pull/2282
Will resolve conflicts shortly.

Hybrid Data x Context Parallelism Feature

/ok to test https://github.com/NVIDIA/Megatron-LM/pull/2054/commits/e2a32cb54edf51e9cb000b7e8ab4b55e58e7d846

Hybrid Data x Context Parallelism Feature

/ok to test https://github.com/NVIDIA/Megatron-LM/pull/2054/commits/d53c323bcfb6f088de4ad919adeccf615737f75a

Hybrid Data x Context Parallelism Feature

/ok to test https://github.com/NVIDIA/Megatron-LM/pull/2054/commits/70b9758cefdf016eb30559693f9dfc6ad4a8e246

Hybrid Data x Context Parallelism Feature

/ok to test https://github.com/NVIDIA/Megatron-LM/pull/2054/commits/9387269bf9a641293cbccd68bbbe4f1db874453d

Hybrid Data x Context Parallelism Feature

/ok to test https://github.com/NVIDIA/Megatron-LM/pull/2054/commits/2bde6c85a99988ad4c3a9d37e633908dfb3e8323

Hybrid Data x Context Parallelism Feature

/ok to test https://github.com/NVIDIA/Megatron-LM/pull/2054/commits/604ddd207fddb3f53adffb3e1190d94789a6e8b3