Parth Mannan
PyTorch no longer supports `torch._six`; it has been removed. Refer to https://github.com/pytorch/pytorch/pull/94709. DeepSpeed still depends on it, for example in `runtime/utils.py`:

```
from torch._six import inf...
```
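A minimal compatibility shim for code that still imports from `torch._six` could look like the sketch below. This is an assumption about a reasonable workaround, not DeepSpeed's actual patch: `torch._six.inf` was simply the float infinity, so `math.inf` is a drop-in fallback when the module is gone.

```python
# Sketch of a fallback for the removed torch._six module (assumption,
# not DeepSpeed's actual fix). torch._six.inf was Python's float
# infinity, so math.inf is an equivalent replacement.
try:
    from torch._six import inf  # works only on old PyTorch versions
except ImportError:
    from math import inf

# inf behaves as expected: it compares greater than any finite float.
print(inf > 1e308)
```

On recent PyTorch versions, `torch.inf` is also available as a direct replacement.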
## 🚀 Feature

The feature request is to add decision-making capabilities to the nvFuser executor, allowing it to reject/pass on certain op executions where other backends/executors...
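The reject/pass-on behavior the request describes could be sketched roughly as follows. The `Executor`/`can_execute`/`dispatch` names are hypothetical illustrations, not Thunder's or nvFuser's actual API: the point is only that an executor declines an op and dispatch falls through to the next backend.

```python
# Hypothetical sketch (not Thunder's API): an executor exposes a
# predicate so it can decline ops, letting another backend claim them.
class Executor:
    def __init__(self, name, supported):
        self.name = name
        self.supported = set(supported)

    def can_execute(self, op):
        # Reject/pass on ops this backend does not want to run.
        return op in self.supported

def dispatch(op, executors):
    # The first executor that accepts the op wins; others pass on it.
    for ex in executors:
        if ex.can_execute(op):
            return ex.name
    return "fallback"

nvfuser = Executor("nvfuser", {"add", "mul"})
eager = Executor("torch", {"add", "mul", "embedding"})
print(dispatch("embedding", [nvfuser, eager]))  # → torch
```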
## 🐛 Bug

Running LLaMa2 13B with FSDP ZeRO2 on 8xH100:

```
torchrun --nproc_per_node=8 --nnodes=1 benchmark_litgpt.py --model_name Llama-2-13b-hf --compile thunder_cudnn --distributed_mode fsdp --shard_mode zero2 --bucketing_mode none --micro_batch_size 1 --global_batch_size 8...
```
## 🚀 Feature

An environment variable that dumps the various Thunder-provided debug traces to a log file. This could have variable levels, like `export THUNDER_DEBUG=`:

```
0/'' :...
```
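Reading such a variable could be sketched as below. This is a hypothetical illustration of the requested behavior: the `THUNDER_DEBUG` name comes from the request itself, but the level parsing and defaults are assumptions.

```python
import os

# Hypothetical sketch of parsing the requested THUNDER_DEBUG variable
# (the level semantics and default are assumptions, not Thunder's API).
def debug_level(default=0):
    raw = os.environ.get("THUNDER_DEBUG", "")
    if raw == "":
        return default  # unset or '' means no trace dumping
    try:
        return int(raw)
    except ValueError:
        return default  # ignore malformed values rather than crash

os.environ["THUNDER_DEBUG"] = "2"
print(debug_level())  # → 2
```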
## 🐛 Bug

This is a lengthy issue/post detailing my observations on our distributed and bucketing performance. Some of these are actionable items and some are just observations to be...
## 🚀 Feature Request

Currently we have computation traces with the generated tensor shapes as part of comments next to the computation, like:

```
t908 = torch.nn.functional.linear(t907, t19, t17) #...
```
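The comment style quoted above could be produced by a helper roughly like this. The `annotate` name and comment format are hypothetical illustrations, not the trace printer's actual implementation:

```python
# Hypothetical sketch of attaching a shape comment to a trace line
# (the helper name and comment format are assumptions).
def annotate(line, shape):
    return f"{line}  # shape: {list(shape)}"

print(annotate("t908 = torch.nn.functional.linear(t907, t19, t17)", (2, 3)))
```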
# What does this PR do?

Added a PR for main here: https://github.com/NVIDIA/Megatron-LM/pull/2282. Design document discussed in the MCore sync meeting: https://docs.google.com/document/d/1MnIPQ_VbpDNp-adtvcEv-SYx6A8rtt3-fDdxbcdrmk0/edit?usp=sharing. The first issue this MR is trying...