Saaketh Narayan
### Describe the bug

I'm using CUDA-aware OpenMPI that uses UCX (from one of NVIDIA's [PyTorch images](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-10.html), which has UCX installed as part of HPC-X) to perform collectives between GPUs....
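For context, a minimal sketch of the kind of GPU-to-GPU collective involved, assuming mpi4py >= 3.1 (which accepts buffers exposing `__cuda_array_interface__`) built against the same CUDA-aware Open MPI/UCX stack as the container:

```python
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One GPU per rank; the buffer lives in device memory.
torch.cuda.set_device(rank % torch.cuda.device_count())
buf = torch.ones(1024, device="cuda")

# With CUDA-aware MPI, the device pointer is handed straight to UCX,
# with no staging copy through host memory.
comm.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
```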
# What does this PR do?

This decouples max_duration from t_max in LR schedulers.

# What issue(s) does this change relate to?

# Before submitting

- [ ] Have you...
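A hedged sketch of what the decoupling allows, assuming Composer's public `CosineAnnealingScheduler`/`Trainer` API (`model`, `train_dataloader`, and `optimizer` are placeholders defined elsewhere):

```python
from composer import Trainer
from composer.optim import CosineAnnealingScheduler

# Anneal over the first half of training only, then hold the final LR.
scheduler = CosineAnnealingScheduler(t_max="0.5dur")

trainer = Trainer(
    model=model,                        # placeholder ComposerModel
    train_dataloader=train_dataloader,  # placeholder dataloader
    optimizers=optimizer,               # placeholder optimizer
    schedulers=scheduler,
    max_duration="10ep",  # training length, now independent of t_max
)
```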
Some custom FC layers will need custom kwargs. This PR enables that by changing `fc_type` from `str` to `Union[str, Dict]`, and converting it to a dict thereafter. Default configs have also...
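A rough sketch of the normalization described above; the exact keys used in the real configs are my assumption (`"name"` here is illustrative):

```python
from typing import Any, Dict, Union

def resolve_fc_type(fc_type: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Accept either a bare string or a dict carrying custom kwargs."""
    if isinstance(fc_type, str):
        # Wrap bare strings so downstream code only sees one config shape.
        return {"name": fc_type}
    return dict(fc_type)  # already a dict: layer name plus its custom kwargs

assert resolve_fc_type("torch") == {"name": "torch"}
assert resolve_fc_type({"name": "te", "bias": False})["bias"] is False
```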
Self-descriptive.
Hey, I'm using the `te_gemm` function defined in the PyTorch extensions [here](https://github.com/cli99/TransformerEngine/blob/6b21f606f2459d49c2113d69236d68d334edeb4c/transformer_engine/pytorch/csrc/extensions/gemm.cu#L10), and I'm trying to apply a scaling factor to the output. My gemm inputs are in fp8e4m3 and...
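This isn't the `te_gemm` API itself, just the numerical effect being asked about, written in plain PyTorch for reference (the shapes and the fp16 dtype are stand-ins for the fp8 case):

```python
import torch

a = torch.randn(128, 64, dtype=torch.float16, device="cuda")
b = torch.randn(64, 256, dtype=torch.float16, device="cuda")
scale = 0.5  # the scaling factor to fold into the GEMM output

# D = scale * (A @ B): the scale is applied to the accumulated result,
# not to the (already quantized) inputs.
out = torch.matmul(a, b) * scale
```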
# What does this PR do?

Marked as draft since it depends on #3434.

Disables tensor parallelism when the `tensor_parallelism_degree` is 1. This should be a no-op and any TP...
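A hedged sketch of the guard being described (the helper and config key names here are illustrative, not Composer's actual internals):

```python
from typing import Any, Dict, Optional

def maybe_tp_config(parallelism_config: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the TP sub-config, or None when TP should be skipped."""
    tp = parallelism_config.get("tp")
    if tp is None or tp.get("tensor_parallelism_degree", 1) == 1:
        # Degree 1 means every rank already holds the full weights, so
        # applying TP would only add wrapper overhead: treat it as no TP.
        return None
    return tp
```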
# What does this PR do?

This fixes a bug where if the TP configuration (specified through `parallelism_config['tp']`) was passed in as a dict, it would not be correctly processed...
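The shape of the fix, sketched with an illustrative dataclass (`TPConfig` and its field are assumptions, not the real class):

```python
from dataclasses import dataclass
from typing import Any, Dict, Union

@dataclass
class TPConfig:
    tensor_parallel_degree: int = 1

def coerce_tp(tp: Union[Dict[str, Any], TPConfig]) -> TPConfig:
    # Previously a plain dict fell through untouched and later attribute
    # accesses on it failed; coerce it into the expected config up front.
    return TPConfig(**tp) if isinstance(tp, dict) else tp
```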
# What does this PR do?

Resubmission of https://github.com/mosaicml/composer/pull/3394 -- using FA's CE Loss results in lower peak reserved memory usage and higher throughput. We are not adding flash attention...
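Since flash attention isn't being added as a required dependency, the opt-in presumably looks something like this try/except import pattern (a sketch, not the PR's actual code):

```python
import torch.nn as nn

try:
    # flash-attn's fused CE kernel lowers peak reserved memory and is faster.
    from flash_attn.losses.cross_entropy import CrossEntropyLoss
except ImportError:
    # Fall back to stock PyTorch when flash-attn isn't installed.
    CrossEntropyLoss = nn.CrossEntropyLoss  # type: ignore[assignment]

loss_fn = CrossEntropyLoss(ignore_index=-100)
```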
# What does this PR do?

Bumps peft to a minimum of 0.12.0, since it contains a crucial memory-saving fix when saving adapters.

# What issue(s) does this change relate to?

#...