Teng-xu

Results 6 issues of Teng-xu

*Issue #, if available:* *Description of changes:* *Testing done:* ## Merge Checklist _Put an `x` in the boxes that apply. You can also fill these out after creating the PR....

*Issue #, if available:* *Description of changes:* *Testing done:* ## Merge Checklist _Put an `x` in the boxes that apply. You can also fill these out after creating the PR....

*Issue #, if available:* *Description of changes:* Update GPT-J Model training example with Tensor Parallelism using SageMaker Model Parallel Library. Update testing scripts to enable latest features with smp. *Testing...

I was trying to install apex through a Dockerfile (Python 3.6, CUDA 11.1) via the following commands:

```
RUN git clone https://github.com/NVIDIA/apex && \
    cd apex && \
    pip install -v --no-cache-dir --global-option="--cpp_ext"...
```

### System Info

- `transformers` version: 4.37.1
- Platform: Linux-5.10.199-190.747.amzn2.x86_64-x86_64-with-glibc2.31
- Python version: 3.10.8
- Huggingface_hub version: 0.20.2
- Safetensors version: 0.3.3
- Accelerate version: 0.26.1
- Accelerate config: not...

**Describe the bug** During our training sessions utilizing Megatron's Mixture of Experts (MoE) layers, we observed a decline in performance occurring at specific steps, with this deterioration manifesting sporadically and...