tginart

22 comments by tginart

Hi @HamidShojanazeri, I am also seeing this issue. I have tried both `export NCCL_ASYNC_ERROR_HANDLING=1` and `export TORCH_NCCL_ASYNC_ERROR_HANDLING=1`, but I still get the error: `torch.distributed.DistBackendError: [14] is setting up NCCL communicator...`
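For anyone else hitting this: these variables must be set in the environment of every rank before the launcher starts. A minimal sketch (the launcher invocation and script name below are placeholders, not from the original thread):

```shell
# Newer PyTorch builds read the TORCH_-prefixed name; older builds read the
# unprefixed one. Exporting both covers either case.
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_ASYNC_ERROR_HANDLING=1

# Then launch as usual, e.g.:
# torchrun --nproc_per_node=8 train.py
```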

Thank you for the interest and the question. The proof that uses the gamma correction is Lemma C.3. It's not that the cluster size is guaranteed, but it's that in...

Thank you for bringing this up. After looking back at my notes, I do think that the denominators you've circled are mistakes and should be as you've mentioned. Actually though,...

Y'all may find this script helpful:

```python
import argparse

import loralib as lora
import transformers
from tqdm import tqdm


def lora_process(model_name, max_seq_len, attn_impl, r_emb, r):
    print("Loading model configurations...")
    config = ...
```

> > except that it's applied in a somewhat nonstandard way in the fwd pass of the transformer module
>
> @tginart Can you say more about this? https://github.com/mosaicml/llm-foundry/blob/86864e90e0063651177837e831fe48e80618b969/llmfoundry/models/mpt/modeling_mpt.py#LL485C1-L487C1

@samhavens...

Hi! No, this did not. I suspect there is some kind of issue in the current Docker env with the StreamingDataset. For example, running this script:

```python
import numpy...
```

FYI that script is pulled from the Streaming docs: https://docs.mosaicml.com/projects/streaming/en/stable/getting_started/quick_start.html

@Paladiamors Any luck with getting Triton's flash attention set up? I've tried 3 different machines/GPU types and close to a half dozen different envs/images and can't get that package to...

Interesting! My issue was also on the A10. I'm using the `g5.12xlarge` instance type with the `Deep Learning AMI GPU PyTorch 1.13.1 (Ubuntu 20.04)` AMI. I tried both the mosaicml/pytorch...