Mayank Mishra comments

Results: 187 comments of Mayank Mishra

[new feature] train finishing ETA!

From what I see, this has a lot of intricacies: ramp_up_batch_size, curriculum_learning (changing sequence length), so my suggestion is to compute it as time_per_iter * (total_tokens - consumed_tokens) / 86400. However, I...
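A minimal sketch of the token-based ETA suggested above. Read literally, the formula multiplies a per-iteration time by a token count, so this sketch assumes one extra division by the tokens processed per iteration (for example, current global batch size times current sequence length). That division, the function name `eta_days`, and the parameter `tokens_per_iter` are assumptions for illustration and are not taken from the repository.

```python
def eta_days(time_per_iter_s: float,
             total_tokens: int,
             consumed_tokens: int,
             tokens_per_iter: float) -> float:
    """Estimate remaining training time in days.

    tokens_per_iter is an assumed input: with batch-size ramp-up or a
    sequence-length curriculum it changes over time, so pass the current
    value and treat the result as a rough estimate.
    """
    if tokens_per_iter <= 0:
        raise ValueError("tokens_per_iter must be positive")
    # Never report a negative ETA once the token budget is exhausted.
    remaining_tokens = max(total_tokens - consumed_tokens, 0)
    remaining_iters = remaining_tokens / tokens_per_iter
    seconds_per_day = 86400
    return time_per_iter_s * remaining_iters / seconds_per_day


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(eta_days(time_per_iter_s=10.0,
                   total_tokens=10_000_000,
                   consumed_tokens=1_360_000,
                   tokens_per_iter=1000))
```

Counting in tokens rather than iterations keeps the estimate meaningful when the batch size or sequence length changes mid-run, since the total token budget stays fixed while the number of remaining iterations does not.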

how to convert huggingface model to megatron-deepspeed?

This is not possible. To download DS checkpoints refer to this issue: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/319

how to convert huggingface model to megatron-deepspeed?

I don't understand the issue. Do you just need to run inference? If that is the case, DS-inference is compatible with all Hugging Face models.

how to convert huggingface model to megatron-deepspeed?

@AnShengqiang It's non-trivial to convert models for training. People are actively exploring this as far as I know. This repository saves something called a universal checkpoint which can be converted...

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

Hmm, it's not working for me even within a single node with batch size = 1, 8x A100 80GB. Same CUDA illegal memory access error.

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

@pohunghuang-nctu nothing like that in my logs. This is the full trace: ``` [2022-07-26 11:41:08,472] [WARNING] [runner.py:159:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2022-07-26...

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 Any pointers?

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

> > I am not sure why I am getting the same error ^^ for batch size = 1. @pai4451 > > Any pointers? > > What is your CUDA...

Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

> @mayank31398 Perhaps try the `ds-inference/bloom-fix` branch of deepspeed? I'll try this today. Thanks.

[Tensorboard] Log text prediction in evaluation

@KMFODA @thomasw21, https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328 already added the ability to benchmark the system, an interactive CLI, and a generation server. Testing a few things. Will try to get this merged this week.