Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Results: 128 Megatron-DeepSpeed issues, sorted by recently updated.

```
(gh_Megatron-DeepSpeed_yk) ub2004@ub2004-B85M-A0:~/nndev/Megatron-DeepSpeed_yk$ python3
Python 3.8.10 (default, Mar 13 2023, 10:26:41) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>>
(gh_Megatron-DeepSpeed_yk) ub2004@ub2004-B85M-A0:~/nndev/Megatron-DeepSpeed_yk$ pip...
```

Hi, I have used the DeepSpeed framework to train a GPT-117M model. When I evaluate model performance on WikiText-103, the result from tasks/eval_harness/evaluate.py differs from first converting the checkpoint to Megatron format and using tasks/main.py...

This adds pretraining using [UL2](https://arxiv.org/abs/2205.05131) for encoder-decoder, non-causal decoder-only, and causal decoder-only models. I have not yet run large-scale tests to see whether it yields the desired training improvements,...
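For readers unfamiliar with UL2, the core idea is a mixture of denoising objectives sampled per example. Below is a minimal, self-contained sketch of that sampling step; the denoiser names (R/X/S) follow the paper, but the span-length and corruption-rate values, the single-span simplification, and the function name `sample_denoiser_task` are illustrative assumptions, not code from this PR.

```python
import random

# Hedged sketch of UL2-style mixture-of-denoisers sampling.
# Assumptions: rates loosely follow the UL2 paper; real UL2 corrupts
# many spans per example, while this sketch uses one span for clarity.
DENOISERS = {
    "R": {"mean_span": 3, "corrupt_rate": 0.15},  # regular span corruption
    "X": {"mean_span": 32, "corrupt_rate": 0.5},  # extreme corruption
    "S": None,                                    # sequential (prefix-LM)
}

def sample_denoiser_task(tokens, rng=random):
    """Pick a denoiser uniformly and return (mode, inputs, targets)."""
    mode = rng.choice(list(DENOISERS))
    if mode == "S":
        # Prefix-LM: split at a random pivot; predict the suffix.
        pivot = rng.randrange(1, len(tokens))
        return mode, tokens[:pivot], tokens[pivot:]
    cfg = DENOISERS[mode]
    n_corrupt = max(1, int(len(tokens) * cfg["corrupt_rate"]))
    start = rng.randrange(0, len(tokens) - n_corrupt + 1)
    # Replace the corrupted span with a single sentinel token.
    inputs = tokens[:start] + ["<mask>"] + tokens[start + n_corrupt:]
    targets = tokens[start:start + n_corrupt]
    return mode, inputs, targets
```

The causal and non-causal decoder-only variants in the PR would consume the same (inputs, targets) pairs, differing in the attention mask applied over the prefix.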

Hello, when I run the script to train a GPT model, I get an assertion error: "Not sure how to proceed, we were given deepspeed configs in the deepspeed arguments and deepspeed." The script...

```
def _merge_zero_shards(param_base_path, state, tp_degree, slice_shape):
    slices = []
    for tp_index in range(tp_degree):
        prefix_path = os.path.join(param_base_path, str(tp_index), f"{state}")
        paths = sorted(list(glob.glob(f"{prefix_path}.0*")))
        #print(paths)
        shards = [torch.load(p) for p in paths]
        slice...
```

See https://arxiv.org/abs/2212.10554.

1. I trained a GPT-2 model with pipeline parallelism; the Flops Profiler enabled in the DeepSpeed config is useless, it outputs nothing. 2. So I added some code like this: ``` prof = None...

I used the following installation method, but received an error that has gone unresolved for several days: git clone https://github.com/NVIDIA/apex cd apex pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check ....

Is Megatron-DeepSpeed only targeting specific models such as GPT-2? Can it support parallel partitioning of relatively lightweight models such as CLIP?

Hi, I'm trying to continue pre-training bloom-560m on my own dataset on a single GPU. I modified [this script](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/examples/pretrain_gpt_single_node.sh) to fit my case. However, I cannot figure out how...