Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Results: 124 Megatron-DeepSpeed issues

This adds an option to launch preprocessing from an HF dataset (loaded from an Arrow file for now, as that's the use case on JZ) rather than only from JSON Lines.

This is some of the code we wrote to debug tied-embedding synchronization issues, so I'm pushing it here in case it is needed down the road. Most likely...

Just noticed in the logs that `--skip-train-iteration-range` reports only a single range when there should be two. I currently have this in the config: ``` --skip-train-iteration-range 13251-14000 16651-19500 ``` But the...
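A minimal sketch (not the repo's actual parser) of how an `argparse` flag with `nargs="+"` and a range-parsing `type` keeps every space-separated range, which is the behavior the report above expects; the function name `parse_range` is hypothetical:

```python
import argparse

def parse_range(s):
    # "13251-14000" -> (13251, 14000), treated as an inclusive range
    start, end = map(int, s.split("-"))
    if start > end:
        raise argparse.ArgumentTypeError(f"bad range: {s}")
    return (start, end)

parser = argparse.ArgumentParser()
# nargs="+" collects every space-separated value into one list,
# so both ranges from the config should be retained
parser.add_argument("--skip-train-iteration-range",
                    type=parse_range, nargs="+")

args = parser.parse_args(
    "--skip-train-iteration-range 13251-14000 16651-19500".split())
print(args.skip_train_iteration_range)  # [(13251, 14000), (16651, 19500)]
```

If the logs show only one range, the bug is likely downstream of argument parsing, e.g. in how the ranges are merged or logged.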

Closes #124. I found that a few checks mentioned in `sanity-checks.md` were already being done in `parse_args`, like `NHIDDEN % NHEADS == 0` and `GLOBAL_BATCH_SIZE % (MICRO_BATCH_SIZE * DP_SIZE) == 0`, so...
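The two divisibility checks quoted above can be sketched as a standalone helper; `sanity_check` is a hypothetical name, not the function in `parse_args`:

```python
def sanity_check(nhidden, nheads, global_bs, micro_bs, dp_size):
    # Hidden size must split evenly across attention heads
    assert nhidden % nheads == 0, \
        "NHIDDEN must be divisible by NHEADS"
    # The global batch must be a whole number of micro-batches
    # per data-parallel rank
    assert global_bs % (micro_bs * dp_size) == 0, \
        "GLOBAL_BATCH_SIZE must be divisible by MICRO_BATCH_SIZE * DP_SIZE"

# A config that passes both checks
sanity_check(nhidden=4096, nheads=32, global_bs=512, micro_bs=4, dp_size=8)
```

Running the same checks with, say, `nhidden=4097` would raise an `AssertionError` before training starts, which is the point of doing them in argument parsing rather than at first use.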

Stella pointed out how they do consistency calculations/checks in NeoX: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/arguments.py It'd be good for someone to study what they did on top of the base Megatron-LM and replicate anything that...

Good First Issue
Good Second Issue

I am trying to run tests on the codebase. I am using a Docker image on an AWS p3.2xlarge (Tesla V100): ``` docker pull nvcr.io/nvidia/pytorch:21.10-py3 ``` Running `python -m pip...

@DanielHesslow has opened a PR: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/212. This allows us to evaluate Megatron-DeepSpeed models using the EAI harness directly in this repo, without needing to convert the models into HF format. The...

Good First Issue

This PR resolves #149 by implementing a `ModelInspector` class similar to transformers' [`DebugUnderflowOverflow`](https://huggingface.co/transformers/debugging.html). - Using PyTorch fwd/bwd hooks, log multiple things about each of the model's submodules and its args. 1. fwd/bwd...
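The hook-and-log pattern behind such an inspector can be shown in a pure-Python sketch; the real PR would register `torch` forward/backward hooks on each submodule, whereas here a hypothetical `ModelInspector` simply wraps callables and records the largest output magnitude per call (the kind of statistic used to spot under/overflow):

```python
class ModelInspector:
    """Sketch only: wraps a callable and logs, per call, the max
    absolute value of its outputs, keyed by a module name."""

    def __init__(self):
        self.records = []

    def hook(self, name, fn):
        def wrapped(*args, **kwargs):
            out = fn(*args, **kwargs)
            # Normalize scalar vs. tuple outputs, then record the
            # largest magnitude seen in this call
            vals = [out] if isinstance(out, (int, float)) else list(out)
            self.records.append((name, max(abs(v) for v in vals)))
            return out
        return wrapped

inspector = ModelInspector()
double = inspector.hook("double", lambda x: 2 * x)
double(3.0)
double(-8.0)
print(inspector.records)  # [('double', 6.0), ('double', 16.0)]
```

With real modules, a sudden jump in these per-submodule magnitudes between consecutive steps is exactly the signal `DebugUnderflowOverflow` looks for.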

This PR fixes #203 by using the `args` global variable holder to save and access model parameter counts during gigaflops counting. This is sensible given that the number of model...
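Why the parameter count matters for FLOPs counting: a common back-of-the-envelope estimate (not Megatron's exact counter, which accounts for sequence length, activation recomputation, etc.) is roughly 6 FLOPs per parameter per trained token:

```python
def approx_train_flops(n_params, n_tokens):
    # Rule of thumb: ~2 FLOPs/param/token for the forward pass
    # and ~4 for the backward pass, hence the factor of 6
    return 6 * n_params * n_tokens

# e.g. a 1.3B-parameter model trained on 300B tokens
print(f"{approx_train_flops(1.3e9, 300e9):.2e}")  # 2.34e+21
```

Since `n_params` is fixed for a given run, caching it once in `args` avoids recomputing it on every logging step.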

- Related to #209. Basically re-opening the PR, as it seems to pass locally but not in CI.