Megatron-LM

Ongoing research training transformer models at scale

Results: 294 Megatron-LM issues, sorted by recently updated

## Why? The loader import error is swallowed even when the root cause is not that loader_X.py is missing. For example, if I don't have `transformers` installed, it still printed loader_X not...
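A minimal sketch of the pattern being described, assuming the converter resolves loaders with `importlib`; the `load_loader` helper and module naming below are illustrative, not the repository's exact code. The point is to distinguish a missing loader module from a missing dependency imported inside the loader:

```python
import importlib
import sys

def load_loader(name):
    """Hypothetical loader resolution for tools/checkpoint plugins.

    Catching every ModuleNotFoundError and reporting "loader not found"
    hides the real cause when the loader module itself imports fine but one
    of its dependencies (e.g. `transformers`) is missing.
    """
    module_name = f"loader_{name}"
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        # Only report "loader not found" when the loader module itself is missing;
        # otherwise re-raise so the underlying missing dependency is visible.
        if e.name == module_name:
            sys.exit(f"Unable to load {module_name}. Exiting.")
        raise
```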


`--workers` is required, as in [preprocess_data.py](https://github.com/NVIDIA/Megatron-LM/blob/0052bf0de70b266d8648e2655da16f7eb2c9ca56/tools/preprocess_data.py#L223), but it is missing from the readme.
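For illustration, an invocation that passes `--workers` explicitly might look like the following; the input paths and tokenizer choice here are placeholders, not values taken from the readme:

```bash
# Example preprocessing run with the required --workers flag set.
python tools/preprocess_data.py \
    --input my_corpus.json \
    --output-prefix my_corpus \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 4
```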


I am trying to convert the weights for `vicuna-7b-v1.5` from Hugging Face Transformers (https://huggingface.co/lmsys/vicuna-7b-v1.5) to be used with Megatron-LM. I am using `tools/checkpoint/convert.py` to do the conversion. The command...
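As a rough, hypothetical shape of such a conversion command (the flag names, loader, and saver choices vary between Megatron-LM versions; check `python tools/checkpoint/convert.py --help` for the exact arguments, and the directories below are placeholders):

```bash
python tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --saver megatron \
    --load-dir /path/to/vicuna-7b-v1.5 \
    --save-dir /path/to/megatron_ckpt \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1
```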

**Describe the bug** The usage and description of `--loss-scale` are inconsistent. The argument is expected to be a positive power of 2, but ConstantGradScaler sets loss-scale to...
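For context, a simplified sketch of how a constant loss scaler uses that value (this is an illustrative class, not Megatron's ConstantGradScaler): the scale multiplies the loss before backward and its inverse unscales the gradients, which is why a positive power of two is preferred, since scaling and unscaling are then exact in floating point.

```python
import torch

class SimpleConstantGradScaler:
    """Illustrative constant loss scaler (not the repository's class)."""

    def __init__(self, scale: float):
        assert scale > 0.0
        self._scale = torch.tensor([scale], dtype=torch.float32)

    @property
    def scale(self):
        return self._scale

    @property
    def inv_scale(self):
        return 1.0 / self._scale

# Typical mixed-precision step: scale the loss before backward,
# then unscale the gradients before the optimizer update.
#   scaler = SimpleConstantGradScaler(4096.0)
#   (loss * scaler.scale).backward()
#   for p in model.parameters():
#       if p.grad is not None:
#           p.grad.mul_(scaler.inv_scale)
```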

In certain virtualized environments there is no shared storage. Both the source code and the data are stored (replicated) on each worker node's local storage. The code sections below only load data...
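A minimal sketch of the distinction being described, with illustrative function names rather than Megatron's actual code: with a shared filesystem it is enough for global rank 0 to build the dataset cache once, but without shared storage every node needs its own copy, so the build has to happen on the first rank of each node.

```python
import os
import torch

def build_index_cache(path):
    ...  # expensive one-time preprocessing that writes cache files next to `path`

def prepare_data_shared_fs(path):
    # One node writes, all nodes read the shared copy.
    if torch.distributed.get_rank() == 0:
        build_index_cache(path)
    torch.distributed.barrier()

def prepare_data_local_fs(path):
    # Every node writes its own local copy (local rank 0 per node).
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if local_rank == 0:
        build_index_cache(path)
    torch.distributed.barrier()
```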

**Describe the bug**

```python
def broadcast_params(self):
    """ Syncs parameters across all DP ranks. """
    for param in self.module.parameters():
        is_expert_parallel = not getattr(param, 'allreduce', True)
        if is_expert_parallel:
            torch.distributed.broadcast(
                param.data,
                src=torch.distributed.get_process_group_ranks(self.expert_data_parallel_group),
                group=self.expert_data_parallel_group,
                ...
```
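For context, `torch.distributed.broadcast` takes `src` as a single global rank, while `torch.distributed.get_process_group_ranks` returns the full list of ranks in a group. A sketch of the broadcast written against the group's first rank follows; this is an assumption about the intended behaviour, not a confirmed fix, and the standalone function is illustrative only:

```python
import torch

def broadcast_expert_params(module, expert_data_parallel_group):
    """Illustrative sketch: broadcast expert-parallel parameters from the
    first rank of the expert data-parallel group."""
    # get_process_group_ranks returns a list of global ranks; broadcast expects one.
    src_rank = torch.distributed.get_process_group_ranks(expert_data_parallel_group)[0]
    for param in module.parameters():
        is_expert_parallel = not getattr(param, 'allreduce', True)
        if is_expert_parallel:
            torch.distributed.broadcast(
                param.data,
                src=src_rank,
                group=expert_data_parallel_group,
            )
```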

`forward_backward_pipelining_with_interleaving` has a branch that enables `config.overlap_p2p_comm`; why doesn't `forward_backward_pipelining_without_interleaving` have one?
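For readers unfamiliar with the option, a simplified sketch of what overlapping p2p communication means in a pipeline schedule (an assumption-laden illustration, not Megatron's scheduler code): the send/recv for the neighbouring stage is issued asynchronously and only waited on after the current chunk's compute, instead of blocking before it.

```python
import torch

def step_with_overlap(compute_chunk, send_tensor, recv_tensor, prev_rank, next_rank):
    """Illustrative overlapped pipeline step using async batched p2p ops."""
    ops = [
        torch.distributed.P2POp(torch.distributed.isend, send_tensor, next_rank),
        torch.distributed.P2POp(torch.distributed.irecv, recv_tensor, prev_rank),
    ]
    reqs = torch.distributed.batch_isend_irecv(ops)  # communication starts here
    output = compute_chunk()                         # compute overlaps with the transfer
    for req in reqs:
        req.wait()                                   # synchronize before using recv_tensor
    return output
```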


**Describe the bug** When I try to run single-GPU T5 pretraining with the script `examples/pretrain_t5.sh`, it fails with the following error: > ModuleNotFoundError: No module named 'scaled_softmax_cuda' It seems that...
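`scaled_softmax_cuda` is one of the optional fused CUDA kernel extensions, so this error typically means the extension was never built in the current environment. A sketch of the usual guarded-import pattern for such an optional kernel follows; the extension's exact API is simplified here and the fallback function is a placeholder, not the repository's code path:

```python
# Try the fused kernel, fall back to plain PyTorch if the extension is absent.
try:
    import scaled_softmax_cuda  # built from Megatron's fused-kernel sources
    HAVE_FUSED_SOFTMAX = True
except ModuleNotFoundError:
    scaled_softmax_cuda = None
    HAVE_FUSED_SOFTMAX = False

def softmax(x, scale):
    if HAVE_FUSED_SOFTMAX:
        return scaled_softmax_cuda.forward(x, scale)  # fused CUDA path (simplified call)
    return (x * scale).softmax(dim=-1)                # plain PyTorch fallback
```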