
[BUG] Megatron Core example not working

schheda1 opened this issue · 4 comments

Describe the bug
The provided example script run_simple_mcore_train_loop.py throws errors in Step 3, the GPT mock dataset setup utility.

To Reproduce
For simplicity, the example is run on a single GPU with tensor_model_parallel_size=1 and pipeline_model_parallel_size=1:

srun python -u run_simple_mcore_train_loop.py

Stack trace/logs

[rank0]: Traceback (most recent call last):
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 115, in <module>
[rank0]:     train_iterator = get_train_data_iterator()
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 55, in get_train_data_iterator
[rank0]:     config = GPTDatasetConfig(
[rank0]:   File "<string>", line 18, in __init__
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/datasets/gpt_dataset.py", line 52, in __post_init__
[rank0]:     super().__post_init__()
[rank0]:   File "/scratch/sd/u/user/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_config.py", line 87, in __post_init__
[rank0]:     assert self.split is not None, "split must be provided in absence of blend_per_split"
[rank0]: AssertionError: split must be provided in absence of blend_per_split
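To illustrate why the assertion fires, here is a minimal standalone sketch (not the actual Megatron-LM code) that mimics the validation in BlendedMegatronDatasetConfig.__post_init__: when no per-split blend is supplied, a split string is required. The field names beyond split and blend_per_split are hypothetical simplifications.

```python
# Standalone sketch mimicking the validation that raises the AssertionError above.
# This is NOT the real BlendedMegatronDatasetConfig, only an illustration.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetConfigSketch:
    random_seed: int                         # hypothetical field for illustration
    split: Optional[str] = None              # e.g. "990,9,1" for train/valid/test
    blend_per_split: Optional[List] = None   # alternative way to define splits

    def __post_init__(self):
        if self.blend_per_split is None:
            # Mirrors the assertion shown in the traceback above.
            assert self.split is not None, (
                "split must be provided in absence of blend_per_split"
            )


# Reproduces the failure mode: neither split nor blend_per_split given.
try:
    DatasetConfigSketch(random_seed=0)
except AssertionError as exc:
    print(exc)  # split must be provided in absence of blend_per_split

# Succeeds once a split is supplied.
config = DatasetConfigSketch(random_seed=0, split="990,9,1")
```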

Environment (please complete the following information):

  • Megatron-LM commit ID: a5534c8
  • PyTorch version: 2.4.0a0+gitd957c2d
  • CUDA version: 12.2
  • NCCL version: 2.19.4

Proposed fix
N/A

Additional context
After applying a temporary fix in get_train_data_iterator (passing split='1' to the GPT config), further errors are thrown when creating an object of class MockGPTDataset. Additionally, this GPT config refers to a dummy tokenizer that is missing.

Any assistance with resolving this issue would be appreciated, thank you!

schheda1 · Jun 03 '24 20:06

I'm also running into this problem.

Qinghao-Hu · Jun 04 '24 07:06

same

windprak · Jun 14 '24 14:06

Commit 6c7bec6 partially fixes this issue. A split argument must still be passed to GPTDatasetConfig (inherited from BlendedMegatronDatasetConfig) when blend is None to make the example work.
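For reference, the workaround amounts to one extra keyword argument in get_train_data_iterator. The surrounding arguments are elided, and the value "990,9,1" is only an illustrative train/valid/test ratio, not a prescribed one:

```diff
 def get_train_data_iterator():
     config = GPTDatasetConfig(
         ...,
+        split="990,9,1",  # illustrative ratio; required when blend is None
     )
```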

schheda1 · Jun 18 '24 16:06

Marking as stale. No activity in 60 days.

github-actions[bot] · Aug 17 '24 18:08

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Aug 03 '25 02:08

This issue has been resolved. The current version of run_simple_mcore_train_loop.py correctly:

  1. Uses BlendedMegatronDatasetBuilder with the split information provided directly ([1000, None, None])
  2. Includes the proper tokenizer instantiation with _NullTokenizer(vocab_size=_SEQUENCE_LENGTH)
  3. Properly initializes MockGPTDataset through the builder

sbhavani · Oct 12 '25 19:10