Megatron-LM
[BUG] Megatron Core example not working
Describe the bug
The provided example script run_simple_mcore_train_loop.py throws errors in Step 3, the GPT mock dataset setup utility.
To Reproduce
For simplicity, the example is run with a single GPU with tensor_model_parallel_size=1 and pipeline_model_parallel_size=1.
srun python -u run_simple_mcore_train_loop.py
Stack trace/logs
[rank0]: Traceback (most recent call last):
[rank0]: File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 115, in <module>
[rank0]: train_iterator = get_train_data_iterator()
[rank0]: File "/scratch/sd/u/user/Megatron-LM/examples/run_simple_mcore_train_loop.py", line 55, in get_train_data_iterator
[rank0]: config = GPTDatasetConfig(
[rank0]: File "<string>", line 18, in __init__
[rank0]: File "/scratch/sd/u/user/Megatron-LM/megatron/core/datasets/gpt_dataset.py", line 52, in __post_init__
[rank0]: super().__post_init__()
[rank0]: File "/scratch/sd/u/user/Megatron-LM/megatron/core/datasets/blended_megatron_dataset_config.py", line 87, in __post_init__
[rank0]: assert self.split is not None, "split must be provided in absence of blend_per_split"
[rank0]: AssertionError: split must be provided in absence of blend_per_split
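For context, the assertion originates in BlendedMegatronDatasetConfig.__post_init__, which requires a split whenever blend_per_split is absent. A minimal standalone sketch of that validation logic (the class below is a simplified stand-in for illustration, not the actual Megatron code; field names mirror the real config):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SketchDatasetConfig:
    """Simplified stand-in for BlendedMegatronDatasetConfig's split check."""
    sequence_length: int
    blend_per_split: Optional[List] = None
    split: Optional[str] = None

    def __post_init__(self):
        # Mirrors the assertion in blended_megatron_dataset_config.py:
        # without per-split blends, a global split string is mandatory.
        if self.blend_per_split is None:
            assert self.split is not None, (
                "split must be provided in absence of blend_per_split"
            )

# Omitting split reproduces the reported AssertionError;
# supplying one (e.g. split="1") passes this validation step.
```

This is why constructing GPTDatasetConfig in the example without a split argument fails immediately in __post_init__.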
Environment (please complete the following information):
- Megatron-LM commit ID: a5534c8
- PyTorch version: 2.4.0a0+gitd957c2d
- CUDA version: 12.2
- NCCL version: 2.19.4
Proposed fix
N/A
Additional context
After applying a temporary fix in get_train_data_iterator (passing split='1' to the GPT config), further errors are thrown when creating an object of class MockGPTDataset. Additionally, this GPT config refers to a dummy tokenizer that is missing.
Any assistance with resolving this issue would be appreciated, thank you!
I also encountered this problem
same
6c7bec6 fixes this issue partially. A split argument must be passed to GPTDatasetConfig (inherited from BlendedMegatronDatasetConfig) when blend is None for the example to work.
Marking as stale. No activity in 60 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
This issue has been resolved. The current version of run_simple_mcore_train_loop.py correctly:
- Uses BlendedMegatronDatasetBuilder with the split information provided directly ([1000, None, None])
- Includes the proper tokenizer instantiation with _NullTokenizer(vocab_size=_SEQUENCE_LENGTH)
- Properly initializes MockGPTDataset through the builder
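For readers on older commits, the sizes list [1000, None, None] asks the builder for 1000 training samples while skipping the validation and test splits. A minimal illustration of that convention, assuming build_mock_splits is a hypothetical helper written here for clarity, not the actual Megatron API:

```python
from typing import List, Optional, Tuple

def build_mock_splits(sizes: List[Optional[int]]) -> Tuple:
    """Hypothetical sketch: return (train, valid, test) sample lists,
    skipping any split whose requested size is None, mirroring the
    [1000, None, None] convention passed to BlendedMegatronDatasetBuilder."""
    names = ("train", "valid", "test")
    return tuple(
        [f"{name}_sample_{i}" for i in range(size)] if size is not None else None
        for name, size in zip(names, sizes)
    )

train, valid, test = build_mock_splits([1000, None, None])
# train holds 1000 mock samples; valid and test are skipped (None).
```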