Megatron-LM
[BUG] GPTDataset._build_document_sample_shuffle_indices does not build the indices on non-root nodes when not using NFS
Describe the bug
If the training data does not live on NFS but on node-specific storage, the current logic at https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/megatron/core/datasets/gpt_dataset.py#L346 skips building the indices on every rank except global rank 0, so the cache files are never written on the other nodes. Loading the document index at https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/megatron/core/datasets/gpt_dataset.py#L484 then fails because the file does not exist.
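For reference, the gate at that line (paraphrased here, not a verbatim copy of gpt_dataset.py) only lets global rank 0 enter the build branch:

# Paraphrase of the gate at the linked line: only global rank 0 ever enters the
# build branch, so with node-local storage the cache files end up on one node only.
if not path_to_cache or (
    not cache_hit
    and (not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0)
):
    ...  # build the document/sample/shuffle indices and save them under path_to_cache
# every other rank later calls numpy.load(...) on files that were never written to
# its local disk, which raises the FileNotFoundError shown below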
To Reproduce
Run multi-node training with the training data on node-local storage rather than NFS.
Expected behavior
Ideally there should be a flag indicating whether the data lives on a shared file system. If it does not, the indices need to be built on each node separately, along the lines of the sketch below.
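A minimal sketch of what that could look like, assuming a hypothetical shared_file_system flag on the dataset config (not an existing Megatron-LM option):

# Hypothetical sketch only: `config.shared_file_system` and `is_builder_rank` are
# assumed names, not existing BlendedMegatronDatasetConfig fields.
if config.shared_file_system:
    # shared storage: one rank builds the cache for everyone (current behavior)
    is_builder_rank = (
        not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0
    )
else:
    # node-local storage: local rank 0 on every node builds its own copy
    is_builder_rank = int(os.environ.get("LOCAL_RANK", "0")) == 0

if not path_to_cache or (not cache_hit and is_builder_rank):
    ...  # build and write the index files under path_to_cache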
Stack trace/logs
(worker6, rank=6, pid=8930, ip=10.42.3.242) File "/opt/megatron-lm/megatron/core/datasets/blended_megatron_dataset_builder.py", line 470, in build_generic_dataset
(worker6, rank=6, pid=8930, ip=10.42.3.242) dataset = cls(*args)
(worker6, rank=6, pid=8930, ip=10.42.3.242) File "/opt/megatron-lm/megatron/core/datasets/gpt_dataset.py", line 111, in __init__
(worker6, rank=6, pid=8930, ip=10.42.3.242) ) = self._build_document_sample_shuffle_indices()
(worker6, rank=6, pid=8930, ip=10.42.3.242) File "/opt/megatron-lm/megatron/core/datasets/gpt_dataset.py", line 474, in _build_document_sample_shuffle_indices
(worker6, rank=6, pid=8930, ip=10.42.3.242) document_index = numpy.load(path_to_document_index, allow_pickle=True, mmap_mode='r')
(worker6, rank=6, pid=8930, ip=10.42.3.242) File "/usr/local/lib/python3.10/dist-packages/numpy/lib/npyio.py", line 405, in load
(worker6, rank=6, pid=8930, ip=10.42.3.242) fid = stack.enter_context(open(os_fspath(file), "rb"))
(worker6, rank=6, pid=8930, ip=10.42.3.242) FileNotFoundError: [Errno 2] No such file or directory: '/wiki/mistral_7b_v0.3_training_data_text_document/cache/GPTDataset_indices/81e3d4d910e734899c56ceb4ba98b98c-GPTDataset-train-document_index.npy'
Environment (please complete the following information):
- Megatron-LM commit ID: not sure how to check; using NeMo and the nvcr.io/nvidia/nemo:24.05.01 container
- PyTorch version: 2.3.0a0+ebedce2
- CUDA version: 12.4
- NCCL version: 2.20.3
Proposed fix
My workaround is the following patch:
--- /opt/megatron-lm/megatron/core/datasets/gpt_dataset.py 2024-07-07 03:48:09.635073980 +0000
+++ /opt/megatron-lm/megatron/core/datasets/gpt_dataset.py.new 2024-07-07 03:48:07.383130640 +0000
@@ -8,6 +8,7 @@
 import numpy
 import torch
+import torch.distributed
 
 from megatron.core.datasets.blended_megatron_dataset_config import BlendedMegatronDatasetConfig
 from megatron.core.datasets.indexed_dataset import IndexedDataset
@@ -342,7 +343,7 @@
         if not path_to_cache or (
             not cache_hit
-            and (not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0)
+            and (not torch.distributed.is_initialized() or os.environ.get('LOCAL_RANK', '0') == '0')
         ):
             log_single_rank(
@@ -459,7 +460,9 @@
             )
             log_single_rank(logger, logging.INFO, f"> total number of epochs: {num_epochs}")
 
-            return document_index, sample_index, shuffle_index
+            # return document_index, sample_index, shuffle_index
+
+            torch.distributed.barrier()
 
         log_single_rank(
             logger, logging.INFO, f"Load the {type(self).__name__} {self.index_split.name} indices"
This works around the issue, but it does not offer the flexibility of an explicit flag.
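With the patch, local rank 0 on every node rebuilds the indices on its own storage instead of only global rank 0, and the added torch.distributed.barrier() keeps the remaining ranks from loading the cache files before they are written. Note that it assumes the launcher exports LOCAL_RANK (torchrun does) and that torch.distributed is initialized by the time the barrier is reached.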