Train LM with a large file
I am trying to train on ~5.6 GB of training data with ~700 MB of validation data, using the command below:
python /workspace/data/NeMo/examples/nlp/language_modeling/bert_pretraining.py \
    --config-name=/workspace/data/NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml \
    model.train_ds.data_file="/workspace/data/NeMo/lm/data/public-data/train.txt" \
    model.validation_ds.data_file="/workspace/data/NeMo/lm/data/public-data/val.txt" \
    model.train_ds.batch_size=128 \
    model.optim.lr=5e-5 \
    trainer.max_epochs=1 \
    trainer.gpus=1
Then it fails with this error:
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
[NeMo I 2022-08-09 02:50:20 exp_manager:216] Experiments will be logged at /workspace/data/NeMo/public-data-lm/nemo_experiments/PretrainingBERTFromText/2022-08-09_02-50-20
[NeMo I 2022-08-09 02:50:20 exp_manager:563] TensorboardLogger has been set up
0%| | 0/1 [00:48<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/data/bew/NeMo/examples/nlp/language_modeling/bert_pretraining.py", line 31, in main
    bert_model = BERTLMModel(cfg.model, trainer=trainer)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 63, in __init__
    super().__init__(cfg=cfg, trainer=trainer)
  File "/opt/conda/lib/python3.6/site-packages/nemo/core/classes/modelPT.py", line 127, in __init__
    self.setup_training_data(self._cfg.train_ds)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 192, in setup_training_data
    else self._setup_dataloader(train_data_config)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 235, in _setup_dataloader
    short_seq_prob=cfg.short_seq_prob,
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/data/language_modeling/lm_bert_dataset.py", line 90, in __init__
    sentence_indices[filename] = array.array("I", newline_indices)
OverflowError: unsigned int is greater than maximum
Training works fine with a smaller dataset, but with this large file I get the error above.
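For context on what is overflowing: line 90 of lm_bert_dataset.py stores the newline indices of the training file in an array.array with typecode "I", which is a C unsigned int (32 bits on most platforms, maximum 2**32 - 1, about 4.29 billion). Assuming those indices are byte offsets into the file, any offset past roughly 4.29 GB in a ~5.6 GB file exceeds that maximum, which matches the OverflowError in the traceback. A minimal reproduction of the failing call, outside NeMo:

import array

# "I" is C unsigned int, 32 bits on most platforms, so the largest storable value is 2**32 - 1.
array.array("I", [2**32 - 1])          # fits

try:
    array.array("I", [5_600_000_000])  # a byte offset deep inside a ~5.6 GB file
except OverflowError as e:
    print(e)                           # "unsigned int is greater than maximum"

# The 64-bit typecode "Q" (unsigned long long) holds such offsets without error.
array.array("Q", [5_600_000_000])

Possible stop-gap workarounds on this old version (untested here) would be splitting the corpus into files smaller than 4 GiB each, or locally patching that typecode to "Q".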
Environment details
- Ubuntu
- PyTorch 1.8.0a0+17f8c32
- PyTorch Lightning 1.2.10
- Python 3.6.10
- DGX V100 32 GB
- NeMo 1.0.0rc1
Hi, it looks like you are using an old version of NeMo. Could you try pretraining BERT with our latest release, 1.10.0, using our new NeMo Megatron BERT training script: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_bert_pretraining.py
The data preprocessing script is here: https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py
NeMo Megatron BERT can be scaled to large dataset sizes and large model sizes as well.
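For anyone following up, this is roughly what that preprocessing step looks like. The script expects the corpus as loose JSON (one {"text": "..."} document per line), so the plain-text train/val files would first need to be converted. The flag names below follow the examples in the NeMo documentation as I recall them and may differ between releases, so please check the script's --help; the vocab file and output prefix are placeholders:

python /workspace/data/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=/workspace/data/NeMo/lm/data/public-data/train.jsonl \
    --json-keys=text \
    --tokenizer-library=megatron \
    --tokenizer-type=BertWordPieceCase \
    --vocab-file=/path/to/vocab.txt \
    --dataset-impl=mmap \
    --split-sentences \
    --output-prefix=/workspace/data/NeMo/lm/data/public-data/train_bert \
    --workers=16

This should write a .bin/.idx index pair under the given output prefix, which the Megatron BERT training config then points at instead of a raw text file, so the dataset is memory-mapped rather than indexed in RAM.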
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.