
Train LM with a large file

Phakkhamat opened this issue 3 years ago

I am trying to train on ~5.6 GB of data with ~700 MB of validation data, using the command below:

python /workspace/data/NeMo/examples/nlp/language_modeling/bert_pretraining.py \
    --config-name=/workspace/data/NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml \
    model.train_ds.data_file="/workspace/data/NeMo/lm/data/public-data/train.txt" \
    model.validation_ds.data_file="/workspace/data/NeMo/lm/data/public-data/val.txt" \
    model.train_ds.batch_size=128 \
    model.optim.lr=5e-5 \
    trainer.max_epochs=1 \
    trainer.gpus=1

Then it shows this error:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
[NeMo I 2022-08-09 02:50:20 exp_manager:216] Experiments will be logged at /workspace/data/NeMo/public-data-lm/nemo_experiments/PretrainingBERTFromText/2022-08-09_02-50-20
[NeMo I 2022-08-09 02:50:20 exp_manager:563] TensorboardLogger has been set up
0%| | 0/1 [00:48<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/data/bew/NeMo/examples/nlp/language_modeling/bert_pretraining.py", line 31, in main
    bert_model = BERTLMModel(cfg.model, trainer=trainer)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 63, in __init__
    super().__init__(cfg=cfg, trainer=trainer)
  File "/opt/conda/lib/python3.6/site-packages/nemo/core/classes/modelPT.py", line 127, in __init__
    self.setup_training_data(self._cfg.train_ds)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 192, in setup_training_data
    else self._setup_dataloader(train_data_config)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 235, in _setup_dataloader
    short_seq_prob=cfg.short_seq_prob,
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/data/language_modeling/lm_bert_dataset.py", line 90, in __init__
    sentence_indices[filename] = array.array("I", newline_indices)
OverflowError: unsigned int is greater than maximum

Training works fine with smaller data; I only get this error with the large file.
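
The last frame of the traceback points at the likely cause: lm_bert_dataset.py stores what appear to be newline byte offsets in an array.array("I", ...), and the type code "I" is a 32-bit unsigned int on CPython, which caps out at 2**32 - 1 (about 4 GiB). Offsets into a ~5.6 GB file exceed that limit. A minimal sketch of the failure and of a 64-bit workaround (variable names here are illustrative, not NeMo's):

import array

# An offset past the 4 GiB mark does not fit in a 32-bit unsigned int,
# which is what the type code "I" means on CPython.
offset_in_large_file = 5_600_000_000  # ~5.6 GB into the file

try:
    array.array("I", [offset_in_large_file])
except OverflowError as err:
    print(err)  # unsigned int is greater than maximum

# The 64-bit type code "Q" (unsigned long long) holds it without issue,
# so patching the dataset to use "Q" instead of "I" is one possible workaround.
indices = array.array("Q", [offset_in_large_file])
print(indices[0])  # 5600000000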

Environment details

  • Ubuntu
  • PyTorch 1.8.0a0+17f8c32
  • PyTorch Lightning 1.2.10
  • Python 3.6.10
  • DGX V100 (32 GB)
  • NeMo 1.0.0rc1

Phakkhamat · Aug 09 '22

Hi, it looks like you are using an old version of NeMo. Could you try pretraining BERT with our latest release, 1.10.0, using our new NeMo Megatron BERT training script: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_bert_pretraining.py

The data preprocessing script is here: https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py
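
For reference, a preprocessing invocation along these lines should produce the .bin/.idx files that Megatron-style training consumes. This is a sketch only: the exact flag set varies across NeMo versions, the paths and tokenizer choice are placeholders, and the script expects JSON-lines input with a "text" field rather than raw text, so check the script's --help first.

python /workspace/data/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=/workspace/data/NeMo/lm/data/public-data/train.jsonl \
    --json-keys=text \
    --tokenizer-library=megatron \
    --tokenizer-type=BertWordPieceLowerCase \
    --vocab vocab.txt \
    --dataset-impl=mmap \
    --split-sentences \
    --output-prefix=bert_train_data \
    --workers=8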

NeMo Megatron BERT scales to large datasets as well as large model sizes. A rough training invocation is sketched below.
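
Training would then point at the preprocessed prefix through the Megatron config overrides, roughly as follows. Again a sketch, not a verified command: the option names follow the megatron_bert_config.yaml conventions in recent NeMo releases, and the data prefix assumes the --split-sentences output naming, so verify both against the config shipped with your release.

python /workspace/data/NeMo/examples/nlp/language_modeling/megatron_bert_pretraining.py \
    trainer.devices=1 \
    trainer.precision=16 \
    trainer.max_steps=100000 \
    model.micro_batch_size=128 \
    model.data.data_prefix=[1.0,bert_train_data_text_sentence] \
    model.tokenizer.vocab_file=vocab.txt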

ericharper · Aug 24 '22

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Oct 06 '22

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Oct 13 '22