
Train LM with a large file

Phakkhamat opened this issue 3 years ago

I am trying to train on ~5.6 GB of data with ~700 MB of validation data, using the command below:

python /workspace/data/NeMo/examples/nlp/language_modeling/bert_pretraining.py \
    --config-name=/workspace/data/NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml \
    model.train_ds.data_file="/workspace/data/NeMo/lm/data/public-data/train.txt" \
    model.validation_ds.data_file="/workspace/data/NeMo/lm/data/public-data/val.txt" \
    model.train_ds.batch_size=128 \
    model.optim.lr=5e-5 \
    trainer.max_epochs=1 \
    trainer.gpus=1

Then it shows this error:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
[NeMo I 2022-08-09 02:50:20 exp_manager:216] Experiments will be logged at /workspace/data/NeMo/public-data-lm/nemo_experiments/PretrainingBERTFromText/2022-08-09_02-50-20
[NeMo I 2022-08-09 02:50:20 exp_manager:563] TensorboardLogger has been set up
0%| | 0/1 [00:48<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/data/bew/NeMo/examples/nlp/language_modeling/bert_pretraining.py", line 31, in main
    bert_model = BERTLMModel(cfg.model, trainer=trainer)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 63, in __init__
    super().__init__(cfg=cfg, trainer=trainer)
  File "/opt/conda/lib/python3.6/site-packages/nemo/core/classes/modelPT.py", line 127, in __init__
    self.setup_training_data(self._cfg.train_ds)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 192, in setup_training_data
    else self._setup_dataloader(train_data_config)
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/models/language_modeling/bert_lm_model.py", line 235, in _setup_dataloader
    short_seq_prob=cfg.short_seq_prob,
  File "/opt/conda/lib/python3.6/site-packages/nemo/collections/nlp/data/language_modeling/lm_bert_dataset.py", line 90, in __init__
    sentence_indices[filename] = array.array("I", newline_indices)
OverflowError: unsigned int is greater than maximum

Training works fine with smaller data; I only get this error with the large file.
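
The last frame of the traceback points at the likely cause: lm_bert_dataset.py stores what appear to be newline byte offsets in an array.array("I", ...), and the type code "I" is a 32-bit unsigned int on CPython, which caps out at 2**32 - 1 (about 4 GiB). Offsets into a ~5.6 GB file exceed that limit. A minimal sketch of the failure and of a 64-bit workaround (variable names here are illustrative, not NeMo's):

import array

# An offset past the 4 GiB mark does not fit in a 32-bit unsigned int,
# which is what the type code "I" means on CPython.
offset_in_large_file = 5_600_000_000  # ~5.6 GB into the file

try:
    array.array("I", [offset_in_large_file])
except OverflowError as err:
    print(err)  # unsigned int is greater than maximum

# The 64-bit type code "Q" (unsigned long long) holds it without issue,
# so patching the dataset to use "Q" instead of "I" is one possible workaround.
indices = array.array("Q", [offset_in_large_file])
print(indices[0])  # 5600000000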

Environment details

  • Ubuntu
  • PyTorch 1.8.0a0+17f8c32
  • PyTorch Lightning 1.2.10
  • Python 3.6.10
  • DGX V100 (32 GB)
  • NeMo 1.0.0rc1

Phakkhamat · Aug 09 '22

Hi, it looks like you are using an old version of NeMo. Could you try pretraining BERT with our latest release, 1.10.0, using our new NeMo Megatron BERT training script: https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_bert_pretraining.py

The data preprocessing script is here: https://github.com/NVIDIA/NeMo/blob/main/scripts/nlp_language_modeling/preprocess_data_for_megatron.py
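
For reference, a preprocessing invocation along these lines should produce the .bin/.idx files that Megatron-style training consumes. This is a sketch only: the exact flag set varies across NeMo versions, the paths and tokenizer choice are placeholders, and the script expects JSON-lines input with a "text" field rather than raw text, so check the script's --help first.

python /workspace/data/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=/workspace/data/NeMo/lm/data/public-data/train.jsonl \
    --json-keys=text \
    --tokenizer-library=megatron \
    --tokenizer-type=BertWordPieceLowerCase \
    --vocab vocab.txt \
    --dataset-impl=mmap \
    --split-sentences \
    --output-prefix=bert_train_data \
    --workers=8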

NeMo Megatron BERT scales to large datasets as well as large model sizes. A rough training invocation is sketched below.
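
Training would then point at the preprocessed prefix through the Megatron config overrides, roughly as follows. Again a sketch, not a verified command: the option names follow the megatron_bert_config.yaml conventions in recent NeMo releases, and the data prefix assumes the --split-sentences output naming, so verify both against the config shipped with your release.

python /workspace/data/NeMo/examples/nlp/language_modeling/megatron_bert_pretraining.py \
    trainer.devices=1 \
    trainer.precision=16 \
    trainer.max_steps=100000 \
    model.micro_batch_size=128 \
    model.data.data_prefix=[1.0,bert_train_data_text_sentence] \
    model.tokenizer.vocab_file=vocab.txt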

ericharper · Aug 24 '22

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Oct 06 '22

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Oct 13 '22