create_pretraining_data.py keeps getting killed

Open lxxhxxxjxxxx opened this issue 5 years ago • 7 comments

Hello. I'm working on a BERT pretraining project using GCP (Google Cloud Platform).

Before getting to the TPU stage of executing run_pretraining.py, I got stuck creating the pretraining data.

Here is the .sh script for create_pretraining_data.py:

python3 create_pretraining_data.py \
  --input_file $DATA_DIR/data_1.txt \
  --output_file $OUTPUT_DIR \
  --do_lower_case=True \
  --do_whole_word_mask=True \
  --max_seq_length 512 \
  --max_predictions_per_seq 70 \
  --masked_lm_prob 0.15 \
  --vocab_file $VOCAB_DIR \
  --codes_file $CODES_DIR \
  --dupe_factor 1

The input text is about 40GB, which seems too large, so I split the data into 18 files, each about 1.2GB.

At first I tried setting dupe_factor to 10, but that also seemed to cause memory issues, so I set dupe_factor to 1 and plan to repeat the run 10 times with a different random_seed each time (roughly like the loop sketched below).
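
Roughly something like this (just a sketch; the seed range and the output file naming are placeholders, not something I've settled on):

for seed in 1 2 3 4 5 6 7 8 9 10; do
  python3 create_pretraining_data.py \
    --input_file $DATA_DIR/data_1.txt \
    --output_file $OUTPUT_DIR/tf_examples_seed${seed}.tfrecord \
    --vocab_file $VOCAB_DIR \
    --codes_file $CODES_DIR \
    --do_lower_case=True \
    --do_whole_word_mask=True \
    --max_seq_length 512 \
    --max_predictions_per_seq 70 \
    --masked_lm_prob 0.15 \
    --dupe_factor 1 \
    --random_seed $seed
done

Each pass uses a different --random_seed so the masking differs, which is what dupe_factor would otherwise do in one (memory-hungry) run.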

Even though I tried to execute create_pretraining_data.py in a minimal environment, it kept getting killed, and I've only finished 1 of the 18 files.

It happened both on GCP and on my local machine.

Does anyone have an idea how to solve this "catastrophic" situation? The project has been delayed because of this issue and I don't know what to do anymore.

lxxhxxxjxxxx avatar Jul 24 '19 06:07 lxxhxxxjxxxx

Seems to be an OOM problem. Have you tried feeding a small text file (~1M) into the script?

ymcui avatar Jul 24 '19 08:07 ymcui

@ymcui Yes, and that worked fine. I also think memory is the issue, since the logic of create_pretraining_data.py is not that memory-efficient.

lxxhxxxjxxxx avatar Jul 24 '19 09:07 lxxhxxxjxxxx

I would increase the number of shards from 18 to maybe 32, then run create_pretraining_data.py on each shard one by one (a quick script can automate this; see the sketch below).
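
Something along these lines (a sketch; the shard naming and output paths are placeholders):

for shard in $DATA_DIR/shard_*.txt; do
  python3 create_pretraining_data.py \
    --input_file $shard \
    --output_file $OUTPUT_DIR/$(basename $shard .txt).tfrecord \
    --vocab_file $VOCAB_DIR \
    --do_lower_case=True \
    --do_whole_word_mask=True \
    --max_seq_length 512 \
    --max_predictions_per_seq 70 \
    --masked_lm_prob 0.15 \
    --dupe_factor 1
done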

jaymody avatar Jul 25 '19 03:07 jaymody

Memory usage here depends especially on the maximum sequence length, but since you are on TPU that should not be a problem for training. Can you lower it to see if there is still an OOM?

salahalaoui avatar Jul 31 '19 13:07 salahalaoui

If this helps anyone:

I started with a data file over 3GB containing over 7 million sentences. The VM was running out of RAM after a couple of hours of running (I had about 102GB of RAM on the VM), eventually leading the system into resource starvation with weird errors.

If you do not have infinite RAM, as a remediation you can shard the data files like below:

split -d -l 250000 data_file.txt data_file_shard
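# -d: numeric suffixes; -l 250000: 250,000 lines per output file; data_file_shard: output prefix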

I chose 250k lines per file, and it worked. You can try a different size based on your system configuration.

After this, I was able to generate any number of tf_record files. The run_pretraining.py step can take its input as a glob like tf_examples.tf_record*, so this small additional step solved the issue, completing over 3GB of data processing in about 2-3 hours (a rough sketch of the glob usage is below). I can share scripts if anyone still has issues with how to split the data and loop over the n shard files to create the TFRecords ...
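
For example, the training step can pick up all the shards at once via the glob (a sketch; $MODEL_DIR, $BERT_BASE_DIR, and the hyperparameter values here are placeholders, not the exact ones I used):

python3 run_pretraining.py \
  --input_file="$OUTPUT_DIR/tf_examples.tf_record*" \
  --output_dir=$MODEL_DIR \
  --do_train=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=512 \
  --max_predictions_per_seq=70 \
  --num_train_steps=100000 \
  --num_warmup_steps=10000 \
  --learning_rate=1e-4

Quoting the glob keeps the shell from expanding it, so run_pretraining.py resolves the pattern itself. Note that max_seq_length and max_predictions_per_seq must match the values used when creating the pretraining data.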

Good luck!!

anshoomehra avatar Aug 27 '19 19:08 anshoomehra

@anshoomehra Could you share the script? Thanks!

calusbr avatar Oct 19 '19 02:10 calusbr

I followed @anshoomehra's suggestions and it works: run a bash script like the one below, and remember to add the path of the Python file at the top.
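
A sketch of such a script (the paths, shard prefix, and output names are placeholders; adjust them to your setup):

#!/usr/bin/env bash
# Placeholder paths -- set these to your own locations
BERT_DIR=/path/to/bert
DATA_DIR=/path/to/data
OUTPUT_DIR=/path/to/output

# 1. Shard the raw text so each run fits in RAM (250k lines per shard)
split -d -l 250000 $DATA_DIR/data_file.txt $DATA_DIR/data_file_shard

# 2. Create pretraining data for each shard, one at a time
for shard in $DATA_DIR/data_file_shard*; do
  python3 $BERT_DIR/create_pretraining_data.py \
    --input_file $shard \
    --output_file $OUTPUT_DIR/tf_examples.tf_record.$(basename $shard) \
    --vocab_file $BERT_DIR/vocab.txt \
    --do_lower_case=True \
    --do_whole_word_mask=True \
    --max_seq_length 512 \
    --max_predictions_per_seq 70 \
    --masked_lm_prob 0.15 \
    --dupe_factor 1
done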

ziyicui2022 avatar Jul 06 '21 04:07 ziyicui2022