Cannot run BERT-Large training successfully on bare metal
We are running BERT-Large training on a bare-metal Ubuntu server. The log shows no errors, but it also contains no training output, which is confusing.
Command:
python ./launch_benchmark.py \
--model-name=bert_large \
--precision=fp32 \
--mode=training \
--framework=tensorflow \
--batch-size=24 \
--benchmark-only \
--data-location=$BERT_LARGE_DIR \
--num-inter-threads=1 \
-- train-option=SQuAD DEBIAN_FRONTEND=noninteractive config_file=$BERT_LARGE_DIR/bert_config.json init_checkpoint=$BERT_LARGE_DIR/bert_model.ckpt vocab_file=$BERT_LARGE_DIR/vocab.txt train_file=$SQUAD_DIR/train-v1.1.json predict_file=$SQUAD_DIR/dev-v1.1.json do-train=True learning-rate=1.5e-5 max-seq-length=384 do_predict=True warmup-steps=0 num_train_epochs=0.1 doc_stride=128 do_lower_case=False experimental-gelu=False mpi_workers_sync_gradients=True
The log:
INFO:tensorflow:Graph was finalized.
I0625 09:40:30.595448 140247941625664 monitored_session.py:246] Graph was finalized.
2021-06-25 09:40:30.595915: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-25 09:40:30.764862: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2892875000 Hz
2021-06-25 09:40:30.767997: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c703127e80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-25 09:40:30.768068: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Running local_init_op.
I0625 09:40:50.980941 140247941625664 session_manager.py:505] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0625 09:40:51.142987 140247941625664 session_manager.py:508] Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
I0625 09:41:02.433922 140247941625664 basic_session_run_hooks.py:614] Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
I0625 09:41:02.434337 140247941625664 basic_session_run_hooks.py:618] Saving checkpoints for 0 into /home/shen/models/benchmarks/common/tensorflow/logs/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
I0625 09:41:08.454857 140247941625664 basic_session_run_hooks.py:626] Calling checkpoint listeners after saving checkpoint 0...
INFO:Running SQuAD...!
----------------------------Run command-------------------------------------
So there is no training result in the log.
@dmsuehir @ashahba, could you please help troubleshoot this?
Thanks
Hi @dmsuehir, do you have any ideas about this issue?
Thanks
@zhixingheyi-tian Is this the same issue that's being discussed in the email thread with Wei? It sounded like the next steps were to make sure that you are pip installing intel-tensorflow instead of just tensorflow.
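A quick way to see which of the two pip distributions is present in the environment is sketched below. It is only a minimal check using the standard-library `importlib.metadata` (Python 3.8+); the helper name `installed_tf_distributions` is made up for illustration and is not part of the model zoo scripts.

```python
from importlib import metadata


def installed_tf_distributions(names=("tensorflow", "intel-tensorflow")):
    """Map each pip distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; it only inspects package metadata
    and does not import TensorFlow itself.
    """
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, version in installed_tf_distributions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same Python interpreter that launches `launch_benchmark.py`. If `tensorflow` shows a version and `intel-tensorflow` shows "not installed", the stock package is the one in use.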
@zhixingheyi-tian: can you confirm if the issue is resolved? If not, can you try our latest optimizations for BERT-Large training here: https://github.com/IntelAI/models/tree/r3.1/quickstart/language_modeling/tensorflow/bert_large/training/cpu ?