OneFlow-Benchmark
CNN benchmark cannot run
I followed the instructions in your CNN benchmark to train ResNet-50 with sync data. After I executed train.sh, it failed with the following output. Can you offer some help?
------------------------------------------------------------------
Time stamp: 2020-09-15-13:38:02
Traceback (most recent call last):
File "of_cnn_train_val.py", line 64, in <module>
@flow.global_function("train", get_train_config(args))
File "/home/duanjiangfei/OneFlow-Benchmark/Classification/cnns/job_function_util.py", line 33, in get_train_config
train_config = _default_config(args)
File "/home/duanjiangfei/OneFlow-Benchmark/Classification/cnns/job_function_util.py", line 28, in _default_config
config.enable_fuse_add_to_output(True)
File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/function_util.py", line 54, in __getattr__
assert attr_name in name2default
AssertionError
enable_fuse_add_to_output is a new feature which speeds up ResNet-50 training. Please comment out the line config.enable_fuse_add_to_output(True) to avoid this error.
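If it helps, here is a minimal sketch of the workaround in Classification/cnns/job_function_util.py, assuming your local copy matches the traceback above; only the commented-out call is the actual change, the rest of the function is elided:

```python
import oneflow as flow

def _default_config(args):
    config = flow.function_config()
    # ... other configuration calls from the original file stay unchanged ...
    # config.enable_fuse_add_to_output(True)  # not supported by older oneflow wheels; comment out for now
    return config
```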
@ShawnXuan After that, I ran into other errors. It seems the version of OneFlow-Benchmark is not consistent with the version of oneflow, which causes many errors.
I can train BERT on a single node. But for two nodes, I use this script:
NUM_NODES=$1
NODE_IPS=$2
DATA_DIR=/home/duanjiangfei/OneFlow-Benchmark/LanguageModeling/BERT/wiki_ofrecord_seq_len_128_example
python run_pretraining.py \
--gpu_num_per_node=8 \
--num_nodes=$NUM_NODES \
--node_ips=$NODE_IPS \
--learning_rate=1e-4 \
--batch_size_per_device=64 \
--iter_num=100 \
--loss_print_every_n_iter=20 \
--seq_length=128 \
--max_predictions_per_seq=20 \
--num_hidden_layers=12 \
--num_attention_heads=12 \
--max_position_embeddings=512 \
--type_vocab_size=2 \
--vocab_size=30522 \
--attention_probs_dropout_prob=0.1 \
--hidden_dropout_prob=0.1 \
--hidden_size_per_head=64 \
--data_part_num=1 \
--data_dir=$DATA_DIR \
--log_dir=./log \
--model_save_every_n_iter=10000 \
--save_last_snapshot=False \
--model_save_dir=./snapshots
But I get the following error
Time stamp: 2020-09-16-01:52:58
[libprotobuf ERROR /oneflow-src/manylinux2014-build-cache-cuda-10.1/build-third-party/protobuf/src/protobuf/src/google/protobuf/text_format.cc:303] Error parsing text-format oneflow.EnvProto: Message missing required fields: ctrl_port
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0916 01:52:58.912607 194902 error.cpp:26] Check failed: TxtString2PbMessage(env_proto_str, &env_proto) failed to parse env_protomachine {
id: 0
addr: "10.5.8.54"
}
machine {
id: 1
addr: "10.5.8.69"
}
cpp_logging_conf {
log_dir: "./log"
}
grpc_use_no_signal: true
Traceback (most recent call last):
File "run_pretraining.py", line 120, in <module>
main()
File "run_pretraining.py", line 104, in main
snapshot = Snapshot(args.model_save_dir, args.model_load_dir)
File "/home/duanjiangfei/OneFlow-Benchmark/LanguageModeling/BERT/util.py", line 48, in __init__
self._check_point.init()
File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_context.py", line 49$
in Func
GetDefaultSession().TryInit()
File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_util.py", line 204, $
n TryInit
self.Init()
File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_util.py", line 211, $
n Init
oneflow.env.init()
File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/env_util.py", line 53, in api
_env_init
return enable_if.unique([env_init, do_nothing])()
File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/env_util.py", line 61, in env
_init
c_api_util.InitEnv(default_env_proto)
File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/c_api_util.py", line 100, in
InitEnv
raise JobBuildAndInferError(error)
oneflow.python.framework.job_build_and_infer_error.JobBuildAndInferError:
error msg:
check_failed_error {
}
Check failed: TxtString2PbMessage(env_proto_str, &env_proto) failed to parse env_protomachine {
id: 0
addr: "10.5.8.54"
}
machine {
id: 1
addr: "10.5.8.69"
}
cpp_logging_conf {
log_dir: "./log"
}
grpc_use_no_signal: true
Do you know what the reason is? @ShawnXuan
@JF-D Sorry, this is a stupid mistake in the script. Please uncomment the following line; we will update the script today.
https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/5c2b305312ee9141208fad71cba3a9f05da69dd4/LanguageModeling/BERT/util.py#L29
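For anyone hitting the same error before the script update lands: the failure above complains that EnvProto is missing its required ctrl_port field, so the commented-out line most likely sets the control port for multi-node runs. A hedged sketch follows; the exact call and port value are assumptions, not the literal content of util.py line 29:

```python
import oneflow as flow

# Hypothetical multi-node environment setup; only needed when num_nodes > 1.
# The port number here is an arbitrary example.
flow.env.ctrl_port(12138)
```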
Thanks a lot. The BERT benchmark now runs successfully. But the CNN benchmark still cannot run due to https://github.com/Oneflow-Inc/OneFlow-Benchmark/issues/130#issuecomment-692714197
Thanks. This is due to the incompatibility between the benchmark scripts and the current oneflow release. You can try building oneflow from source to use the latest code. We will also try our best to release a new version as soon as possible.
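As a quick sanity check (a generic snippet, not something from the benchmark repo), you can print the installed wheel version in the environment that launches the scripts to confirm which oneflow release they are running against:

```python
# Generic version check via pip metadata; works regardless of whether the
# package exposes __version__.
import pkg_resources
print(pkg_resources.get_distribution("oneflow").version)
```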
The default values of fuse_bn_relu and fuse_bn_add_relu have been temporarily changed to False and will be set back to True after the next oneflow release. Please update your code; it should be fixed now. Thanks! @JF-D
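To illustrate what this change amounts to (a hypothetical sketch, not the actual config.py code): the two options now default to off and have to be opted into explicitly on builds that support the fused kernels:

```python
import argparse

# Hypothetical excerpt mirroring the described change: both fusions default to
# False for compatibility with released oneflow wheels.
def str2bool(v):
    return str(v).lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("--fuse_bn_relu", type=str2bool, default=False,
                    help="fuse BatchNorm + ReLU (needs a recent oneflow build)")
parser.add_argument("--fuse_bn_add_relu", type=str2bool, default=False,
                    help="fuse BatchNorm + Add + ReLU (needs a recent oneflow build)")

args = parser.parse_args(["--fuse_bn_relu", "True"])  # example: opt back in
print(args.fuse_bn_relu, args.fuse_bn_add_relu)       # True False
```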
@ShawnXuan Thanks. @yuanms2 I think you should add some git tags to clarify the different versions of the benchmark.
I have one more question. You only release the speed of the BERT-Base model; have you tried BERT-Large? I can reproduce a similar speed with the benchmark: BERT-Base throughput is about 145 samples/s. My machine is a 32 GB V100 (SXM2) with PyTorch 1.5 and CUDA 10.1. Since I have some BERT-Large results I measured about two months ago, I compared them with the OneFlow benchmark. OneFlow BERT-Large runs at about 45 samples/s (~1400 ms/iter), while my PyTorch result is about 800 ms/iter (single card, batch size 64). This result doesn't quite match the benchmark.
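Putting the two results in the same units makes the gap explicit (simple arithmetic on the figures quoted above, batch size 64 on a single card; nothing here is newly measured):

```python
# Convert the quoted per-iteration times to throughput for a direct comparison.
batch_size = 64
for name, ms_per_iter in [("OneFlow BERT-Large", 1400), ("PyTorch BERT-Large", 800)]:
    print(f"{name}: {batch_size / (ms_per_iter / 1000.0):.1f} samples/s")
# OneFlow BERT-Large: 45.7 samples/s
# PyTorch BERT-Large: 80.0 samples/s
```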
@JF-D Thank you. We will look into BERT-Large training.