
CNN benchmark cannot run

Open JF-D opened this issue 4 years ago • 9 comments

I followed the instructions in your CNN benchmark to train ResNet50 with sync data. After I executed train.sh, it failed with the following output. Can you offer some help?

------------------------------------------------------------------
Time stamp: 2020-09-15-13:38:02
Traceback (most recent call last):
  File "of_cnn_train_val.py", line 64, in <module>
    @flow.global_function("train", get_train_config(args))
  File "/home/duanjiangfei/OneFlow-Benchmark/Classification/cnns/job_function_util.py", line 33, in get_train_config
    train_config = _default_config(args)
  File "/home/duanjiangfei/OneFlow-Benchmark/Classification/cnns/job_function_util.py", line 28, in _default_config
    config.enable_fuse_add_to_output(True)
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/function_util.py", line 54, in __getattr__
    assert attr_name in name2default
AssertionError

JF-D avatar Sep 15 '20 05:09 JF-D

enable_fuse_add_to_output is a new feature that speeds up ResNet50 training. Please try commenting out the line config.enable_fuse_add_to_output(True) to avoid this error.
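Instead of deleting the line, the call can also be guarded so the same script runs on both old and new releases. A minimal sketch of the idea, using a stand-in class rather than the real oneflow internals (the traceback shows that older versions raise AssertionError from `__getattr__` for unknown config options):

```python
# Stand-in mimicking oneflow's function_config attribute lookup; the class
# and option names here are illustrative, not the real oneflow internals.
class FunctionConfig:
    _known = {"default_data_type"}  # options this "version" supports

    def __getattr__(self, name):
        # Old oneflow releases assert the option name is known and raise
        # AssertionError otherwise, exactly as in the traceback above.
        assert name in self._known
        return lambda *args: None

def configure(config):
    # Guarded call: skip the new option when the installed version lacks it.
    try:
        config.enable_fuse_add_to_output(True)
    except AssertionError:
        pass  # option not available in this release; fall back silently
    return config

configure(FunctionConfig())  # completes without error on the "old" version
```

With the real library, the same try/except around `config.enable_fuse_add_to_output(True)` in `_default_config` would keep the speedup on new releases while degrading gracefully on old ones.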

ShawnXuan avatar Sep 15 '20 09:09 ShawnXuan

@ShawnXuan After that, I run into other errors. It seems the version of OneFlow-Benchmark is not consistent with the version of OneFlow, which causes many errors. (screenshot attached)

JF-D avatar Sep 15 '20 13:09 JF-D

I can train BERT on a single node. But for two nodes, I use this script:

NUM_NODES=$1
NODE_IPS=$2

DATA_DIR=/home/duanjiangfei/OneFlow-Benchmark/LanguageModeling/BERT/wiki_ofrecord_seq_len_128_example
python run_pretraining.py \
  --gpu_num_per_node=8 \
  --num_nodes=$NUM_NODES \
  --node_ips=$NODE_IPS \
  --learning_rate=1e-4 \
  --batch_size_per_device=64 \
  --iter_num=100 \
  --loss_print_every_n_iter=20 \
  --seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_hidden_layers=12 \
  --num_attention_heads=12 \
  --max_position_embeddings=512 \
  --type_vocab_size=2 \
  --vocab_size=30522 \
  --attention_probs_dropout_prob=0.1 \
  --hidden_dropout_prob=0.1 \
  --hidden_size_per_head=64 \
  --data_part_num=1 \
  --data_dir=$DATA_DIR \
  --log_dir=./log \
  --model_save_every_n_iter=10000 \
  --save_last_snapshot=False \
  --model_save_dir=./snapshots

But I get the following error:

Time stamp: 2020-09-16-01:52:58
[libprotobuf ERROR /oneflow-src/manylinux2014-build-cache-cuda-10.1/build-third-party/protobuf/src/protobuf/src/google/protobuf/text_format.cc:303] Error parsing text-format oneflow.EnvProto: Message missing required fields: ctrl_port
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0916 01:52:58.912607 194902 error.cpp:26]  Check failed: TxtString2PbMessage(env_proto_str, &env_proto)  failed to parse env_proto
machine {
  id: 0
  addr: "10.5.8.54"
}
machine {
  id: 1
  addr: "10.5.8.69"
}
cpp_logging_conf {
  log_dir: "./log"
}
grpc_use_no_signal: true
Traceback (most recent call last):
  File "run_pretraining.py", line 120, in <module>
    main()
  File "run_pretraining.py", line 104, in main
    snapshot = Snapshot(args.model_save_dir, args.model_load_dir)
  File "/home/duanjiangfei/OneFlow-Benchmark/LanguageModeling/BERT/util.py", line 48, in __init__
    self._check_point.init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_context.py", line 49, in Func
    GetDefaultSession().TryInit()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_util.py", line 204, in TryInit
    self.Init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/session_util.py", line 211, in Init
    oneflow.env.init()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/env_util.py", line 53, in api_env_init
    return enable_if.unique([env_init, do_nothing])()
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/env_util.py", line 61, in env_init
    c_api_util.InitEnv(default_env_proto)
  File "/home/duanjiangfei/.local.pt1.5s1/lib/python3.7/site-packages/oneflow/python/framework/c_api_util.py", line 100, in InitEnv
    raise JobBuildAndInferError(error)
oneflow.python.framework.job_build_and_infer_error.JobBuildAndInferError:

error msg:

check_failed_error {
}

Check failed: TxtString2PbMessage(env_proto_str, &env_proto)  failed to parse env_proto
machine {
  id: 0
  addr: "10.5.8.54"
}
machine {
  id: 1
  addr: "10.5.8.69"
}
cpp_logging_conf {
  log_dir: "./log"
}
grpc_use_no_signal: true

Do you know what the reason is? @ShawnXuan

JF-D avatar Sep 15 '20 17:09 JF-D

@JF-D sorry. this is a stupid mistake in the script. Please uncomment the following line. We will update the script today.

https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/5c2b305312ee9141208fad71cba3a9f05da69dd4/LanguageModeling/BERT/util.py#L29
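This matches the first line of the error: EnvProto requires a ctrl_port field, and in multi-node runs it is set from Python before the session starts. The commented-out line presumably looks roughly like the following; the exact port value in the script may differ:

```python
import oneflow as flow

# Required for multi-node runs: fills the EnvProto.ctrl_port field that the
# text-format parser reported as missing. Port number here is illustrative;
# use the value that appears in util.py.
flow.env.ctrl_port(12138)
```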

yuanms2 avatar Sep 16 '20 01:09 yuanms2

Thanks a lot. The BERT benchmark now runs successfully, but I still cannot run the CNN benchmark due to https://github.com/Oneflow-Inc/OneFlow-Benchmark/issues/130#issuecomment-692714197

JF-D avatar Sep 16 '20 05:09 JF-D

Thanks. This is due to an incompatibility between the benchmark scripts and the OneFlow release. You can try building from source to use the latest OneFlow. We will also try our best to release a new version ASAP.

yuanms2 avatar Sep 16 '20 08:09 yuanms2

The default values of fuse_bn_relu and fuse_bn_add_relu have been temporarily changed to False, and will be reverted to True after the next OneFlow release. Please update your code; it should be fixed now. Thanks! @JF-D

ShawnXuan avatar Sep 17 '20 08:09 ShawnXuan

@ShawnXuan Thanks. @yuanms2 I think you should add some git tags to distinguish the different versions of the benchmark.

I have one more question. You only release the speed of the BERT-base model; have you tried BERT-large? I can reproduce a similar BERT-base speed using the benchmark, about 145 samples/s. My machine is a 32 GB V100 (SXM2) with PyTorch 1.5 and CUDA 10.1. Since I have some BERT-large results from about two months ago, I compared them with the OneFlow benchmark: OneFlow BERT-large runs at about 45 samples/s (~1400 ms/iter), while my PyTorch result is about 800 ms/iter (single card, batch size 64). This result doesn't quite match the benchmark.
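As a sanity check on those numbers, per-iteration latency and throughput convert directly given the per-device batch size of 64 (the helper name below is illustrative):

```python
def throughput(batch_size, iter_ms):
    """Samples per second given per-iteration latency in milliseconds."""
    return batch_size * 1000.0 / iter_ms

# OneFlow BERT-large: ~1400 ms/iter at batch 64 -> ~45.7 samples/s,
# consistent with the ~45 samples/s figure quoted above.
print(round(throughput(64, 1400), 1))

# PyTorch BERT-large: ~800 ms/iter at batch 64 -> 80.0 samples/s.
print(round(throughput(64, 800), 1))
```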

JF-D avatar Sep 18 '20 10:09 JF-D

@JF-D Thank you. We will look into BERT-large training.

yuanms2 avatar Sep 19 '20 13:09 yuanms2