Server core dump in PS async training mode
When DistributedStrategy's a_sync=True, the server core dumps; with a_sync=False everything runs normally.
2 servers, 3 workers, CPU mode; both Adam and SGD core at the position shown in the screenshot. What could the problem be?
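For reference, in PaddleRec's static PS trainer the async/sync choice is normally driven by the yaml config rather than by setting DistributedStrategy in user code. A minimal sketch of the relevant key, assuming the usual mapping of `runner.sync_mode` onto `a_sync`:

```yaml
runner:
  # "async" is expected to map to DistributedStrategy.a_sync = True
  # (the crashing mode here); "sync" would correspond to a_sync = False
  sync_mode: "async"
```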

Could you describe in detail which script you launched with, including the launch command and parameters? Also, which Paddle and PaddleRec versions are you using? We have made quite a few changes to this code recently and have not been able to reproduce your problem; we need a more detailed description of the reproduction steps.
I also asked in the Paddle repo; see https://github.com/PaddlePaddle/Paddle/issues/37346. I'm using v2.2.0, released last Friday. Launch command: fleetrun --worker_num=4 --server_num=3 tools/static_ps_trainer.py -m models/rank/dlrm/config_bigdata.yaml. Paddle version: 2.2.0 (CPU). Python: 3.6.8. CentOS Linux release 7.2 (Final).
I tried the same command in the official Paddle image; after a few batches of training it still cored with the same error. @yinhaofeng
I still cannot reproduce your error. Could you explain in more detail how you arrived at it, and whether you changed any code or configuration?
1. Use the official registry.baidubce.com/paddlepaddle/paddle:2.2.0 image directly (CPU version).
2. Clone PaddleRec, then download the full Criteo dataset under datasets.
3. Change use_gpu to false in models/rank/dlrm/config_bigdata.yaml.
4. From the PaddleRec root directory, run fleetrun --worker_num=4 --server_num=3 tools/static_ps_trainer.py -m models/rank/dlrm/config_bigdata.yaml.
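The edit in step 3 touches a single key; a sketch of the change in config_bigdata.yaml (all other keys left as shipped):

```yaml
runner:
  use_gpu: false  # run the PS job on CPU; true would require a GPU build of Paddle
```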

Which PaddleRec version are you using, and have you pulled the latest code?
v2.2.0, with the latest code.
Single-machine training runs fine, but it cores in PS mode. The machine has 96 cores and 256 GB of RAM, so memory is not the issue.
I also implemented a DLRM myself earlier, and it cored as well.
A log directory is produced when you run. Please send us screenshots of workerlog.0, serverlog.0, and the console output. If the full dataset takes too long, try the demo data and see whether the same bug appears.
registry.baidubce.com/paddlepaddle/paddle:2.2.0
server.0: grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
+=======================================================================================+
| PaddleRec Benchmark Envs Value |
| hyper_parameters.bot_layer_sizes [512, 256, 64, 16] |
| runner.epochs 1 |
| runner.infer_batch_size 2048 |
| runner.infer_start_epoch 0 |
| runner.model_save_path output_model_dlrm |
| runner.print_interval 100 |
| runner.split_file_list False |
| runner.sync_mode async |
| runner.test_data_dir ../../../datasets/criteo/slot_test_data_full |
| runner.thread_num 1 |
| runner.use_gpu False |
"When training, we now always track global mean and variance.")
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: PSERVER --
INFO:main:Run Server Begin
I1122 08:57:55.800235 10444 brpc_ps_server.cc:65] running server with rank id: 0, endpoint: 127.0.0.1:36520
C++ Traceback (most recent call last):
0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()
1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)
2 paddle::distributed::SAdam::update(unsigned long const*, float const*, unsigned long, std::vector<unsigned long, std::allocator
Error Message Summary:
FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: *** Aborted at 1637571537 (unix time) try "date -d @1637571537" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x0) received by PID 10444 (TID 0x7f345eff5700) from PID 0 ***]
worker.0
INFO:utils.static_ps.reader_helper:File: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/../../../datasets/criteo/slot_train_data_full/part-130 has 200000 examples
INFO:utils.static_ps.reader_helper:File: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/../../../datasets/criteo/slot_train_data_full/part-160 has 200000 examples
INFO:utils.static_ps.reader_helper:Total example: 44000000
/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:341: UserWarning: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/net.py:103
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: TRAINER --
INFO:main:Run Worker Begin
INFO:main:Epoch: 0, Running RecDatast Begin.
INFO:main:Epoch: 0, Batch_id: 0, cost: [0.8848212], auc: [0.50078756], avg_reader_cost: 0.00240 sec, avg_batch_cost: 0.00396 sec, avg_samples: 20.48000, ips: 5166.07934 example/sec
INFO:main:Epoch: 0, Batch_id: 100, cost: [0.5674137], auc: [0.64664944], avg_reader_cost: 0.17707 sec, avg_batch_cost: 0.24327 sec, avg_samples: 2048.00000, ips: 8418.67772 example/sec
W1122 08:58:57.628669 11033 input_messenger.cpp:222] Fail to read from fd=19 [email protected]:51669@46133: Connection reset by peer [104]
W1122 08:58:57.628697 11045 input_messenger.cpp:222] Fail to read from fd=21 [email protected]:51669@46134: Connection reset by peer [104]
W1122 08:58:57.628742 11044 input_messenger.cpp:222] Fail to read from fd=8 [email protected]:51669@46073: Connection reset by peer [104]
I1122 08:58:57.729051 11016 socket.cpp:2370] Checking [email protected]:51669
W1122 08:58:57.759953 11007 input_messenger.cpp:222] Fail to read from fd=13 [email protected]:36520@61550: Connection reset by peer [104]
W1122 08:58:57.759985 11033 input_messenger.cpp:222] Fail to read from fd=17 [email protected]:36520@61551: Connection reset by peer [104]
E1122 08:58:57.760069 11010 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E104]Fail to read from fd=13 [email protected]:36520@61550: Connection reset by peer [R1][E111]Fail to connect [email protected]:36520: Connection refused [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760084 11029 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E1014]Got EOF of fd=12 [email protected]:36520@61423 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760102 11044 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E1014]Got EOF of fd=10 [email protected]:36520@61519 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760108 11074 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760123 11029 brpc_ps_client.cc:194] resquest cmd_id:3 failed, err:[E1014]Got EOF of fd=14 [email protected]:36520@61425 [R1][E112]Not connected to 127.0.0.1:36520 yet [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
E1122 08:58:57.760124 11073 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760130 10453 fleet.cc:296] fleet pull sparse failed, status[-1]
E1122 08:58:57.760159 11045 brpc_ps_client.cc:194] resquest cmd_id:2 failed, err:[E104]Fail to read from fd=17 [email protected]:36520@61551: Connection reset by peer [R1][E111]Fail to connect [email protected]:36520@46133: Connection refused [R2][E112]Not connected to 127.0.0.1:36520 yet [R3][E112]Not connected to 127.0.0.1:36520 yet
It stopped after the second logged batch.
From the error, the worker failed to connect to the server. Check whether the other servers also printed a line like "running server with rank id: 0, endpoint: 127.0.0.1:36520".
The worker's error is that it could not connect to the server, while the server's error is a core dump, so the root cause should still be the server core, right?
This is the other server's log:
cat serverlog.1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
+=======================================================================================+
| PaddleRec Benchmark Envs Value |
+---------------------------------------------------------------------------------------+
| config_abs_dir ... /docker/docker/PaddleRec/models/rank/dlrm|
| hyper_parameters.bot_layer_sizes [512, 256, 64, 16] |
| hyper_parameters.dense_input_dim 13 |
| hyper_parameters.num_field 26 |
| hyper_parameters.optimizer.class SGD |
|hyper_parameters.optimizer.learning_rate 0.1 |
| hyper_parameters.optimizer.strategy async |
| hyper_parameters.sparse_feature_dim 16 |
| hyper_parameters.sparse_feature_number 1000001 |
| hyper_parameters.sparse_inputs_slots 27 |
| hyper_parameters.top_layer_sizes [512, 256, 2] |
| runner.epochs 1 |
| runner.infer_batch_size 2048 |
| runner.infer_end_epoch 1 |
| runner.infer_load_path output_model_dlrm |
| runner.infer_reader_path criteo_reader |
| runner.infer_start_epoch 0 |
| runner.model_save_path output_model_dlrm |
| runner.print_interval 100 |
| runner.split_file_list False |
| runner.sync_mode async |
| runner.test_data_dir ../../../datasets/criteo/slot_test_data_full |
| runner.thread_num 1 |
| runner.train_batch_size 2048 |
| runner.train_data_dir ... ./../datasets/criteo/slot_train_data_full|
| runner.train_reader_path criteo_reader |
| runner.use_auc True |
| runner.use_gpu False |
| yaml_path models/rank/dlrm/config_bigdata.yaml |
+=======================================================================================+
/usr/local/lib/python3.7/dist-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")
/usr/local/lib/python3.7/dist-packages/paddle/fluid/layers/math_op_patch.py:341: UserWarning: /data/lijiajieli/docker/docker/PaddleRec/models/rank/dlrm/net.py:103
The behavior of expression A + B has been unified with elementwise_add(X, Y, axis=-1) from Paddle 2.0. If your code works well in the older versions but crashes in this version, try to use elementwise_add(X, Y, axis=0) instead of A + B. This transitional warning will be dropped in the future.
op_type, op_type, EXPRESSION_MAP[method_name]))
/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py:744: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
elif dtype == np.bool:
INFO:main:cpu_num: 4
INFO:common:-- Role: PSERVER --
INFO:main:Run Server Begin
I1122 08:57:55.766316 10447 brpc_ps_server.cc:65] running server with rank id: 1, endpoint: 127.0.0.1:51669
C++ Traceback (most recent call last):
0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1}> > >::_M_run()
1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool)
2 paddle::distributed::SAdam::update(unsigned long const*, float const*, unsigned long, std::vector<unsigned long, std::allocator
Error Message Summary:
FatalError: Segmentation fault is detected by the operating system.
[TimeInfo: *** Aborted at 1637571537 (unix time) try "date -d @1637571537" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x0) received by PID 10447 (TID 0x7efd44630700) from PID 0 ***]
Please confirm once more: did you modify any PaddleRec code, for example adding a padding_idx argument to the embedding in the network definition, or changing the padding value in the data-processing script?
I re-downloaded the latest image and the problem is solved. The image I used before was downloaded a month ago, so some versions may have been incompatible. Thanks.
One more question: I want to use fleetrun to launch multiple nodes on one machine, with the nodes running in separate containers. Is that supported? Currently fleetrun launches all the nodes needed for the current IP at once; is there a way to make one invocation of the command launch only a single node?
That is not supported yet.