PaddleRec
[Usage question] paddlecloud distributed training demo fails with an error
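Problem summary: Following the distributed_train.md tutorial, I submitted a paddlecloud training job. The demo and configuration are the same as described in the tutorial, using the Collective-mode configuration for a K8S cluster. The job ran without producing any output, and the logs contain errors.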
Job details
Installing paddle-rec (run.log shows the installation succeeded)
# before_hook.sh
pip install paddle-rec==1.8.5.1
pip uninstall -y paddlepaddle
python -m pip install paddlepaddle-gpu==1.8.5.post107 -i https://mirror.baidu.com/pypi/simple
Error message (workerlog)
# /env_run/logs/workerlog.0
...
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
import imp
TensorRT dynamic library (libnvinfer.so) that Paddle depends on is not configured correctly. (error code is libnvinfer.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if TensorRT is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure TensorRT dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX;PaddleRec: Runner collective_cluster Begin
PADDLEREC_CLUSTER_TYPE: K8S
PaddleRec run on device GPU: 0
Executor Mode: train
processor_register begin
Running CollectiveInstance.
Running CollectiveNetwork.
Traceback (most recent call last):
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainer.py", line 255, in run
self.context_process(self._context)
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainer.py", line 216, in context_process
self._status_processor[context['status']](context)
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainers/general_trainer.py", line 90, in network
network_class.build_network(context)
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainers/framework/network.py", line 392, in build_network
model._data_loader)
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/trainers/framework/dataset.py", line 67, in get_dataloader
"", dataset_name, context["config_yaml"], context)
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddlerec/core/utils/dataloader_instance.py", line 115, in slotdataloader_by_name
hidden_file_list=[], data_file_list=[], train_data_path=data_path)
TypeError: cannot unpack non-iterable NoneType object
Catch Exception:cannot unpack non-iterable NoneType object
--------------------------------
PaddleRec Error Message Summary:
--------------------------------
Exit PaddleRec. catch exception in precoss status: [network_pass], except: cannot unpack non-iterable NoneType object
TypeError
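The last frame fails while unpacking the return value of a call into several names. The traceback does not show why that value was None, so the following is only a minimal sketch of the Python behavior behind the message. `split_files_hypothetical` and its `mode` argument are invented for illustration and are not PaddleRec functions.

```python
def split_files_hypothetical(files, mode):
    """Hypothetical helper (not part of PaddleRec).

    Returns a (data_files, hidden_files) pair for the one mode it handles
    and implicitly returns None for every other mode.
    """
    if mode == "handled":
        return files, []
    # No return statement on this path, so the caller receives None.


def try_unpack(mode):
    """Unpack the helper's result the way a two-name assignment would."""
    try:
        data_files, hidden_files = split_files_hypothetical(["part-0"], mode)
        return "ok"
    except TypeError as exc:
        # On Python 3.7+, unpacking None raises the same message as in the log.
        return str(exc)


print(try_unpack("handled"))
print(try_unpack("unhandled"))
```

A function that returns a tuple on some branches and nothing on others is one common way to end up with this error; whether that is what happens here would need to be confirmed in the PaddleRec source.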
run.log output
selected_gpus:range(0, 1)
use_paddlecloud_flag:True
node_ips:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724,job-0bb5fab60b346c8c-trainer-1.e22e0c50-d2e3-11e9-b58f-a0369f713724
node_ip:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724
node_rank:0
num_nodes: 2
cluster:job_server:None pods:["rank:0 id:None addr:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724 port:None visible_gpu:[] trainers:['gpu:[0] endpoint:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724:35024 rank:0']", "rank:1 id:None addr:job-0bb5fab60b346c8c-trainer-1.e22e0c50-d2e3-11e9-b58f-a0369f713724 port:None visible_gpu:[] trainers:['gpu:[0] endpoint:job-0bb5fab60b346c8c-trainer-1.e22e0c50-d2e3-11e9-b58f-a0369f713724:35024 rank:1']"] job_stage_flag:None hdfs:None
pod:rank:0 id:None addr:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724 port:None visible_gpu:[] trainers:['gpu:[0] endpoint:job-0bb5fab60b346c8c-trainer-0.e22e0c50-d2e3-11e9-b58f-a0369f713724:35024 rank:0']
~/paddlejob/workspace
2020-11-11 12:02:00 [INFO] [/root/paddlejob/run.sh: 251] [start_user_end_hook_process] end_hook start ...
~/paddlejob/workspace/env_run ~/paddlejob/workspace
Run before_hook.sh ...
~/paddlejob/workspace
~/paddlejob/workspace ~/paddlejob/workspace
2020-11-11 12:02:00 [INFO] [/root/paddlejob/tools/end_hook.sh: 14] [start_umount_afs] starting umount afs
no need umount afs
2020-11-11 12:02:00 [INFO] [/root/paddlejob/tools/end_hook.sh: 21] [start_umount_afs] finished umount afs
2020-11-11 12:02:00 [INFO] [/root/paddlejob/tools/end_hook.sh: 30] [data_clean] data_clear start ...
~/paddlejob/workspace
2020-11-11 12:02:00 [INFO] [/root/paddlejob/run.sh: 554] [taks_allreduce_mode] trainer successed.
k8s job finished
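Currently, the K8S Collective mode only supports single-machine multi-GPU training; multi-machine multi-GPU training is still under development.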
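Since the run.log above lists two trainer hosts under `node_ips`, a job can be checked for this limitation from its log. The sketch below is only an illustration: `count_nodes` is a hypothetical helper, and it assumes the `node_ips:` entry sits on a single line as a comma-separated list, which is how it appears in this run.log.

```python
def count_nodes(run_log_text):
    """Return the number of hosts on the `node_ips:` line, or None if absent.

    Hypothetical helper; assumes the entry is one comma-separated line.
    """
    prefix = "node_ips:"
    for line in run_log_text.splitlines():
        if line.startswith(prefix):
            hosts = line[len(prefix):].strip().split(",")
            return len([host for host in hosts if host])
    return None


# Shortened placeholder host names, not the real ones from the log.
sample = "selected_gpus:range(0, 1)\nnode_ips:trainer-0,trainer-1\nnode_rank:0"
nodes = count_nodes(sample)
if nodes is not None and nodes > 1:
    print("multi-machine job detected: %d nodes" % nodes)
```

A result greater than 1 indicates a multi-machine submission, which the reply above says the K8S Collective mode does not support yet.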