PaddleSeg
[Bug] OCRNet training fails
LAUNCH INFO 2022-07-22 08:36:49,341 ----------- Configuration ----------------------
LAUNCH INFO 2022-07-22 08:36:49,341 devices: None
LAUNCH INFO 2022-07-22 08:36:49,341 elastic_level: -1
LAUNCH INFO 2022-07-22 08:36:49,341 elastic_timeout: 30
LAUNCH INFO 2022-07-22 08:36:49,341 gloo_port: 6767
LAUNCH INFO 2022-07-22 08:36:49,341 host: None
LAUNCH INFO 2022-07-22 08:36:49,341 job_id: default
LAUNCH INFO 2022-07-22 08:36:49,341 legacy: False
LAUNCH INFO 2022-07-22 08:36:49,341 log_dir: log
LAUNCH INFO 2022-07-22 08:36:49,341 log_level: INFO
LAUNCH INFO 2022-07-22 08:36:49,341 master: None
LAUNCH INFO 2022-07-22 08:36:49,341 max_restart: 3
LAUNCH INFO 2022-07-22 08:36:49,341 nnodes: 1
LAUNCH INFO 2022-07-22 08:36:49,341 nproc_per_node: None
LAUNCH INFO 2022-07-22 08:36:49,341 rank: -1
LAUNCH INFO 2022-07-22 08:36:49,341 run_mode: collective
LAUNCH INFO 2022-07-22 08:36:49,341 server_num: None
LAUNCH INFO 2022-07-22 08:36:49,341 servers:
LAUNCH INFO 2022-07-22 08:36:49,341 trainer_num: None
LAUNCH INFO 2022-07-22 08:36:49,341 trainers:
LAUNCH INFO 2022-07-22 08:36:49,341 training_script: train.py
LAUNCH INFO 2022-07-22 08:36:49,341 training_script_args: ['--config', 'custom_config/OCRnet_teibiBF0722_seg_512x512.yml', '--do_eval', '--use_vdl', '--save_interval', '500', '--save_dir', 'OCRnet_teibiBF0722_seg_512x512']
LAUNCH INFO 2022-07-22 08:36:49,341 with_gloo: 0
LAUNCH INFO 2022-07-22 08:36:49,341 --------------------------------------------------
LAUNCH INFO 2022-07-22 08:36:49,347 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2022-07-22 08:36:49,347 Run Pod: yxabnn, replicas 2, status ready
LAUNCH INFO 2022-07-22 08:36:49,372 Watching Pod: yxabnn, replicas 2, status running
2022-07-22 08:36:51 [INFO]
------------Environment Information-------------
platform: Linux-5.4.0-121-generic-x86_64-with-debian-buster-sid
Python: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0]
Paddle compiled with cuda: True
NVCC: Build cuda_11.6.r11.6/compiler.30794723_0
cudnn: 8.4
GPUs used: 2
CUDA_VISIBLE_DEVICES: 0,1
GPU: ['GPU 0: NVIDIA GeForce', 'GPU 1: NVIDIA GeForce']
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PaddleSeg: 2.6.0
PaddlePaddle: 2.3.1
OpenCV: 4.5.5
------------------------------------------------
2022-07-22 08:36:51 [INFO]
---------------Config Information---------------
batch_size: 8
iters: 10000
loss:
  coef:
  - 1
  - 0.4
  types:
  - ignore_index: 255
    type: CrossEntropyLoss
  - ignore_index: 255
    type: CrossEntropyLoss
lr_scheduler:
  learning_rate: 0.01
  power: 0.9
  type: PolynomialDecay
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/hrnet_w18_ssld.tar.gz
    type: HRNet_W18
  backbone_indices:
  - 0
  type: OCRNet
optimizer:
  type: sgd
pretrained: null
train_dataset:
  dataset_root: data/teibiBF0722_seg
  mode: train
  num_classes: 2
  train_path: data/teibiBF0722_seg/train_list.txt
  transforms:
  - max_scale_factor: 1.25
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.5
    contrast_range: 0.5
    saturation_range: 0.5
    type: RandomDistort
  - type: Normalize
  type: Dataset
val_dataset:
  dataset_root: data/teibiBF0722_seg
  mode: val
  num_classes: 2
  transforms:
  - type: Normalize
  type: Dataset
  val_path: data/teibiBF0722_seg/val_list.txt
------------------------------------------------
W0722 08:36:51.392686 52161 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.6
W0722 08:36:51.392719 52161 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
2022-07-22 08:36:53 [INFO] Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/hrnet_w18_ssld.tar.gz
2022-07-22 08:36:55 [INFO] There are 1525/1525 variables loaded into HRNet.
server not ready, wait 3 sec to retry...
not ready endpoints:['127.0.1.1:45077']
I0722 08:36:59.477922 52161 nccl_context.cc:83] init nccl context nranks: 2 local rank: 0 gpu id: 0 ring id: 0
I0722 08:36:59.806169 52161 nccl_context.cc:115] init nccl context nranks: 2 local rank: 0 gpu id: 0 ring id: 10
2022-07-22 08:36:59,871-INFO: [topology.py:169:__init__] HybridParallelInfo: rank_id: 0, mp_degree: 1, sharding_degree: 1, pp_degree: 1, dp_degree: 2, mp_group: [0], sharding_group: [0], pp_group: [0], dp_group: [0, 1], check/clip group: [0]
/data/anaconda3/envs/paddle/bin/python: relocation error: /usr/local/cuda-11.6/targets/x86_64-linux/lib/libcublas.so: symbol cublasLtLegacyGemmSSS version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference
LAUNCH INFO 2022-07-22 08:37:06,392 Pod failed
LAUNCH ERROR 2022-07-22 08:37:06,392 Container failed !!!
Container rank 0 status failed cmd ['/data/anaconda3/envs/paddle/bin/python', '-u', 'train.py', '--config', 'custom_config/OCRnet_teibiBF0722_seg_512x512.yml', '--do_eval', '--use_vdl', '--save_interval', '500', '--save_dir', 'OCRnet_teibiBF0722_seg_512x512'] code 127 log log/default.yxabnn.0.log
env {'LESSOPEN': '| /usr/bin/lesspipe %s', 'CONDA_PROMPT_MODIFIER': '(paddle) ', 'USER': 'alit', 'LANGUAGE': 'zh_CN:zh', 'TEXTDOMAIN': 'im-config', 'XDG_SEAT': 'seat0', 'SSH_AGENT_PID': '5322', 'XDG_SESSION_TYPE': 'x11', 'LD_LIBRARY_PATH': '/data/anaconda3/envs/paddle/lib/python3.7/site-packages/cv2/../../lib64:/usr/loca/cuda/lib64::/data/backup/TensorRT-8.4.0.6/lib', 'SHLVL': '1', 'CONDA_SHLVL': '2', 'QT4_IM_MODULE': 'xim', 'HOME': '/home/alit', 'DESKTOP_SESSION': 'ubuntu', 'GNOME_SHELL_SESSION_MODE': 'ubuntu', 'GTK_MODULES': 'gail:atk-bridge', 'MANAGERPID': '3159', 'DBUS_STARTER_BUS_TYPE': 'session', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus,guid=528d7336038a0d9c1378e50562d6078a', 'COLORTERM': 'truecolor', '_CE_M': '', 'MANDATORY_PATH': '/usr/share/gconf/ubuntu.mandatory.path', 'IM_CONFIG_PHASE': '2', 'CUDA_VISIBLE_DEVICES': '0,1', 'LOGNAME': 'alit', 'GTK_IM_MODULE': 'ibus', 'JOURNAL_STREAM': '9:95233', '_': '/data/anaconda3/envs/paddle/bin/python', 'DEFAULTS_PATH': '/usr/share/gconf/ubuntu.default.path', 'USERNAME': 'alit', 'XDG_SESSION_ID': '1', 'TERM': 'xterm-256color', '_CE_CONDA': '', 'GNOME_DESKTOP_SESSION_ID': 'this-is-deprecated', 'WINDOWPATH': '1', 'PATH': '/usr/local/cuda/bin:/data/anaconda3/envs/paddle/bin:/data/anaconda3/condabin:/home/alit/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/alit/.local/share/JetBrains/Toolbox/scripts', 'INVOCATION_ID': '1970446f957e410a9165420418dfe57b', 'SESSION_MANAGER': 'local/alit-PowerEdge-T640:@/tmp/.ICE-unix/5147,unix/alit-PowerEdge-T640:/tmp/.ICE-unix/5147', 'XDG_MENU_PREFIX': 'gnome-', 'GNOME_TERMINAL_SCREEN': '/org/gnome/Terminal/screen/d64ee701_5fbd_4e76_9bcb_4b88593a5e6d', 'XDG_RUNTIME_DIR': '/run/user/1000', 'DISPLAY': ':0', 'LANG': 'zh_CN.UTF-8', 'CONDA_PREFIX_1': '/data/anaconda3', 'XDG_CURRENT_DESKTOP': 'ubuntu:GNOME', 'LS_COLORS': 
'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'XDG_SESSION_DESKTOP': 'ubuntu', 'XMODIFIERS': '@im=ibus', 'GNOME_TERMINAL_SERVICE': ':1.83', 'XAUTHORITY': '/run/user/1000/gdm/Xauthority', 'SSH_AUTH_SOCK': '/run/user/1000/keyring/ssh', 'CONDA_PYTHON_EXE': '/data/anaconda3/bin/python', 'SHELL': '/bin/bash', 'QT_ACCESSIBILITY': '1', 'GDMSESSION': 'ubuntu', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'CONDA_DEFAULT_ENV': 'paddle', 'TEXTDOMAINDIR': '/usr/share/locale/', 'GPG_AGENT_INFO': '/run/user/1000/gnupg/S.gpg-agent:0:1', 'XDG_VTNR': 
'1', 'QT_IM_MODULE': 'ibus', 'PWD': '/data/PaddleSeg-2.6', 'CUDA_HOME': '/usr/local/cuda', 'CLUTTER_IM_MODULE': 'xim', 'CONDA_EXE': '/data/anaconda3/bin/conda', 'XDG_DATA_DIRS': '/usr/share/ubuntu:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop', 'DBUS_STARTER_ADDRESS': 'unix:path=/run/user/1000/bus,guid=528d7336038a0d9c1378e50562d6078a', 'XDG_CONFIG_DIRS': '/etc/xdg/xdg-ubuntu:/etc/xdg', 'CONDA_PREFIX': '/data/anaconda3/envs/paddle', 'VTE_VERSION': '5202', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/data/anaconda3/envs/paddle/lib/python3.7/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/data/anaconda3/envs/paddle/lib/python3.7/site-packages/cv2/qt/fonts', 'PADDLE_MASTER': '127.0.1.1:48131', 'PADDLE_GLOBAL_SIZE': '2', 'PADDLE_LOCAL_SIZE': '2', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_TRAINER_ENDPOINTS': '127.0.1.1:33905,127.0.1.1:45077', 'PADDLE_CURRENT_ENDPOINT': '127.0.1.1:33905', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '2', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'}
LAUNCH INFO 2022-07-22 08:37:06,393 Exit code 127
It seems that the error is related to your local environment: "relocation error: /usr/local/cuda-11.6/targets/x86_64-linux/lib/libcublas.so: symbol cublasLtLegacyGemmSSS version libcublasLt.so.11 not defined in file libcublasLt.so.11 with link time reference LAUNCH INFO 2022-07-22 08:37:06,393 Exit code 127"
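One way to follow up on the local-environment hypothesis is to check the library search path that the failing process inherited. The env dump in the log shows an `LD_LIBRARY_PATH` entry `/usr/loca/cuda/lib64`, which looks like a misspelling of `/usr/local/cuda/lib64`; whether that is connected to the relocation error is not confirmed in this thread. The sketch below is only an illustration, and the helper name `missing_library_dirs` is hypothetical. It lists the `LD_LIBRARY_PATH` entries that are not existing directories.

```python
import os
from typing import Callable, List


def missing_library_dirs(
    ld_library_path: str,
    is_dir: Callable[[str], bool] = os.path.isdir,
) -> List[str]:
    """Return the non-empty LD_LIBRARY_PATH entries that are not existing directories."""
    # Entries are colon-separated; empty entries (from "::") are skipped here.
    entries = [entry for entry in ld_library_path.split(":") if entry]
    return [entry for entry in entries if not is_dir(entry)]


if __name__ == "__main__":
    # Inspect the search path of the current shell / conda environment.
    for path in missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", "")):
        print(f"not a directory: {path}")
```

Running it inside the same conda environment and shell used for training should show which entries the dynamic loader cannot use. A clean result does not rule out a mismatch between `libcublas.so` and `libcublasLt.so.11` loaded from different locations.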
My environment is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0
cuda_11.6.2_510.47.03_linux.run
cudnn-linux-x86_64-8.4.1.50_cuda11.6-archive
NVIDIA-Linux-x86_64-510.47.03
paddle 2.3.1 cuda11.6
All other models train fine; only this one raises the error.
Could you create a fresh conda environment, install the latest Paddle, and check whether the problem still occurs?
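After setting up a fresh environment, a quick sanity check before rerunning the full training job could look like the sketch below. It relies on `paddle.utils.run_check()`, Paddle's built-in installation check; the wrapper name `check_paddle_install` is hypothetical, and it is an assumption that this check exercises the same cuBLAS code path that fails during OCRNet training.

```python
import importlib.util


def check_paddle_install() -> bool:
    """Return True if Paddle imports and its built-in self-check finishes without raising."""
    if importlib.util.find_spec("paddle") is None:
        print("paddle is not installed in this environment")
        return False
    try:
        import paddle

        print("PaddlePaddle version:", paddle.__version__)
        # Built-in installation check shipped with Paddle.
        paddle.utils.run_check()
    except Exception as exc:  # report the failure instead of propagating it
        print("Paddle self-check failed:", exc)
        return False
    return True


if __name__ == "__main__":
    check_paddle_install()
```

A relocation error like the one in the log terminates the interpreter at the dynamic-linker level instead of raising a Python exception, so if this script exits abruptly with code 127, that itself indicates the library mismatch is still present.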