PaddleSeg

Multi-GPU training hangs

Open KellyGodLv opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I have searched the question and found no related answer.

Please ask your question

```
λ 3cbd864a9187 /home/PaddleSeg export CUDA_VISIBLE_DEVICES=0,1
λ 3cbd864a9187 /home/PaddleSeg python -m paddle.distributed.launch train.py \
       --config configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml \
       --do_eval \
       --use_vdl \
       --save_interval 500 \
       --save_dir output
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
-----------  Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: train.py
training_script_args: ['--config', 'configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml', '--do_eval', '--use_vdl', '--save_interval', '500', '--save_dir', 'output']
worker_num: None
workers:
WARNING 2024-01-29 03:58:17,377 launch.py:423] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2024-01-29 03:58:17,378 launch_utils.py:528] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                     Distributed Envs                        Value                     |
    +---------------------------------------------------------------------------------------+
    |                     PADDLE_TRAINER_ID                       0                          |
    |                     PADDLE_CURRENT_ENDPOINT                 127.0.0.1:43867            |
    |                     PADDLE_TRAINERS_NUM                     2                          |
    |                     PADDLE_TRAINER_ENDPOINTS                127.0.0.1:43867,127.0.0.1:39123 |
    |                     PADDLE_RANK_IN_NODE                     0                          |
    |                     PADDLE_LOCAL_DEVICE_IDS                 0                          |
    |                     PADDLE_WORLD_DEVICE_IDS                 0,1                        |
    |                     FLAGS_selected_gpus                     0                          |
    |                     FLAGS_selected_accelerators             0                          |
    +=======================================================================================+
INFO 2024-01-29 03:58:17,378 launch_utils.py:532] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:95 idx:0
launch proc_id:100 idx:1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
2024-01-29 03:58:19 [INFO]
------------Environment Information-------------
platform: Linux-5.15.0-92-generic-x86_64-with-debian-stretch-sid
Python: 3.7.0 (default, Jan 19 2022, 18:52:27) [GCC 8.2.0]
Paddle compiled with cuda: True
NVCC: Cuda compilation tools, release 10.2, V10.2.89
cudnn: 7.6
GPUs used: 2
CUDA_VISIBLE_DEVICES: 0,1
GPU: ['GPU 0: NVIDIA GeForce', 'GPU 1: NVIDIA GeForce']
GCC: gcc (GCC) 8.2.0
PaddleSeg: 2.5.0
PaddlePaddle: 2.2.2
OpenCV: 4.5.5
2024-01-29 03:58:19 [INFO]
---------------Config Information---------------
batch_size: 4
iters: 1000
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - ignore_index: 255
    type: CrossEntropyLoss
lr_scheduler:
  end_lr: 0
  learning_rate: 0.01
  power: 0.9
  type: PolynomialDecay
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet1.tar.gz
    type: STDC1
  type: PPLiteSeg
optimizer:
  momentum: 0.9
  type: sgd
  weight_decay: 4.0e-05
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  transforms:
  - target_size:
    - 512
    - 512
    type: Resize
  - type: RandomHorizontalFlip
  - type: Normalize
  type: OpticDiscSeg
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  transforms:
  - type: Normalize
  type: OpticDiscSeg
W0129 03:58:19.515584    95 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.2, Runtime API Version: 10.2
W0129 03:58:19.515611    95 device_context.cc:465] device: 0, cuDNN Version: 7.6.
```

I am running multi-GPU training inside a container. It prints the log above and then produces no further output; the Python processes sit at 100% CPU.
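Before going back through the launcher, it can help to confirm inside the same container that the installed Paddle wheel can actually see and drive both GPUs. A minimal sketch, assuming a standard PaddlePaddle 2.2.x install; the calls below are ordinary Paddle utility APIs:

```python
# check_gpu.py: quick sanity check run inside the container, outside the launcher.
import paddle

print("compiled with CUDA:", paddle.is_compiled_with_cuda())
print("visible GPUs:", paddle.device.cuda.device_count())

# run_check() builds a small layer on the GPU(s); if the wheel has no kernels
# for this GPU architecture it fails with a clear error instead of hanging.
paddle.utils.run_check()

# A random tensor created on the GPU, similar to the parameter initialization
# that model construction performs when the backbone is built.
paddle.set_device("gpu:0")
x = paddle.randn([2, 3])
print("randn on gpu:0 OK:", x.shape)
```

If this already fails or hangs, the problem lies in the Paddle/CUDA installation rather than in PaddleSeg or the distributed launcher.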

KellyGodLv · Jan 29 '24 04:01

After a while it reports this error:

```
Traceback (most recent call last):
  File "train.py", line 230, in <module>
    main(args)
  File "train.py", line 204, in main
    cfg._model = paddle.nn.SyncBatchNorm.convert_sync_batchnorm(cfg.model)
  File "/home/PaddleSeg/paddleseg/cvlibs/config.py", line 338, in model
    self._model = self._load_object(model_cfg)
  File "/home/PaddleSeg/paddleseg/cvlibs/config.py", line 396, in _load_object
    params[key] = self._load_object(val)
  File "/home/PaddleSeg/paddleseg/cvlibs/config.py", line 405, in _load_object
    return component(**params)
  File "/home/PaddleSeg/paddleseg/models/backbones/stdcnet.py", line 284, in STDC1
    model = STDCNet(base=64, layers=[2, 2, 2], **kwargs)
  File "/home/PaddleSeg/paddleseg/models/backbones/stdcnet.py", line 62, in __init__
    self.features = self._make_layers(base, layers, block_num, block)
  File "/home/PaddleSeg/paddleseg/models/backbones/stdcnet.py", line 99, in _make_layers
    features += [ConvBNRelu(3, base // 2, 3, 2)]
  File "/home/PaddleSeg/paddleseg/models/backbones/stdcnet.py", line 137, in __init__
    bias_attr=False)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 656, in __init__
    data_format=data_format)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 135, in __init__
    default_initializer=_get_default_param_initializer())
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 422, in create_parameter
    default_initializer)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/layer_helper_base.py", line 378, in create_parameter
    **attr._to_kwargs(with_initializer=True))
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/framework.py", line 3137, in create_parameter
    initializer(param, self)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 362, in __call__
    stop_gradient=True)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/framework.py", line 3167, in append_op
    kwargs.get("stop_gradient", False))
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/tracer.py", line 45, in trace_op
    not stop_gradient)
SystemError: (Fatal) Operator gaussian_random raises an thrust::system::system_error exception.
The exception content is :parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device. (at /paddle/paddle/fluid/imperative/tracer.cc:221)
```

```
INFO 2024-01-29 04:02:26,684 launch_utils.py:320] terminate process group gid:100
INFO 2024-01-29 04:02:30,687 launch_utils.py:341] terminate all the procs
ERROR 2024-01-29 04:02:30,687 launch_utils.py:604] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2024-01-29 04:02:34,691 launch_utils.py:341] terminate all the procs
INFO 2024-01-29 04:02:34,692 launch.py:311] Local processes completed.
```

KellyGodLv · Jan 29 '24 04:01

Hello, there seems to be a problem with your GPU setup. Are CUDA and Paddle both installed correctly? Your GPU does not show up properly in the log.
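As a concrete check along these lines, cudaErrorNoKernelImageForDevice usually means the installed paddlepaddle-gpu wheel was not built with kernels for the GPU's architecture (the log above reports compute capability 8.6 against a CUDA 10.2 runtime). A minimal sketch for comparing the two follows; `get_device_capability` is an assumed API for newer Paddle releases, and on 2.2.x `nvidia-smi --query-gpu=compute_cap --format=csv` reports the same value:

```python
# Compare what the installed Paddle wheel was built against with the actual GPUs.
import paddle

print("Paddle version:", paddle.__version__)
print("built with CUDA:", paddle.version.cuda())    # e.g. '10.2'
print("built with cuDNN:", paddle.version.cudnn())

# Compute capability per visible GPU. NOTE: assumed to exist in newer Paddle
# releases; on 2.2.x query it with nvidia-smi instead.
for i in range(paddle.device.cuda.device_count()):
    print(f"GPU {i} compute capability:",
          paddle.device.cuda.get_device_capability(i))
```

A CUDA 10.2 build cannot include SM 8.6 kernels, so installing a paddlepaddle-gpu wheel built for CUDA 11.x (matching the GPU generation) is the usual remedy for this error.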

shiyutang · Feb 05 '24 11:02