Multi-GPU training hangs
Search before asking
- [X] I have searched the question and found no related answer.
Please ask your question
```
λ 3cbd864a9187 /home/PaddleSeg export CUDA_VISIBLE_DEVICES=0,1
λ 3cbd864a9187 /home/PaddleSeg python -m paddle.distributed.launch train.py \
    --config configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml \
    --do_eval \
    --use_vdl \
    --save_interval 500 \
    --save_dir output
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
----------- Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: train.py
training_script_args: ['--config', 'configs/quick_start/pp_liteseg_optic_disc_512x512_1k.yml', '--do_eval', '--use_vdl', '--save_interval', '500', '--save_dir', 'output']
worker_num: None
workers:
WARNING 2024-01-29 03:58:17,377 launch.py:423] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2024-01-29 03:58:17,378 launch_utils.py:528] Local start 2 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs                Value                                                 |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID               0                                                     |
| PADDLE_CURRENT_ENDPOINT         127.0.0.1:43867                                       |
| PADDLE_TRAINERS_NUM             2                                                     |
| PADDLE_TRAINER_ENDPOINTS        127.0.0.1:43867,127.0.0.1:39123                       |
| PADDLE_RANK_IN_NODE             0                                                     |
| PADDLE_LOCAL_DEVICE_IDS         0                                                     |
| PADDLE_WORLD_DEVICE_IDS         0,1                                                   |
| FLAGS_selected_gpus             0                                                     |
| FLAGS_selected_accelerators     0                                                     |
+=======================================================================================+
INFO 2024-01-29 03:58:17,378 launch_utils.py:532] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:95 idx:0
launch proc_id:100 idx:1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
2024-01-29 03:58:19 [INFO] ------------Environment Information-------------
platform: Linux-5.15.0-92-generic-x86_64-with-debian-stretch-sid
Python: 3.7.0 (default, Jan 19 2022, 18:52:27) [GCC 8.2.0]
Paddle compiled with cuda: True
NVCC: Cuda compilation tools, release 10.2, V10.2.89
cudnn: 7.6
GPUs used: 2
CUDA_VISIBLE_DEVICES: 0,1
GPU: ['GPU 0: NVIDIA GeForce', 'GPU 1: NVIDIA GeForce']
GCC: gcc (GCC) 8.2.0
PaddleSeg: 2.5.0
PaddlePaddle: 2.2.2
OpenCV: 4.5.5
2024-01-29 03:58:19 [INFO] ---------------Config Information---------------
batch_size: 4
iters: 1000
loss:
  coef:
  - 1
  - 1
  - 1
  types:
  - ignore_index: 255
    type: CrossEntropyLoss
lr_scheduler:
  end_lr: 0
  learning_rate: 0.01
  power: 0.9
  type: PolynomialDecay
model:
  backbone:
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/PP_STDCNet1.tar.gz
    type: STDC1
  type: PPLiteSeg
optimizer:
  momentum: 0.9
  type: sgd
  weight_decay: 4.0e-05
train_dataset:
  dataset_root: data/optic_disc_seg
  mode: train
  transforms:
  - target_size:
    - 512
    - 512
    type: Resize
  - type: RandomHorizontalFlip
  - type: Normalize
  type: OpticDiscSeg
val_dataset:
  dataset_root: data/optic_disc_seg
  mode: val
  transforms:
  - type: Normalize
  type: OpticDiscSeg
W0129 03:58:19.515584 95 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.2, Runtime API Version: 10.2
W0129 03:58:19.515611 95 device_context.cc:465] device: 0, cuDNN Version: 7.6.
```
I am running multi-GPU training inside a container. It prints the log above and then produces no further output, while the python process sits at 100% CPU usage.
After a while, it reports this error:
Traceback (most recent call last):
File "train.py", line 230, in
INFO 2024-01-29 04:02:26,684 launch_utils.py:320] terminate process group gid:100
INFO 2024-01-29 04:02:30,687 launch_utils.py:341] terminate all the procs
ERROR 2024-01-29 04:02:30,687 launch_utils.py:604] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2024-01-29 04:02:34,691 launch_utils.py:341] terminate all the procs
INFO 2024-01-29 04:02:34,692 launch.py:311] Local processes completed.
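
The launcher asks to check the log of the aborted rank, and the traceback pasted here is cut off. As a minimal sketch (standard library only), the tail of `log/workerlog.0`, the path named in the launcher output, could be printed like this. The helper name `tail` and the default of 50 lines are illustrative choices, not part of PaddleSeg or Paddle.

```python
from collections import deque
from pathlib import Path


def tail(path, n=50):
    """Return the last n lines of a text file, or an empty list if it is missing."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open("r", errors="replace") as f:
        # deque with maxlen keeps only the most recent n lines in memory.
        return list(deque((line.rstrip("\n") for line in f), maxlen=n))


if __name__ == "__main__":
    # log/workerlog.0 is the rank-0 log path mentioned by launch_utils.py.
    for line in tail("log/workerlog.0", n=50):
        print(line)
```

Run from `/home/PaddleSeg`, this should show the complete traceback for rank 0, which is usually more informative than the launcher summary.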
Hello, there seems to be a problem with your GPU configuration. Are both CUDA and Paddle installed? I could not find your GPU in the log.
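
One configuration detail that can be read from the posted log is the `device_context.cc` line, which reports the GPU compute capability next to the driver and runtime API versions. The sketch below (standard library only) extracts those fields and compares the runtime version against a small lookup table. The table is an assumption added for illustration: it encodes that CUDA 11.0 was the first toolkit to target compute capability 8.0 and CUDA 11.1 the first to target 8.6. The function and dictionary names are hypothetical and are not something Paddle provides.

```python
import re

# Assumption: first CUDA toolkit release that targets each compute capability.
# Only two Ampere entries are listed here; extend the table as needed.
MIN_CUDA_FOR_CAPABILITY = {(8, 0): (11, 0), (8, 6): (11, 1)}

DEVICE_LINE = re.compile(
    r"GPU Compute Capability: (\d+)\.(\d+), "
    r"Driver API Version: (\d+)\.(\d+), "
    r"Runtime API Version: (\d+)\.(\d+)"
)


def check_device_line(line):
    """Parse a device_context.cc log line; return None if it does not match."""
    m = DEVICE_LINE.search(line)
    if m is None:
        return None
    cc_major, cc_minor, drv_major, drv_minor, rt_major, rt_minor = map(int, m.groups())
    capability = (cc_major, cc_minor)
    runtime = (rt_major, rt_minor)
    required = MIN_CUDA_FOR_CAPABILITY.get(capability)
    return {
        "capability": capability,
        "driver": (drv_major, drv_minor),
        "runtime": runtime,
        "required": required,
        # True only when the table has an entry and the runtime is older than it.
        "runtime_too_old": required is not None and runtime < required,
    }


if __name__ == "__main__":
    line = (
        "W0129 03:58:19.515584 95 device_context.cc:447] Please NOTE: device: 0, "
        "GPU Compute Capability: 8.6, Driver API Version: 12.2, Runtime API Version: 10.2"
    )
    print(check_device_line(line))
```

If `runtime_too_old` comes back true, it may be worth trying a Paddle build compiled against a newer CUDA runtime before investigating anything else. This is a hint under the stated assumption, not a confirmed diagnosis.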