PaddleClas
Pod failed, Container failed
An error occurs when training the model:
python3 -m paddle.distributed.launch \
--gpus="0,1,2,3" \
tools/train.py \
-c ppcls/configs/ImageNet/PPLCNetV2/PPLCNetV2_base.yaml
LAUNCH INFO 2025-05-14 10:11:15,889 ----------- Configuration ----------------------
LAUNCH INFO 2025-05-14 10:11:15,889 auto_parallel_config: None
LAUNCH INFO 2025-05-14 10:11:15,889 auto_tuner_json: None
LAUNCH INFO 2025-05-14 10:11:15,889 devices: 0
LAUNCH INFO 2025-05-14 10:11:15,890 elastic_level: -1
LAUNCH INFO 2025-05-14 10:11:15,890 elastic_timeout: 30
LAUNCH INFO 2025-05-14 10:11:15,890 enable_gpu_log: True
LAUNCH INFO 2025-05-14 10:11:15,891 gloo_port: 6767
LAUNCH INFO 2025-05-14 10:11:15,891 host: None
LAUNCH INFO 2025-05-14 10:11:15,891 ips: None
LAUNCH INFO 2025-05-14 10:11:15,891 job_id: default
LAUNCH INFO 2025-05-14 10:11:15,891 legacy: False
LAUNCH INFO 2025-05-14 10:11:15,891 log_dir: log
LAUNCH INFO 2025-05-14 10:11:15,891 log_level: INFO
LAUNCH INFO 2025-05-14 10:11:15,891 log_overwrite: False
LAUNCH INFO 2025-05-14 10:11:15,891 master: None
LAUNCH INFO 2025-05-14 10:11:15,891 max_restart: 3
LAUNCH INFO 2025-05-14 10:11:15,891 nnodes: 1
LAUNCH INFO 2025-05-14 10:11:15,891 nproc_per_node: None
LAUNCH INFO 2025-05-14 10:11:15,891 rank: -1
LAUNCH INFO 2025-05-14 10:11:15,891 run_mode: collective
LAUNCH INFO 2025-05-14 10:11:15,891 server_num: None
LAUNCH INFO 2025-05-14 10:11:15,891 servers:
LAUNCH INFO 2025-05-14 10:11:15,891 sort_ip: False
LAUNCH INFO 2025-05-14 10:11:15,891 start_port: 6070
LAUNCH INFO 2025-05-14 10:11:15,892 trainer_num: None
LAUNCH INFO 2025-05-14 10:11:15,892 trainers:
LAUNCH INFO 2025-05-14 10:11:15,892 training_script: tools/train.py
LAUNCH INFO 2025-05-14 10:11:15,892 training_script_args: ['-c', 'ppcls/configs/ImageNet/PPLCNetV2/PPLCNetV2_base.yaml']
LAUNCH INFO 2025-05-14 10:11:15,892 with_gloo: 1
LAUNCH INFO 2025-05-14 10:11:15,893 --------------------------------------------------
LAUNCH INFO 2025-05-14 10:11:15,894 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2025-05-14 10:11:15,894 Run Pod: ellnfi, replicas 1, status ready
LAUNCH INFO 2025-05-14 10:11:15,899 Watching Pod: ellnfi, replicas 1, status running
LAUNCH WARNING 2025-05-14 10:11:15,988 save gpu info failed
[2025/05/14 10:11:18] ppcls INFO:
===========================================================
== PaddleClas is powered by PaddlePaddle ! ==
===========================================================
== ==
== For more info please go to the following website. ==
== ==
== https://github.com/PaddlePaddle/PaddleClas ==
===========================================================
[2025/05/14 10:11:18] ppcls INFO: Global :
[2025/05/14 10:11:18] ppcls INFO: checkpoints : None
[2025/05/14 10:11:18] ppcls INFO: pretrained_model : None
[2025/05/14 10:11:18] ppcls INFO: output_dir : I:/output_dir/
[2025/05/14 10:11:18] ppcls INFO: device : gpu
[2025/05/14 10:11:18] ppcls INFO: save_interval : 1
[2025/05/14 10:11:18] ppcls INFO: eval_during_train : True
[2025/05/14 10:11:18] ppcls INFO: eval_interval : 1
[2025/05/14 10:11:18] ppcls INFO: epochs : 480
[2025/05/14 10:11:18] ppcls INFO: print_batch_step : 10
[2025/05/14 10:11:18] ppcls INFO: use_visualdl : False
[2025/05/14 10:11:18] ppcls INFO: image_shape : [3, 512, 512]
[2025/05/14 10:11:18] ppcls INFO: save_inference_dir : D:/save_inference_dir
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: AMP :
[2025/05/14 10:11:18] ppcls INFO: use_amp : False
[2025/05/14 10:11:18] ppcls INFO: use_fp16_test : False
[2025/05/14 10:11:18] ppcls INFO: scale_loss : 128.0
[2025/05/14 10:11:18] ppcls INFO: use_dynamic_loss_scaling : True
[2025/05/14 10:11:18] ppcls INFO: use_promote : False
[2025/05/14 10:11:18] ppcls INFO: level : O1
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Arch :
[2025/05/14 10:11:18] ppcls INFO: name : PPLCNetV2_base
[2025/05/14 10:11:18] ppcls INFO: class_num : 1000
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Loss :
[2025/05/14 10:11:18] ppcls INFO: Train :
[2025/05/14 10:11:18] ppcls INFO: CELoss :
[2025/05/14 10:11:18] ppcls INFO: weight : 1.0
[2025/05/14 10:11:18] ppcls INFO: epsilon : 0.1
[2025/05/14 10:11:18] ppcls INFO: Eval :
[2025/05/14 10:11:18] ppcls INFO: CELoss :
[2025/05/14 10:11:18] ppcls INFO: weight : 1.0
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Optimizer :
[2025/05/14 10:11:18] ppcls INFO: name : Momentum
[2025/05/14 10:11:18] ppcls INFO: momentum : 0.9
[2025/05/14 10:11:18] ppcls INFO: lr :
[2025/05/14 10:11:18] ppcls INFO: name : Cosine
[2025/05/14 10:11:18] ppcls INFO: learning_rate : 0.8
[2025/05/14 10:11:18] ppcls INFO: warmup_epoch : 5
[2025/05/14 10:11:18] ppcls INFO: regularizer :
[2025/05/14 10:11:18] ppcls INFO: name : L2
[2025/05/14 10:11:18] ppcls INFO: coeff : 4e-05
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: DataLoader :
[2025/05/14 10:11:18] ppcls INFO: Train :
[2025/05/14 10:11:18] ppcls INFO: dataset :
[2025/05/14 10:11:18] ppcls INFO: name : MultiScaleDataset
[2025/05/14 10:11:18] ppcls INFO: image_root : D:/AI/
[2025/05/14 10:11:18] ppcls INFO: cls_label_path : D:/AI/train_list.txt
[2025/05/14 10:11:18] ppcls INFO: transform_ops :
[2025/05/14 10:11:18] ppcls INFO: DecodeImage :
[2025/05/14 10:11:18] ppcls INFO: to_rgb : True
[2025/05/14 10:11:18] ppcls INFO: channel_first : False
[2025/05/14 10:11:18] ppcls INFO: RandCropImage :
[2025/05/14 10:11:18] ppcls INFO: size : 512
[2025/05/14 10:11:18] ppcls INFO: RandFlipImage :
[2025/05/14 10:11:18] ppcls INFO: flip_code : 1
[2025/05/14 10:11:18] ppcls INFO: NormalizeImage :
[2025/05/14 10:11:18] ppcls INFO: scale : 1.0/255.0
[2025/05/14 10:11:18] ppcls INFO: mean : [0.485, 0.456, 0.406]
[2025/05/14 10:11:18] ppcls INFO: std : [0.229, 0.224, 0.225]
[2025/05/14 10:11:18] ppcls INFO: order :
[2025/05/14 10:11:18] ppcls INFO: sampler :
[2025/05/14 10:11:18] ppcls INFO: name : MultiScaleSampler
[2025/05/14 10:11:18] ppcls INFO: scales : [160, 192, 224, 288, 320]
[2025/05/14 10:11:18] ppcls INFO: first_bs : 500
[2025/05/14 10:11:18] ppcls INFO: divided_factor : 32
[2025/05/14 10:11:18] ppcls INFO: is_training : True
[2025/05/14 10:11:18] ppcls INFO: loader :
[2025/05/14 10:11:18] ppcls INFO: num_workers : 4
[2025/05/14 10:11:18] ppcls INFO: use_shared_memory : True
[2025/05/14 10:11:18] ppcls INFO: Eval :
[2025/05/14 10:11:18] ppcls INFO: dataset :
[2025/05/14 10:11:18] ppcls INFO: name : ImageNetDataset
[2025/05/14 10:11:18] ppcls INFO: image_root : D:/AI/
[2025/05/14 10:11:18] ppcls INFO: cls_label_path : D:/AI/val_list.txt
[2025/05/14 10:11:18] ppcls INFO: transform_ops :
[2025/05/14 10:11:18] ppcls INFO: DecodeImage :
[2025/05/14 10:11:18] ppcls INFO: to_rgb : True
[2025/05/14 10:11:18] ppcls INFO: channel_first : False
[2025/05/14 10:11:18] ppcls INFO: ResizeImage :
[2025/05/14 10:11:18] ppcls INFO: resize_short : 512
[2025/05/14 10:11:18] ppcls INFO: CropImage :
[2025/05/14 10:11:18] ppcls INFO: size : 512
[2025/05/14 10:11:18] ppcls INFO: NormalizeImage :
[2025/05/14 10:11:18] ppcls INFO: scale : 1.0/255.0
[2025/05/14 10:11:18] ppcls INFO: mean : [0.485, 0.456, 0.406]
[2025/05/14 10:11:18] ppcls INFO: std : [0.229, 0.224, 0.225]
[2025/05/14 10:11:18] ppcls INFO: order :
[2025/05/14 10:11:18] ppcls INFO: sampler :
[2025/05/14 10:11:18] ppcls INFO: name : DistributedBatchSampler
[2025/05/14 10:11:18] ppcls INFO: batch_size : 64
[2025/05/14 10:11:18] ppcls INFO: drop_last : False
[2025/05/14 10:11:18] ppcls INFO: shuffle : False
[2025/05/14 10:11:18] ppcls INFO: loader :
[2025/05/14 10:11:18] ppcls INFO: num_workers : 4
[2025/05/14 10:11:18] ppcls INFO: use_shared_memory : True
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Infer :
[2025/05/14 10:11:18] ppcls INFO: infer_imgs : D:/AI/1.jpg
[2025/05/14 10:11:18] ppcls INFO: batch_size : 10
[2025/05/14 10:11:18] ppcls INFO: transforms :
[2025/05/14 10:11:18] ppcls INFO: DecodeImage :
[2025/05/14 10:11:18] ppcls INFO: to_rgb : True
[2025/05/14 10:11:18] ppcls INFO: channel_first : False
[2025/05/14 10:11:18] ppcls INFO: ResizeImage :
[2025/05/14 10:11:18] ppcls INFO: resize_short : 512
[2025/05/14 10:11:18] ppcls INFO: CropImage :
[2025/05/14 10:11:18] ppcls INFO: size : 512
[2025/05/14 10:11:18] ppcls INFO: NormalizeImage :
[2025/05/14 10:11:18] ppcls INFO: scale : 1.0/255.0
[2025/05/14 10:11:18] ppcls INFO: mean : [0.485, 0.456, 0.406]
[2025/05/14 10:11:18] ppcls INFO: std : [0.229, 0.224, 0.225]
[2025/05/14 10:11:18] ppcls INFO: order :
[2025/05/14 10:11:18] ppcls INFO: ToCHWImage : None
[2025/05/14 10:11:18] ppcls INFO: PostProcess :
[2025/05/14 10:11:18] ppcls INFO: name : Topk
[2025/05/14 10:11:18] ppcls INFO: topk : 5
[2025/05/14 10:11:18] ppcls INFO: class_id_map_file : D:/AI/label_list.txt
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Metric :
[2025/05/14 10:11:18] ppcls INFO: Train :
[2025/05/14 10:11:18] ppcls INFO: TopkAcc :
[2025/05/14 10:11:18] ppcls INFO: topk : [1, 5]
[2025/05/14 10:11:18] ppcls INFO: Eval :
[2025/05/14 10:11:18] ppcls INFO: TopkAcc :
[2025/05/14 10:11:18] ppcls INFO: topk : [1, 5]
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: profiler_options : None
[2025/05/14 10:11:18] ppcls INFO: train with paddle 2.6.2 and device Place(gpu:0)
Traceback (most recent call last):
File "I:\AI\PaddleClas\tools\train.py", line 52, in <module>
engine = Engine(config, mode="train")
File "I:\AI\PaddleClas\ppcls\engine\engine.py", line 140, in __init__
self.train_dataloader = build_dataloader(
File "I:\AI\PaddleClas\ppcls\data\__init__.py", line 116, in build_dataloader
dataset = eval(dataset_name)(**config_dataset)
File "I:\AI\PaddleClas\ppcls\data\dataloader\multi_scale_dataset.py", line 54, in __init__
self._load_anno()
File "I:\AI\PaddleClas\ppcls\data\dataloader\multi_scale_dataset.py", line 71, in _load_anno
assert os.path.exists(self.images[-1])
AssertionError
LAUNCH INFO 2025-05-14 10:11:18,934 Pod failed
LAUNCH ERROR 2025-05-14 10:11:18,934 Container failed !!!
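The traceback above ends at `assert os.path.exists(self.images[-1])` in `_load_anno`, which suggests that some entry in the training label list does not resolve to an existing file under `image_root`. A minimal sketch for finding such entries follows. It assumes each line of the list has the form `<relative image path> <label>` and that the loader takes the first space-separated field as the path; `find_missing_images` is a hypothetical helper, not part of PaddleClas.

```python
import os


def find_missing_images(image_root, label_path):
    """Return (line_number, full_path) pairs for label entries whose image is absent.

    Assumption: each non-empty line is "<relative image path> <label>",
    and the path is the first space-separated field.
    """
    missing = []
    with open(label_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            rel_path = line.split(" ")[0]
            full_path = os.path.join(image_root, rel_path)
            if not os.path.exists(full_path):
                missing.append((lineno, full_path))
    return missing


# Example call with the paths printed in the config dump above:
# for lineno, path in find_missing_images("D:/AI/", "D:/AI/train_list.txt"):
#     print(f"line {lineno}: {path} not found")
```

If this reports entries, likely causes include paths in the list that are not relative to `image_root`, or file names containing spaces, which would be cut off at the first space under the assumption above.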
Job PR-5089-cb62dc3 is done. Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-5089/cb62dc3/index.html
Anyone is encouraged to help triage and fix the 3 test failures, as I will not be able to triage them for at least two weeks.
Going to wait for the release of PyTorch 2.7.1, as commit 7f79222 should include an update to NCCL 2.26.5, which may resolve the core dumps occurring in all 3 test failures.
The error log at https://github.com/autogluon/autogluon/actions/runs/15449884138/job/43489311199?pr=5089 contained:
[2025-06-04T18:50:47.608Z] E RuntimeError: cuDNN version incompatibility: PyTorch was compiled against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.
Will see if I can figure out what to change, but I encourage anyone more familiar with the test setup to provide suggestions or additional commits.
@tonyhoo @Innixma Based on preliminary analysis, we should do one of the following. Either add something like
export LD_LIBRARY_PATH=$(python -c "import torch; print(torch._C._cuda_getLibPath())"):$LD_LIBRARY_PATH
before executing the PyTorch tests, or, if no non-PyTorch tests require CUDA, start from an image that does not have CUDA pre-installed, so that only the one bundled with PyTorch is used. A third option is to upgrade to a CUDA 12.6 based image. I think the better option is to use the CUDA bundled with PyTorch, which would also support future tests that use a version higher than the current default (for example, a new test against the CUDA 12.8 version bundled with the cu128 wheel of PyTorch).
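Before changing the image, it may help to confirm the error message's hypothesis that a conflicting cuDNN is visible through `LD_LIBRARY_PATH`. The sketch below lists the directories on a path string that contain `libcudnn` shared objects. `dirs_with_cudnn` is a hypothetical helper, and the `libcudnn*.so*` file name pattern is an assumption about how the libraries are named on the test image.

```python
import glob
import os


def dirs_with_cudnn(ld_library_path):
    """Return the entries of an LD_LIBRARY_PATH-style string that hold libcudnn files."""
    hits = []
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries from leading or trailing separators
        # glob.escape protects against special characters in the directory name
        pattern = os.path.join(glob.escape(entry), "libcudnn*.so*")
        if glob.glob(pattern):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # Inspect the current environment; an empty list means no cuDNN was found there.
    print(dirs_with_cudnn(os.environ.get("LD_LIBRARY_PATH", "")))
```

If a directory outside the PyTorch installation shows up here, that would be consistent with the runtime picking up cuDNN 9.1.0 instead of the bundled 9.5.1.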
@FireballDWF I am taking a look at the test failures and trying to reproduce them locally.
I've updated the Docker image to run natively on Torch 2.7.1. I've also fixed the test cases to account for numeric differences that show up on short training samples.
Job PR-5089-4e80901 is done. Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-5089/4e80901/index.html