PaddleClas

Pod failed, Container failed

Open: monkeycc opened this issue 6 months ago • 1 comment

An error occurs when training the model.

python3 -m paddle.distributed.launch \
    --gpus="0,1,2,3" \
    tools/train.py \
        -c ppcls/configs/ImageNet/PPLCNetV2/PPLCNetV2_base.yaml
LAUNCH INFO 2025-05-14 10:11:15,889 -----------  Configuration  ----------------------
LAUNCH INFO 2025-05-14 10:11:15,889 auto_parallel_config: None
LAUNCH INFO 2025-05-14 10:11:15,889 auto_tuner_json: None
LAUNCH INFO 2025-05-14 10:11:15,889 devices: 0
LAUNCH INFO 2025-05-14 10:11:15,890 elastic_level: -1
LAUNCH INFO 2025-05-14 10:11:15,890 elastic_timeout: 30
LAUNCH INFO 2025-05-14 10:11:15,890 enable_gpu_log: True
LAUNCH INFO 2025-05-14 10:11:15,891 gloo_port: 6767
LAUNCH INFO 2025-05-14 10:11:15,891 host: None
LAUNCH INFO 2025-05-14 10:11:15,891 ips: None
LAUNCH INFO 2025-05-14 10:11:15,891 job_id: default
LAUNCH INFO 2025-05-14 10:11:15,891 legacy: False
LAUNCH INFO 2025-05-14 10:11:15,891 log_dir: log
LAUNCH INFO 2025-05-14 10:11:15,891 log_level: INFO
LAUNCH INFO 2025-05-14 10:11:15,891 log_overwrite: False
LAUNCH INFO 2025-05-14 10:11:15,891 master: None
LAUNCH INFO 2025-05-14 10:11:15,891 max_restart: 3
LAUNCH INFO 2025-05-14 10:11:15,891 nnodes: 1
LAUNCH INFO 2025-05-14 10:11:15,891 nproc_per_node: None
LAUNCH INFO 2025-05-14 10:11:15,891 rank: -1
LAUNCH INFO 2025-05-14 10:11:15,891 run_mode: collective
LAUNCH INFO 2025-05-14 10:11:15,891 server_num: None
LAUNCH INFO 2025-05-14 10:11:15,891 servers:
LAUNCH INFO 2025-05-14 10:11:15,891 sort_ip: False
LAUNCH INFO 2025-05-14 10:11:15,891 start_port: 6070
LAUNCH INFO 2025-05-14 10:11:15,892 trainer_num: None
LAUNCH INFO 2025-05-14 10:11:15,892 trainers:
LAUNCH INFO 2025-05-14 10:11:15,892 training_script: tools/train.py
LAUNCH INFO 2025-05-14 10:11:15,892 training_script_args: ['-c', 'ppcls/configs/ImageNet/PPLCNetV2/PPLCNetV2_base.yaml']
LAUNCH INFO 2025-05-14 10:11:15,892 with_gloo: 1
LAUNCH INFO 2025-05-14 10:11:15,893 --------------------------------------------------
LAUNCH INFO 2025-05-14 10:11:15,894 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2025-05-14 10:11:15,894 Run Pod: ellnfi, replicas 1, status ready
LAUNCH INFO 2025-05-14 10:11:15,899 Watching Pod: ellnfi, replicas 1, status running
LAUNCH WARNING 2025-05-14 10:11:15,988 save gpu info failed
[2025/05/14 10:11:18] ppcls INFO:
===========================================================
==        PaddleClas is powered by PaddlePaddle !        ==
===========================================================
==                                                       ==
==   For more info please go to the following website.   ==
==                                                       ==
==       https://github.com/PaddlePaddle/PaddleClas      ==
===========================================================

[2025/05/14 10:11:18] ppcls INFO: Global :
[2025/05/14 10:11:18] ppcls INFO:     checkpoints : None
[2025/05/14 10:11:18] ppcls INFO:     pretrained_model : None
[2025/05/14 10:11:18] ppcls INFO:     output_dir : I:/output_dir/
[2025/05/14 10:11:18] ppcls INFO:     device : gpu
[2025/05/14 10:11:18] ppcls INFO:     save_interval : 1
[2025/05/14 10:11:18] ppcls INFO:     eval_during_train : True
[2025/05/14 10:11:18] ppcls INFO:     eval_interval : 1
[2025/05/14 10:11:18] ppcls INFO:     epochs : 480
[2025/05/14 10:11:18] ppcls INFO:     print_batch_step : 10
[2025/05/14 10:11:18] ppcls INFO:     use_visualdl : False
[2025/05/14 10:11:18] ppcls INFO:     image_shape : [3, 512, 512]
[2025/05/14 10:11:18] ppcls INFO:     save_inference_dir : D:/save_inference_dir
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: AMP :
[2025/05/14 10:11:18] ppcls INFO:     use_amp : False
[2025/05/14 10:11:18] ppcls INFO:     use_fp16_test : False
[2025/05/14 10:11:18] ppcls INFO:     scale_loss : 128.0
[2025/05/14 10:11:18] ppcls INFO:     use_dynamic_loss_scaling : True
[2025/05/14 10:11:18] ppcls INFO:     use_promote : False
[2025/05/14 10:11:18] ppcls INFO:     level : O1
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Arch :
[2025/05/14 10:11:18] ppcls INFO:     name : PPLCNetV2_base
[2025/05/14 10:11:18] ppcls INFO:     class_num : 1000
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Loss :
[2025/05/14 10:11:18] ppcls INFO:     Train :
[2025/05/14 10:11:18] ppcls INFO:         CELoss :
[2025/05/14 10:11:18] ppcls INFO:             weight : 1.0
[2025/05/14 10:11:18] ppcls INFO:             epsilon : 0.1
[2025/05/14 10:11:18] ppcls INFO:     Eval :
[2025/05/14 10:11:18] ppcls INFO:         CELoss :
[2025/05/14 10:11:18] ppcls INFO:             weight : 1.0
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Optimizer :
[2025/05/14 10:11:18] ppcls INFO:     name : Momentum
[2025/05/14 10:11:18] ppcls INFO:     momentum : 0.9
[2025/05/14 10:11:18] ppcls INFO:     lr :
[2025/05/14 10:11:18] ppcls INFO:         name : Cosine
[2025/05/14 10:11:18] ppcls INFO:         learning_rate : 0.8
[2025/05/14 10:11:18] ppcls INFO:         warmup_epoch : 5
[2025/05/14 10:11:18] ppcls INFO:     regularizer :
[2025/05/14 10:11:18] ppcls INFO:         name : L2
[2025/05/14 10:11:18] ppcls INFO:         coeff : 4e-05
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: DataLoader :
[2025/05/14 10:11:18] ppcls INFO:     Train :
[2025/05/14 10:11:18] ppcls INFO:         dataset :
[2025/05/14 10:11:18] ppcls INFO:             name : MultiScaleDataset
[2025/05/14 10:11:18] ppcls INFO:             image_root : D:/AI/
[2025/05/14 10:11:18] ppcls INFO:             cls_label_path : D:/AI/train_list.txt
[2025/05/14 10:11:18] ppcls INFO:             transform_ops :
[2025/05/14 10:11:18] ppcls INFO:                 DecodeImage :
[2025/05/14 10:11:18] ppcls INFO:                     to_rgb : True
[2025/05/14 10:11:18] ppcls INFO:                     channel_first : False
[2025/05/14 10:11:18] ppcls INFO:                 RandCropImage :
[2025/05/14 10:11:18] ppcls INFO:                     size : 512
[2025/05/14 10:11:18] ppcls INFO:                 RandFlipImage :
[2025/05/14 10:11:18] ppcls INFO:                     flip_code : 1
[2025/05/14 10:11:18] ppcls INFO:                 NormalizeImage :
[2025/05/14 10:11:18] ppcls INFO:                     scale : 1.0/255.0
[2025/05/14 10:11:18] ppcls INFO:                     mean : [0.485, 0.456, 0.406]
[2025/05/14 10:11:18] ppcls INFO:                     std : [0.229, 0.224, 0.225]
[2025/05/14 10:11:18] ppcls INFO:                     order :
[2025/05/14 10:11:18] ppcls INFO:         sampler :
[2025/05/14 10:11:18] ppcls INFO:             name : MultiScaleSampler
[2025/05/14 10:11:18] ppcls INFO:             scales : [160, 192, 224, 288, 320]
[2025/05/14 10:11:18] ppcls INFO:             first_bs : 500
[2025/05/14 10:11:18] ppcls INFO:             divided_factor : 32
[2025/05/14 10:11:18] ppcls INFO:             is_training : True
[2025/05/14 10:11:18] ppcls INFO:         loader :
[2025/05/14 10:11:18] ppcls INFO:             num_workers : 4
[2025/05/14 10:11:18] ppcls INFO:             use_shared_memory : True
[2025/05/14 10:11:18] ppcls INFO:     Eval :
[2025/05/14 10:11:18] ppcls INFO:         dataset :
[2025/05/14 10:11:18] ppcls INFO:             name : ImageNetDataset
[2025/05/14 10:11:18] ppcls INFO:             image_root : D:/AI/
[2025/05/14 10:11:18] ppcls INFO:             cls_label_path : D:/AI/val_list.txt
[2025/05/14 10:11:18] ppcls INFO:             transform_ops :
[2025/05/14 10:11:18] ppcls INFO:                 DecodeImage :
[2025/05/14 10:11:18] ppcls INFO:                     to_rgb : True
[2025/05/14 10:11:18] ppcls INFO:                     channel_first : False
[2025/05/14 10:11:18] ppcls INFO:                 ResizeImage :
[2025/05/14 10:11:18] ppcls INFO:                     resize_short : 512
[2025/05/14 10:11:18] ppcls INFO:                 CropImage :
[2025/05/14 10:11:18] ppcls INFO:                     size : 512
[2025/05/14 10:11:18] ppcls INFO:                 NormalizeImage :
[2025/05/14 10:11:18] ppcls INFO:                     scale : 1.0/255.0
[2025/05/14 10:11:18] ppcls INFO:                     mean : [0.485, 0.456, 0.406]
[2025/05/14 10:11:18] ppcls INFO:                     std : [0.229, 0.224, 0.225]
[2025/05/14 10:11:18] ppcls INFO:                     order :
[2025/05/14 10:11:18] ppcls INFO:         sampler :
[2025/05/14 10:11:18] ppcls INFO:             name : DistributedBatchSampler
[2025/05/14 10:11:18] ppcls INFO:             batch_size : 64
[2025/05/14 10:11:18] ppcls INFO:             drop_last : False
[2025/05/14 10:11:18] ppcls INFO:             shuffle : False
[2025/05/14 10:11:18] ppcls INFO:         loader :
[2025/05/14 10:11:18] ppcls INFO:             num_workers : 4
[2025/05/14 10:11:18] ppcls INFO:             use_shared_memory : True
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Infer :
[2025/05/14 10:11:18] ppcls INFO:     infer_imgs : D:/AI/1.jpg
[2025/05/14 10:11:18] ppcls INFO:     batch_size : 10
[2025/05/14 10:11:18] ppcls INFO:     transforms :
[2025/05/14 10:11:18] ppcls INFO:         DecodeImage :
[2025/05/14 10:11:18] ppcls INFO:             to_rgb : True
[2025/05/14 10:11:18] ppcls INFO:             channel_first : False
[2025/05/14 10:11:18] ppcls INFO:         ResizeImage :
[2025/05/14 10:11:18] ppcls INFO:             resize_short : 512
[2025/05/14 10:11:18] ppcls INFO:         CropImage :
[2025/05/14 10:11:18] ppcls INFO:             size : 512
[2025/05/14 10:11:18] ppcls INFO:         NormalizeImage :
[2025/05/14 10:11:18] ppcls INFO:             scale : 1.0/255.0
[2025/05/14 10:11:18] ppcls INFO:             mean : [0.485, 0.456, 0.406]
[2025/05/14 10:11:18] ppcls INFO:             std : [0.229, 0.224, 0.225]
[2025/05/14 10:11:18] ppcls INFO:             order :
[2025/05/14 10:11:18] ppcls INFO:         ToCHWImage : None
[2025/05/14 10:11:18] ppcls INFO:     PostProcess :
[2025/05/14 10:11:18] ppcls INFO:         name : Topk
[2025/05/14 10:11:18] ppcls INFO:         topk : 5
[2025/05/14 10:11:18] ppcls INFO:         class_id_map_file : D:/AI/label_list.txt
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: Metric :
[2025/05/14 10:11:18] ppcls INFO:     Train :
[2025/05/14 10:11:18] ppcls INFO:         TopkAcc :
[2025/05/14 10:11:18] ppcls INFO:             topk : [1, 5]
[2025/05/14 10:11:18] ppcls INFO:     Eval :
[2025/05/14 10:11:18] ppcls INFO:         TopkAcc :
[2025/05/14 10:11:18] ppcls INFO:             topk : [1, 5]
[2025/05/14 10:11:18] ppcls INFO: ------------------------------------------------------------
[2025/05/14 10:11:18] ppcls INFO: profiler_options : None
[2025/05/14 10:11:18] ppcls INFO: train with paddle 2.6.2 and device Place(gpu:0)
Traceback (most recent call last):
  File "I:\AI\PaddleClas\tools\train.py", line 52, in <module>
    engine = Engine(config, mode="train")
  File "I:\AI\PaddleClas\ppcls\engine\engine.py", line 140, in __init__
    self.train_dataloader = build_dataloader(
  File "I:\AI\PaddleClas\ppcls\data\__init__.py", line 116, in build_dataloader
    dataset = eval(dataset_name)(**config_dataset)
  File "I:\AI\PaddleClas\ppcls\data\dataloader\multi_scale_dataset.py", line 54, in __init__
    self._load_anno()
  File "I:\AI\PaddleClas\ppcls\data\dataloader\multi_scale_dataset.py", line 71, in _load_anno
    assert os.path.exists(self.images[-1])
AssertionError
LAUNCH INFO 2025-05-14 10:11:18,934 Pod failed
LAUNCH ERROR 2025-05-14 10:11:18,934 Container failed !!!

monkeycc, May 14 '25 02:05
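
The traceback ends at `assert os.path.exists(self.images[-1])` in `_load_anno`, which suggests that at least one image path built from `image_root` (`D:/AI/`) and an entry in `cls_label_path` (`D:/AI/train_list.txt`) does not exist on disk. A minimal sketch for checking the label file before launching training follows. The helper name `find_missing_images` is hypothetical, and the single-space `path label` line format is an assumption, not something confirmed in this thread.

```python
import os
import tempfile


def find_missing_images(image_root, cls_label_path):
    """Return (line_number, resolved_path) for every listed image that is missing.

    Assumes each non-empty line looks like "<relative/image/path> <label>",
    separated by a single space. This format is an assumption; under it, a
    path that itself contains spaces would be cut at the first space.
    """
    missing = []
    with open(cls_label_path, encoding="utf-8") as fd:
        for line_number, line in enumerate(fd, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            relative_path = line.split(" ")[0]
            full_path = os.path.join(image_root, relative_path)
            if not os.path.exists(full_path):
                missing.append((line_number, full_path))
    return missing


if __name__ == "__main__":
    # Self-contained demo on a throwaway directory. For the issue above the
    # arguments would be "D:/AI/" and "D:/AI/train_list.txt".
    with tempfile.TemporaryDirectory() as root:
        open(os.path.join(root, "a.jpg"), "wb").close()
        list_path = os.path.join(root, "train_list.txt")
        with open(list_path, "w", encoding="utf-8") as fd:
            fd.write("a.jpg 0\nb.jpg 1\n")
        for line_number, path in find_missing_images(root, list_path):
            print(f"line {line_number}: missing {path}")
```

Any line reported by such a check points at an entry whose file is absent, misspelled, or written relative to a different root than `image_root`.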

Job PR-5089-cb62dc3 is done. Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-5089/cb62dc3/index.html

github-actions[bot], Apr 24 '25 20:04

I encourage anyone to help triage and fix the 3 test failures, as I will not be able to triage them for at least two weeks.

FireballDWF, Apr 26 '25 04:04

I am going to wait for the release of PyTorch 2.7.1, as commit 7f79222 should include the update to NCCL 2.26.5, which may resolve the core dumps occurring in all 3 test case failures.

FireballDWF, May 21 '25 04:05

The error log at https://github.com/autogluon/autogluon/actions/runs/15449884138/job/43489311199?pr=5089 contained:

[2025-06-04T18:50:47.608Z] E                       RuntimeError: cuDNN version incompatibility: PyTorch was compiled  against (9, 5, 1) but found runtime version (9, 1, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN. one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.

I will see if I can figure out what to change, but I encourage anyone more familiar with the test setup to provide suggestions or additional commits.

FireballDWF, Jun 04 '25 19:06
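
The error message itself names a conflicting cuDNN in `LD_LIBRARY_PATH` as one possible cause. One way to test that hypothesis is to list every `libcudnn` file reachable through that variable. The sketch below uses only the standard library; the helper name `find_cudnn_on_library_path` is hypothetical, and the `libcudnn*.so*` file pattern is an assumption about how the library files are named on the CI image.

```python
import glob
import os


def find_cudnn_on_library_path(library_path):
    """Return libcudnn files found in the directories of an LD_LIBRARY_PATH-style string."""
    hits = []
    for directory in library_path.split(os.pathsep):
        if not directory:
            continue  # empty entries carry no directory to scan
        # The file pattern is an assumption about cuDNN library naming.
        hits.extend(sorted(glob.glob(os.path.join(directory, "libcudnn*.so*"))))
    return hits


if __name__ == "__main__":
    for path in find_cudnn_on_library_path(os.environ.get("LD_LIBRARY_PATH", "")):
        print(path)
```

If this prints a cuDNN 9.1.0 library from a system CUDA directory, that would be consistent with the runtime version reported in the error.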

@tonyhoo @Innixma Based on preliminary analysis, we should either add something like

export LD_LIBRARY_PATH=$(python -c "import torch; print(torch._C._cuda_getLibPath())"):$LD_LIBRARY_PATH

before executing the PyTorch tests, or, if no non-PyTorch tests require CUDA, start from an image that does not have CUDA pre-installed, so that only the copy bundled with PyTorch is used. Another option is to upgrade to a CUDA 12.6 based image. However, I think the better option is to use the CUDA bundled with PyTorch, since that would also support future tests that use a version higher than the current default (for example, a new test against the CUDA 12.8 version bundled with the cu128 wheel of PyTorch).

FireballDWF, Jun 04 '25 20:06
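
As a rough illustration of the "use the bundled copy" option, the sketch below looks for a cuDNN directory under site-packages and builds an `LD_LIBRARY_PATH` value that puts it first. The `nvidia/cudnn/lib` layout is an assumption about how the pip-installed cuDNN wheel is unpacked and should be verified in the actual test image. Both helper names are hypothetical. Empty entries are dropped because the dynamic loader treats an empty `LD_LIBRARY_PATH` entry as the current working directory.

```python
import os
import sysconfig


def bundled_cudnn_dir(site_packages):
    """Return <site_packages>/nvidia/cudnn/lib if it exists, else None.

    The nvidia/cudnn/lib layout is an assumption about the pip-installed
    cuDNN wheel; verify it in the actual environment.
    """
    candidate = os.path.join(site_packages, "nvidia", "cudnn", "lib")
    return candidate if os.path.isdir(candidate) else None


def prepend_to_library_path(directory, current):
    """Put directory first, dropping empty entries and duplicates of it."""
    rest = [entry for entry in current.split(os.pathsep) if entry and entry != directory]
    return os.pathsep.join([directory] + rest)


if __name__ == "__main__":
    # "platlib" is where binary wheels are normally installed.
    found = bundled_cudnn_dir(sysconfig.get_paths()["platlib"])
    if found is None:
        print("no bundled cuDNN directory found under site-packages")
    else:
        print(prepend_to_library_path(found, os.environ.get("LD_LIBRARY_PATH", "")))
```

The printed value could then be exported in the CI script before the PyTorch tests run. Whether that is preferable to changing the base image is the open question in this thread.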

@FireballDWF I am taking a look at the test failure issue and trying to reproduce it locally.

tonyhoo, Jun 17 '25 20:06

I've updated the Docker image to run natively on Torch 2.7.1. I've also adjusted the test cases to account for numerically sensitive differences on short training samples.

tonyhoo, Jun 18 '25 05:06

Job PR-5089-4e80901 is done. Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-5089/4e80901/index.html

github-actions[bot], Jun 18 '25 08:06