
pp-shitu: errors when training on my own data

Open rrjia opened this issue 2 years ago • 14 comments

Welcome to PaddleClas and thank you for reporting issues; we greatly appreciate your contribution to PaddleClas! When opening an issue, please provide the following information so that we can quickly locate and effectively resolve your problem:

  1. PaddleClas master and paddlepaddle-gpu 2.3.0
  2. Training environment: a. operating system: Linux; b. Python version: Python 3.8; c. CUDA/cuDNN version: CUDA 11.1. Log:
Traceback (most recent call last):
  File "tools/train.py", line 32, in <module>
    engine.train()
  File "/ssd2/exec/jiaruoran/python/PaddleClas/ppcls/engine/engine.py", line 295, in train
    self.train_epoch_func(self, epoch_id, print_batch_step)
  File "/ssd2/exec/jiaruoran/python/PaddleClas/ppcls/engine/train/train.py", line 54, in train_epoch
    loss_dict = engine.train_loss_func(out, batch[1])
  File "/ssd2/exec/jiaruoran/python/PaddleClas/ppcls/loss/__init__.py", line 51, in __call__
    loss = self.loss_func[0](input, batch)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/slurm/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/slurm/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/ssd2/exec/jiaruoran/python/PaddleClas/ppcls/loss/celoss.py", line 58, in forward
    loss = F.cross_entropy(x, label=label, soft_label=soft_label)
  File "/ssd7/exec/jiaruoran/anaconda3/envs/slurm/lib/python3.7/site-packages/paddle/nn/functional/loss.py", line 1716, in cross_entropy
    label_min.item()))
ValueError: Target 4600925743447261037 is out of lower bound.

It runs fine in CPU mode; the error only occurs in GPU mode.

I did not change any code; I only followed the guide at https://github.com/PaddlePaddle/PaddleClas/blob/release/2.4/docs/zh_CN/image_recognition_pipeline/feature_extraction.md to train the feature extraction network on my own data.

rrjia avatar May 25 '22 09:05 rrjia

Can anyone help analyze this? It is a very strange phenomenon.

rrjia avatar May 26 '22 07:05 rrjia

Hi, we are currently following up on this and trying to reproduce it; we will reply in this issue as soon as there is progress.

HydrogenSulfate avatar Jun 01 '22 07:06 HydrogenSulfate

I ran into the same problem: it runs fine on CPU, but on GPU it fails with the same error:

W0811 06:05:36.229264 468 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.4, Runtime API Version: 10.2
W0811 06:05:36.317312 468 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
[2022/08/11 06:27:04] ppcls INFO: unique_endpoints {''}
[2022/08/11 06:27:04] ppcls INFO: Downloading MobileNetV1_pretrained.pdparams from https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/MobileNetV1_pretrained.pdparams
100%|██████████| 25070/25070 [00:14<00:00, 1714.93it/s]
[2022/08/11 06:27:19] ppcls WARNING: The training strategy provided by PaddleClas is based on 4 gpus. But the number of gpu is 1 in current training. Please modify the stategy (learning rate, batch size and so on) if use this config to train.
Traceback (most recent call last):
  File "tools/train.py", line 32, in <module>
    engine.train()
  File "/home/PaddleClas/ppcls/engine/engine.py", line 339, in train
    self.train_epoch_func(self, epoch_id, print_batch_step)
  File "/home/PaddleClas/ppcls/engine/train/train.py", line 54, in train_epoch
    loss_dict = engine.train_loss_func(out, batch[1])
  File "/home/PaddleClas/ppcls/loss/__init__.py", line 63, in __call__
    loss = loss_func(input, batch)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/PaddleClas/ppcls/loss/celoss.py", line 58, in forward
    loss = F.cross_entropy(x, label=label, soft_label=soft_label)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/nn/functional/loss.py", line 1716, in cross_entropy
    label_min.item()))
ValueError: Target -4715832250522597140 is out of lower bound.

Firestick-Xia avatar Aug 11 '22 07:08 Firestick-Xia

@rrjia OP, has this problem been solved yet?

Firestick-Xia avatar Aug 11 '22 07:08 Firestick-Xia

@rrjia OP, has this problem been solved yet?

You could first check whether the labels used during training fall within [0, class_num).
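
For illustration (an editorial sketch, not part of the original reply), such a check could look like the following, assuming each line of train_list.txt has whitespace-separated "<image_path> <label> ..." fields; adjust the path and class_num to your own config:

```python
# Editorial sketch: verify that every training label lies in [0, class_num).
# Assumes a list file with whitespace-separated "<image_path> <label> ..." lines.
class_num = 101  # Head.class_num in the config

labels = []
with open("./dataset/CUB_200_2011/train_list.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            labels.append(int(parts[1]))

print("samples:", len(labels), "min label:", min(labels), "max label:", max(labels))
assert 0 <= min(labels) and max(labels) < class_num, "labels must be in [0, class_num)"
```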

HydrogenSulfate avatar Aug 11 '22 07:08 HydrogenSulfate

@rrjia I am running exactly the CUB_200_2011 example. (screenshot)

Firestick-Xia avatar Aug 11 '22 07:08 Firestick-Xia

@rrjia OP, has this problem been solved yet?

No. The official team said they would investigate and respond, but there has been no reply so far.

rrjia avatar Aug 11 '22 08:08 rrjia

@rrjia OP, has this problem been solved yet?

No. The official team said they would investigate and respond, but there has been no reply so far.

I have trained with the general recognition dataset several times before and never hit this problem, but since the user above can also reproduce it with the CUB dataset, I will try CUB on my side and see whether I can reproduce it.

HydrogenSulfate avatar Aug 11 '22 08:08 HydrogenSulfate

@rrjia OP, has this problem been solved yet?

No. The official team said they would investigate and respond, but there has been no reply so far.

I could not reproduce your problem with either the CUB200 or the shitu dataset. (screenshot)

Could you share your training environment, e.g. CUDA version, Python version, and whether the OS is Windows or Linux?

HydrogenSulfate avatar Aug 12 '22 06:08 HydrogenSulfate

@HydrogenSulfate CUDA: 10.2, cuDNN: 7.6, paddlepaddle-gpu: 2.3.1, Python: 3.7, Ubuntu 20. MobileNetV1_retrieval.yaml file:

```yaml
# global configs
Global:
  checkpoints: null
  pretrained_model: null
  output_dir: ./output/
  device: gpu
  save_interval: 5
  eval_during_train: True
  eval_interval: 1
  epochs: 50
  print_batch_step: 10
  use_visualdl: False
  # used for static mode and model export
  image_shape: [3, 224, 224]
  save_inference_dir: ./inference
  eval_mode: retrieval

# model architecture
Arch:
  name: RecModel
  infer_output_key: features
  infer_add_softmax: False

  Backbone:
    name: MobileNetV1
    pretrained: False
  BackboneStopLayer:
    name: "flatten"
  Neck:
    name: FC
    embedding_size: 1024
    class_num: 512
  Head:
    name: ArcMargin
    embedding_size: 512
    class_num: 101
    margin: 0.15
    scale: 30

# loss function config for traing/eval process
Loss:
  Train:
    - CELoss:
        weight: 1.0
    - TripletLossV2:
        weight: 1.0
        margin: 0.5
  Eval:
    - CELoss:
        weight: 1.0

Optimizer:
  name: Momentum
  momentum: 0.9
  lr:
    name: MultiStepDecay
    learning_rate: 0.01
    milestones: [20, 30, 40]
    gamma: 0.5
    verbose: False
    last_epoch: -1
  regularizer:
    name: 'L2'
    coeff: 0.0005

# data loader for train and eval
DataLoader:
  Train:
    dataset:
      name: VeriWild
      image_root: ./dataset/CUB_200_2011/
      cls_label_path: ./dataset/CUB_200_2011/train_list.txt
      transform_ops:
        - DecodeImage:
            to_rgb: True
            channel_first: False
        - ResizeImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 0.00392157
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - RandomErasing:
            EPSILON: 0.5
            sl: 0.02
            sh: 0.4
            r1: 0.3
            mean: [0., 0., 0.]
    sampler:
      name: DistributedRandomIdentitySampler
      batch_size: 64
      num_instances: 2
      drop_last: False
      shuffle: True
    loader:
      num_workers: 0
      use_shared_memory: True

  Eval:
    Query:
      dataset:
        name: VeriWild
        image_root: ./dataset/CUB_200_2011/
        cls_label_path: ./dataset/CUB_200_2011/test_list.txt
        transform_ops:
          - DecodeImage:
              to_rgb: True
              channel_first: False
          - ResizeImage:
              size: 224
          - NormalizeImage:
              scale: 0.00392157
              mean: [0.485, 0.456, 0.406]
              std: [0.229, 0.224, 0.225]
              order: ''
      sampler:
        name: DistributedBatchSampler
        batch_size: 64
        drop_last: False
        shuffle: False
      loader:
        num_workers: 4
        use_shared_memory: True

    Gallery:
      dataset:
        name: VeriWild
        image_root: ./dataset/CUB_200_2011/
        cls_label_path: ./dataset/CUB_200_2011/test_list.txt
        transform_ops:
          - DecodeImage:
              to_rgb: True
              channel_first: False
          - ResizeImage:
              size: 224
          - NormalizeImage:
              scale: 1.0/255.0
              mean: [0.485, 0.456, 0.406]
              std: [0.229, 0.224, 0.225]
              order: ''
      sampler:
        name: DistributedBatchSampler
        batch_size: 64
        drop_last: False
        shuffle: False
      loader:
        num_workers: 4
        use_shared_memory: True

Metric:
  Eval:
    - Recallk:
        topk: [1, 5]
    - mAP: {}
```

Firestick-Xia avatar Aug 16 '22 05:08 Firestick-Xia

@HydrogenSulfate CUDA: 10.2, cuDNN: 7.6, paddlepaddle-gpu: 2.3.1, Python: 3.7, Ubuntu 20. MobileNetV1_retrieval.yaml file: (same config as quoted above)

Could you switch to the develop build of paddlepaddle-gpu and check whether the problem still occurs?

python3.7 -m pip install paddlepaddle-gpu==0.0.0.post102 -f https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html
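
As a quick sanity check (an editorial sketch, not part of the original reply), one could confirm that the develop build is actually in use and the GPU is usable before re-running training:

```python
# Editorial sketch: confirm the develop build of paddlepaddle-gpu is active.
import paddle

print(paddle.__version__)      # develop builds typically report 0.0.0
print(paddle.version.cuda())   # CUDA version the wheel was built with, e.g. 10.2
paddle.utils.run_check()       # runs a small program to verify the installation
```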

HydrogenSulfate avatar Aug 16 '22 07:08 HydrogenSulfate

@HydrogenSulfate My conda virtual environment has CUDA 10.2 / cuDNN 7.6; that stays unchanged, right? And do I need to uninstall paddlepaddle-gpu first?

Firestick-Xia avatar Aug 16 '22 07:08 Firestick-Xia

@HydrogenSulfate My conda virtual environment has CUDA 10.2 / cuDNN 7.6; that stays unchanged, right? And do I need to uninstall paddlepaddle-gpu first?

The environment does not need to change. Installing the develop build of paddlepaddle-gpu automatically uninstalls the old version, so there is no need to uninstall it manually. The install command I gave you was copied from the official website for CUDA 10.2; you can also go to the website yourself and install the matching develop build of paddlepaddle-gpu.

HydrogenSulfate avatar Aug 16 '22 08:08 HydrogenSulfate

@HydrogenSulfate May I also ask you about a server problem? ppcls is the conda virtual environment for PaddleClas. (screenshot) What is going on here? I have run into this twice already: after deleting the environment and redeploying it, everything works at first and the python command line is usable, but within a day the python command starts raising this error again. I have searched online for a long time without solving it; any advice would be appreciated!

Firestick-Xia avatar Aug 16 '22 08:08 Firestick-Xia

I ran into the same problem with PPHGNet. After debugging, the problem seems to lie with the labels in the classification part. On my side it is where the label file is loaded for classification: ppcls/engine/evaluation/classification.py, line 59, `batch[1] = batch[1].reshape([-1, 1]).astype("int64")`. With double quotes here the label encoding seems to change; you could try single quotes: `batch[1] = batch[1].reshape([-1, 1])`

LijiaDong1220 avatar Oct 25 '22 07:10 LijiaDong1220

I ran into the same problem with PPHGNet. After debugging, the problem seems to lie with the labels in the classification part. On my side it is where the label file is loaded for classification: ppcls/engine/evaluation/classification.py, line 59, `batch[1] = batch[1].reshape([-1, 1]).astype("int64")`. With double quotes here the label encoding seems to change; you could try single quotes: `batch[1] = batch[1].reshape([-1, 1])`

Single and double quotes should be equivalent in Python; are you sure this change actually fixes the problem?
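
For reference (an editorial sketch, not from the thread): the quote style is indeed irrelevant in Python, and the practical difference in the suggested line is only that it drops the `.astype("int64")` cast:

```python
# Editorial sketch: "int64" and 'int64' are the same Python string, so quote
# style cannot change the label encoding; the two lines differ only in the cast.
import paddle

assert "int64" == 'int64'

label = paddle.to_tensor([3, 7, 1], dtype="int32")
a = label.reshape([-1, 1]).astype("int64")  # original line (explicit int64 cast)
b = label.reshape([-1, 1])                  # suggested line (keeps original dtype)
print(a.dtype, b.dtype)                     # paddle.int64 paddle.int32
```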

HydrogenSulfate avatar Nov 18 '22 09:11 HydrogenSulfate

I ran into the same problem as well: I get this error when running the demo. Error screenshot: (screenshot)

demo: https://github.com/PaddlePaddle/PaddleClas/blob/release/2.5/docs/zh_CN/quick_start/quick_start_classification_professional.md

Genlk avatar Apr 20 '23 06:04 Genlk