
[EfficientDet/PyTorch] TypeError: new(): invalid data type 'str' when training EfficientDet on Waymo dataset

Open · ChongyuNVIDIA opened this issue 2 years ago • 0 comments

Related to EfficientDet/PyTorch

Describe the bug
When I try to reproduce the EfficientDet training results on the Waymo dataset as described in https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Detection/Efficientdet, training fails with "TypeError: new(): invalid data type 'str'" right after the Waymo dataset is loaded and training starts.

To Reproduce
Steps to reproduce the behavior:

  1. Git clone 'https://github.com/NVIDIA/DeepLearningExamples', cd DeepLearningExamples/PyTorch/Detection/Efficientdet
  2. Run 'waymo_tool/waymo_data_converter.py' to download and convert the Waymo data into COCO format (a quick sanity check of the converted annotations is sketched right after this list).
  3. Adjust the dataset path according to 'scripts/waymo/train_waymo_AMP_8xA100-80G.sh'.
  4. Launch './distributed_train.sh 8 /datasets/Waymo_JoC --model efficientdet_d0 -b 8 --amp --lr 0.2 --sync-bn --opt fusedmomentum --warmup-epochs 1 --output Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N --worker 8 --fill-color mean --model-ema --model-ema-decay 0.999 --eval-after 24 --epochs 24 --save-checkpoint-interval 1 --smoothing 0.0 --waymo --remove-weights class_net box_net anchor --input_size 1536 --num_classes 3 --resume --freeze-layers backbone --waymo-train /datasets/Waymo_JoC/waymo_coco_format_train/images --waymo-val /datasets/Waymo_JoC/waymo_coco_format_val/images --waymo-val-annotation /datasets/Waymo_JoC/waymo_coco_format_val/annotations/annotations.json --waymo-train-annotation /datasets/Waymo_JoC/waymo_coco_format_train/annotations/annotations.json'
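
A quick, hedged sanity check of the converter output (not part of the repository): it inspects the COCO-format annotations produced in step 2 for string-valued per-annotation fields, since those are the kind of values the dataloader later fails to turn into tensors. The annotation path is an assumption that simply mirrors the training command above.

import json
from collections import Counter

# Assumed path, matching the --waymo-train-annotation argument above.
ANN_PATH = "/datasets/Waymo_JoC/waymo_coco_format_train/annotations/annotations.json"

with open(ANN_PATH) as f:
    coco = json.load(f)

# Count which Python type each annotation field carries across the whole file.
field_types = Counter()
for ann in coco.get("annotations", []):
    for key, value in ann.items():
        field_types[(key, type(value).__name__)] += 1

for (key, type_name), count in sorted(field_types.items()):
    print(f"{key:20s} {type_name:8s} x{count}")
# Any per-annotation field reported as 'str' here is a candidate source of the
# "new(): invalid data type 'str'" error described below.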

Expected behavior
EfficientDet training on the Waymo dataset should run to completion without errors.

Environment

  • Container version: pytorch:21.06-py3
  • GPUs in the system: 8x Tesla A100-80GB
  • CUDA version: 11.4
  • CUDA driver version: 470.82.01

The log output from the training run (identical lines that repeat once per rank are collapsed here for readability):

Added key: store_based_barrier_key:1 to store for rank: <0-7> (one line per rank)
Rank <0-7>: Completed store-based barrier for 8 nodes. (one line per rank)
Training in distributed mode with multiple processes, 1 GPU per process. Process <0-7>, total 8. (one line per rank)
model does not have attribute module... (x8)
Restoring model state from stete_dict ...
Loaded state_dict from checkpoint './backbone_checkpoints/efficientdet_backbone_efficientnet_b0_pyt_amp_ckpt_21.06.0.pth' (the two lines above appear once per rank)
Input size to be passed to dataloaders: 1536
Image size used in model: 1536 (the two lines above appear once per rank)
DLL 2022-04-06 06:03:54.651781 - PARAMETER model_name : efficientdet_d0 param_count : 3826868
Converted model to use Synchronized BatchNorm. WARNING: You may have issues if using zero initialized BN layers (enabled by default for ResNets) while sync-bn enabled.
Chong_Results/EfficientDet-D0/Waymo/Dense_PT2106_V100/FP16_24E_8N/train does not exist to load checkpoint (x8)
Using torch DistributedDataParallel. Install NVIDIA Apex for Apex DDP.
DLL 2022-04-06 06:03:56.451268 - PARAMETER Scheduled_epochs : 34
loading annotations into memory... (x8, training annotations)
Done (t=69.44s to 73.08s) creating index... / index created! (x8)
loading annotations into memory... (x8, validation annotations)
Done (t=22.79s to 23.59s) creating index... / index created! (x8)
Traceback (most recent call last):
  File "train.py", line 635, in <module>
    main()
  File "train.py", line 461, in main
    train_metrics = train_epoch(
  File "train.py", line 522, in train_epoch
    input, target = next(loader_iter)
  File "/ngc_0/Chong_dxxz_Projects/Gitlab/Adversarial_Detection/DeepLearningExamples/Detection/Waymo/data/loader.py", line 84, in __iter__
    for next_input, next_target in self.loader:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/ngc_0/Chong_dxxz_Projects/Gitlab/Adversarial_Detection/DeepLearningExamples/Detection/Waymo/data/loader.py", line 65, in fast_collate
    target[tk][i] = torch.tensor(tv, dtype=target[tk].dtype)
TypeError: new(): invalid data type 'str'
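
For context, the failing call in data/loader.py turns every per-image target value into a tensor with torch.tensor, which cannot handle string data. Below is a minimal, hypothetical sketch (not the repository's fast_collate) of a collate-style helper that keeps string-valued fields, such as file names, as plain Python lists and only tensorizes numeric fields; the field names in the example are invented for illustration.

import numpy as np
import torch

def collate_targets(batch_targets):
    # batch_targets: list of per-image target dicts, as a collate_fn would receive.
    out = {}
    for tk in batch_targets[0]:
        first = batch_targets[0][tk]
        if isinstance(first, str):
            # Strings cannot be converted by torch.tensor, so keep them as a
            # plain list instead of letting the conversion raise at collate time.
            out[tk] = [t[tk] for t in batch_targets]
            continue
        values = [np.asarray(t[tk], dtype=np.float32) for t in batch_targets]
        out[tk] = torch.stack([torch.from_numpy(v) for v in values])
    return out

# Invented field names: "bbox" and "cls" are numeric, "file_name" is a string.
targets = [
    {"bbox": [[0.0, 0.0, 10.0, 10.0]], "cls": [1], "file_name": "frame_000.png"},
    {"bbox": [[5.0, 5.0, 20.0, 20.0]], "cls": [2], "file_name": "frame_001.png"},
]
collated = collate_targets(targets)
print({k: (v.shape if torch.is_tensor(v) else v) for k, v in collated.items()})

Whether such string fields should instead be dropped by the Waymo converter itself is an open question; the sketch only shows where a type check at collate time would sit.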
