Errors while using super_gradients.evaluate_from_recipe

Open JagdishKolhe opened this issue 1 year ago • 2 comments

🐛 Describe the bug

Looks like we can evaluate the model from trained checkpoints only; evaluate_from_recipe gives an error if I want to evaluate a pretrained model. I also get an error when overriding the checkpoint path:

python -m super_gradients.evaluate_from_recipe --config-name=coco2017_yolo_nas_s checkpoint_params.checkpoint_path=/mc/**/yolo_nas_s/yolo_nas_s.pth  #** used for masking real path here

The error is:

[2023-05-18 10:58:48,638][super_gradients.training.models.model_factory][WARNING] - Passing num_classes through arch_params is deprecated and will be removed in the next version. Pass num_classes explicitly to models.get
Error executing job with overrides: ['checkpoint_params.checkpoint_path=/mc//yolo_nas_s/yolo_nas_s.pth']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/evaluate_from_recipe.py", line 44, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/evaluate_from_recipe.py", line 40, in main
    _main()
  File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/evaluate_from_recipe.py", line 35, in _main
    Trainer.evaluate_from_recipe(cfg)
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 320, in evaluate_from_recipe
    model = models.get(
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/training/models/model_factory.py", line 205, in get
    ckpt_entries = read_ckpt_state_dict(ckpt_path=checkpoint_path).keys()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'YoloNAS_S' object has no attribute 'keys'
Caching annotations: 61%|_____________________________________________________________________________ | 3045/5000 [00:01<00:01, 1939.90it/s]
[2023-05-18 10:58:51,233][torch.distributed.elastic.multiprocessing.api][WARNING] - Sending process 144534 closing signal SIGTERM
[2023-05-18 10:58:51,234][torch.distributed.elastic.multiprocessing.api][WARNING] - Sending process 144535 closing signal SIGTERM
[2023-05-18 10:58:51,235][torch.distributed.elastic.multiprocessing.api][WARNING] - Sending process 144536 closing signal SIGTERM
[2023-05-18 10:58:51,235][torch.distributed.elastic.multiprocessing.api][WARNING] - Sending process 144537 closing signal SIGTERM
[2023-05-18 10:58:51,236][torch.distributed.elastic.multiprocessing.api][WARNING] - Sending process 144538 closing signal SIGTERM
[2023-05-18 10:58:51,238][torch.distributed.elastic.multiprocessing.api][WARNING] - Sending process 144539 closing signal SIGTERM
[2023-05-18 10:58:51,238][torch.distributed.elastic.multiprocessing.api][WARNING] - Sending process 144540 closing signal SIGTERM
[2023-05-18 10:58:52,041][torch.distributed.elastic.multiprocessing.api][ERROR] - failed (exitcode: 1) local_rank: 0 (pid: 144533) of binary: /opt/conda/bin/python
Error executing job with overrides: ['checkpoint_params.checkpoint_path=/mc//yolo_nas_s/yolo_nas_s.pth']
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/evaluate_from_recipe.py", line 44, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/evaluate_from_recipe.py", line 40, in main
    _main()
  File "/opt/conda/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/opt/conda/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/opt/conda/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/evaluate_from_recipe.py", line 35, in _main
    Trainer.evaluate_from_recipe(cfg)
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/training/sg_trainer/sg_trainer.py", line 294, in evaluate_from_recipe
    setup_device(
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/common/decorators/factory_decorator.py", line 36, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/training/utils/distributed_training_utils.py", line 240, in setup_device
    setup_gpu(multi_gpu, num_gpus)
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/training/utils/distributed_training_utils.py", line 278, in setup_gpu
    restart_script_with_ddp(num_gpus=num_gpus)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/super_gradients/training/utils/distributed_training_utils.py", line 387, in restart_script_with_ddp
    elastic_launch(config=config, entrypoint=sys.executable)(*sys.argv, *EXTRA_ARGS)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Is there any working way to evaluate a pretrained model?
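The final `AttributeError` in the first traceback suggests that the file at `checkpoint_path` holds a whole pickled `YoloNAS_S` object, while the loading code expects a dict-like object it can call `.keys()` on. A minimal sketch of normalizing either form into a plain state dict follows. The helper `extract_state_dict`, the stand-in class `FakeModel`, and the `"net"` wrapper key are assumptions made for illustration, not part of the super-gradients API.

```python
from collections.abc import Mapping


def extract_state_dict(loaded, wrapper_key="net"):
    """Return a plain name -> weights dict from whatever a checkpoint file held.

    `wrapper_key` is an assumed name for the entry that nests the weights
    inside a training checkpoint; adjust it to match the actual file.
    """
    # Case 1: a whole model object was pickled, so ask it for its weights.
    if not isinstance(loaded, Mapping):
        state_dict_fn = getattr(loaded, "state_dict", None)
        if callable(state_dict_fn):
            return dict(state_dict_fn())
        raise TypeError(f"Unsupported checkpoint content: {type(loaded).__name__}")
    # Case 2: a training checkpoint that nests the weights under a wrapper key.
    if wrapper_key in loaded and isinstance(loaded[wrapper_key], Mapping):
        return dict(loaded[wrapper_key])
    # Case 3: the file already contains a bare state dict.
    return dict(loaded)


class FakeModel:
    """Hypothetical stand-in for a pickled model object (illustration only)."""

    def state_dict(self):
        return {"backbone.weight": [0.1, 0.2]}


# All three layouts reduce to something that supports .keys().
for candidate in (FakeModel(), {"net": {"w": 1}, "epoch": 3}, {"w": 1}):
    print(sorted(extract_state_dict(candidate).keys()))
```

With PyTorch installed, the same helper could be applied to the result of `torch.load(path, map_location="cpu")`, and the returned dict saved again with `torch.save` so that the file contains weights only. Unpickling a whole model object requires its class to be importable in the current environment.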

Versions

Collecting environment information...
PyTorch version: 1.12.1+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31

Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-84-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.5.50
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 525.60.13
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 3329.013
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 4389.82
Virtualization: VT-x
L1d cache: 1.3 MiB
L1i cache: 1.3 MiB
L2 cache: 10 MiB
L3 cache: 100 MiB
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

Versions of relevant libraries:
[pip3] flake8==3.7.9
[pip3] numpy==1.22.4
[pip3] nvidia-dlprof-pytorch-nvtx==1.8.0
[pip3] pytorch-quantization==2.1.2
[pip3] qtorch==0.3.0
[pip3] torch==1.12.1
[pip3] torch-tensorrt==1.1.0a0
[pip3] torchmetrics==0.8.0
[pip3] torchnet==0.0.4
[pip3] torchtext==0.12.0a0
[pip3] torchvision==0.13.1
[conda] magma-cuda110 2.5.2 5 local
[conda] mkl 2019.5 281 conda-forge
[conda] mkl-include 2019.5 281 conda-forge
[conda] numpy 1.22.4 pypi_0 pypi
[conda] nvidia-dlprof-pytorch-nvtx 1.8.0 pypi_0 pypi
[conda] pytorch-quantization 2.1.2 pypi_0 pypi
[conda] qtorch 0.3.0 pypi_0 pypi
[conda] torch 1.12.1 pypi_0 pypi
[conda] torch-tensorrt 1.1.0a0 pypi_0 pypi
[conda] torchmetrics 0.8.0 pypi_0 pypi
[conda] torchnet 0.0.4 pypi_0 pypi
[conda] torchtext 0.12.0a0 pypi_0 pypi
[conda] torchvision 0.13.1 pypi_0 pypi

JagdishKolhe avatar May 18 '23 12:05 JagdishKolhe

Please use the evaluate_checkpoint script, which is dedicated to exactly that. Let me know if you run into further problems, @JagdishKolhe.

shaydeci avatar Nov 27 '23 14:11 shaydeci