YOLOv5 Ti Lite Custom Model Compilation Process

Open dpetersonVT23 opened this issue 2 years ago • 58 comments

Is this the correct repository to compile and deploy a custom trained YOLOv5 model from the YOLOv5 Ti repository (https://github.com/TexasInstruments/edgeai-yolov5)?

I am having trouble figuring out where to start in this repo, i.e., where to put the trained weights and begin compilation. I have run the setup script and have already trained my custom model using the edgeai-yolov5 repository.

Should I benchmark or compile first? What are the steps to do so successfully? Any guidance from this point is appreciated.

dpetersonVT23 avatar Aug 11 '22 14:08 dpetersonVT23

You can use this script to compile your custom model: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_custom_pc.sh

Several models are listed in that file for convenience. You can comment out all the models except yolov5, since your question is specifically about yolov5.

You can also look at this tutorial to understand single model compilation: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_tutorials_pc.sh

As you know our default benchmark script that compiles all models is: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_benchmarks_pc.sh But you can also run only one specific model by selecting that model's id in the settings yaml file: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/settings_base.yaml#L65

mathmanu avatar Aug 11 '22 14:08 mathmanu

Thank you for the quick response @mathmanu, will look more closely at these files and let you know if I have any questions!

dpetersonVT23 avatar Aug 11 '22 14:08 dpetersonVT23

@mathmanu I have commented out everything except for the YOLOv5 entry and replaced the paths to the .onnx and .prototxt files with those of my custom model. Is there something else I have to do regarding datasets? I am getting errors running the run_custom_pc.sh script. Thanks again!

dpetersonVT23 avatar Aug 11 '22 15:08 dpetersonVT23

Comment out these datasets that are not needed in your case: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py#L151

Also set the dataset path appropriately: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py#L105 https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py#L112

That should be enough.
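
For reference, a rough sketch of what those dataset entries in benchmark_custom.py look like for a custom dataset (the key names, paths, and dataset name below are illustrative assumptions; check the lines linked above for the exact fields):

    # Illustrative sketch only - adapt the actual dicts around benchmark_custom.py#L105-L112.
    # 'settings' already exists in that script; path/split/name values here are assumptions.
    imagedet_calib_dataset_cfg = dict(
        path=f'{settings.datasets_path}/my_dataset',    # folder containing calibration images
        split='train',
        num_frames=min(settings.calibration_frames, 500),
        name='my_dataset')
    imagedet_val_dataset_cfg = dict(
        path=f'{settings.datasets_path}/my_dataset',    # folder with validation images + COCO json
        split='val',
        num_frames=min(settings.num_frames, 1000),
        name='my_dataset')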

mathmanu avatar Aug 11 '22 15:08 mathmanu

@mathmanu I am confused why the compilation needs access to the dataset, can you help me understand that?

Got the cls and seg datasets commented out. For the get_imagedet_dataset_loaders function, I have my dataset set up as required for YOLO training (images and labels directories, each with train, test, and val subdirectories). What format needs to be returned from this function? The code currently in it seems specific to COCO, so I'm not sure what to replace it with.

dpetersonVT23 avatar Aug 11 '22 15:08 dpetersonVT23

Compilation requires a set of images. edgeai-benchmark compilation works with datasets, and it can also generate accuracy numbers. You can provide your own dataset there instead of COCO, but the data loader there understands the COCO format.
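
For example, a minimal COCO-style detection annotations file looks roughly like this (the file names, box values, and the single category are placeholders):

    # Minimal COCO-format detection annotations (illustrative values only).
    import json

    coco_annotations = {
        "images": [
            {"id": 1, "file_name": "frame_0001.jpg", "width": 640, "height": 480}
        ],
        "annotations": [
            # bbox is [x, y, width, height] in pixels
            {"id": 1, "image_id": 1, "category_id": 1,
             "bbox": [100, 120, 50, 80], "area": 4000, "iscrowd": 0}
        ],
        "categories": [
            {"id": 1, "name": "my_class"}
        ]
    }

    with open("instances_val.json", "w") as f:
        json.dump(coco_annotations, f)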

If you are looking for a simple script that does compilation with only a few images, you can use our low-level tidl tools repository: https://github.com/TexasInstruments/edgeai-tidl-tools
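
For a rough idea of that low-level flow, here is a hedged sketch of compiling an ONNX model through onnxruntime with the TIDL compilation provider, loosely based on the examples in edgeai-tidl-tools (the option names, file paths, and input size below are assumptions, so verify them against that repository):

    # Sketch only - not the exact edgeai-tidl-tools script; check that repo for the real options.
    import os
    import numpy as np
    import onnxruntime as rt

    compile_options = {
        'tidl_tools_path': os.environ.get('TIDL_TOOLS_PATH', './tidl_tools'),
        'artifacts_folder': './model-artifacts/my_yolov5_ti_lite',
        # detection meta-architecture info, same idea as in benchmark_custom.py
        'object_detection:meta_arch_type': 6,
        'object_detection:meta_layers_names_list': './my_yolov5_ti_lite.prototxt',
    }

    so = rt.SessionOptions()
    # 'TIDLCompilationProvider' performs the import/calibration step; inference later
    # uses 'TIDLExecutionProvider' with the generated artifacts.
    sess = rt.InferenceSession('./my_yolov5_ti_lite.onnx',
                               providers=['TIDLCompilationProvider', 'CPUExecutionProvider'],
                               provider_options=[compile_options, {}],
                               sess_options=so)

    # Run a few preprocessed calibration images through the session to complete the import.
    input_name = sess.get_inputs()[0].name
    dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)   # replace with real calibration images
    sess.run(None, {input_name: dummy})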

mathmanu avatar Aug 11 '22 15:08 mathmanu

@mathmanu My goal is to prepare this model for deployment on the Beaglebone AI and have it compiled such that it will take advantage of the hardware accelerators. Is the easiest way to do this through the repo you linked or through the benchmark_custom script?

I am more than fine with skipping the actual benchmarking for now; my immediate goal is to achieve compilation and a method of deployment for this board. Thanks for your timely help!

dpetersonVT23 avatar Aug 11 '22 15:08 dpetersonVT23

@mathmanu I continued working with tutorial_detection.ipynb, which follows the same idea as the benchmark_custom.py script. When I run the cell with tools.run_accuracy, the execution of the cell hangs and does not complete; it remains at 0% task completion. The model path, model file, and pipeline config all print out, but nothing else after that.

When I use the same contents in the benchmark_custom.py script, it produces this output on the terminal (running run_custom_pc.sh); let me know your insight on this error output:

    Final number of subgraphs created are : 1, - Offloaded Nodes - 242, Total Nodes - 242
    2022-08-11 13:47:30.858408637 [E:onnxruntime:, inference_session.cc:1311 operator()] Exception during initialization: /home/a0230315/workarea/onnxrt/onnxruntime/include/onnxruntime/core/graph/graph.h:1300 onnxruntime::Node* onnxruntime::Graph::NodeAtIndexImpl(onnxruntime::NodeIndex) const node_index < nodes_.size() was false. Validating no unexpected access using an invalid node_index. Got:65 Max:1

    [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /home/a0230315/workarea/onnxrt/onnxruntime/include/onnxruntime/core/graph/graph.h:1300 onnxruntime::Node* onnxruntime::Graph::NodeAtIndexImpl(onnxruntime::NodeIndex) const node_index < nodes_.size() was false. Validating no unexpected access using an invalid node_index. Got:65 Max:1

    Traceback (most recent call last):
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/pipeline_runner.py", line 135, in _run_pipeline
        accuracy_result = accuracy_pipeline(description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 104, in __call__
        param_result = self._run(description=description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 146, in _run
        output_list = self._infer_frames(description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 194, in _infer_frames
        is_ok = session.start_infer()
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/sessions/onnxrt_session.py", line 80, in start_infer
        self.interpreter = self._create_interpreter(is_import=False)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/sessions/onnxrt_session.py", line 132, in _create_interpreter
        provider_options=[runtime_options, {}], sess_options=sess_options)
      File "/home/mm282681/miniconda3/envs/benchmark/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 283, in __init__
        self._create_inference_session(providers, provider_options)
      File "/home/mm282681/miniconda3/envs/benchmark/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 315, in _create_inference_session
        sess.initialize_session(providers, provider_options)
    onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /home/a0230315/workarea/onnxrt/onnxruntime/include/onnxruntime/core/graph/graph.h:1300 onnxruntime::Node* onnxruntime::Graph::NodeAtIndexImpl(onnxruntime::NodeIndex) const node_index < nodes_.size() was false. Validating no unexpected access using an invalid node_index. Got:65 Max:1

    TASKS | 100%|██████████||

dpetersonVT23 avatar Aug 11 '22 18:08 dpetersonVT23

Is it possible to share your .onnx and .prototxt files so that we can take a look?

CC: @debapriyamaji

mathmanu avatar Aug 12 '22 09:08 mathmanu

Sure, I have attached them in a zip. The only modification I have made after exporting to .onnx from .pt is changing the confidence_threshold in the .prototxt from 0.005 to 0.3.

Please let me know if there is anything else I can provide, thanks. @mathmanu @debapriyamaji test_640s_ti_lite.zip

To note: I saw this morning that I do have an artifacts folder with some .txt files and a subdirectory with some .bin files after running this script, even though I still get the above error. Is there any code I can use to test the compiled weights (I assume "artifacts" is an equivalent term) to confirm they compiled correctly and work as expected when running inference?

dpetersonVT23 avatar Aug 12 '22 11:08 dpetersonVT23

While we are waiting for @debapriyamaji to take a look at what you shared, you can try this: the compiled artifact is supposed to work in the EVM using the EdgeAI SDK: https://www.ti.com/tool/download/PROCESSOR-SDK-LINUX-SK-TDA4VM (you can package the artifact by running ./run_package_artifact.sh and then try to use it in the EVM).

mathmanu avatar Aug 15 '22 12:08 mathmanu

Sounds good, thanks @mathmanu. The artifacts were packaged successfully with that script; I will see what I can do in the EVM with the EdgeAI SDK in the meantime.

dpetersonVT23 avatar Aug 15 '22 12:08 dpetersonVT23

Has @debapriyamaji had a chance to review the files? Still working on the integration test on the board.

dpetersonVT23 avatar Aug 18 '22 20:08 dpetersonVT23

Make sure that you are using this configuration and that input_optimization is set to False: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/scripts/benchmark_custom.py#L260

Several customers have questions about yolov5, so @debapriyamaji is integrating yolov5 into our https://github.com/TexasInstruments/edgeai-modelmaker We hope to release the update in a couple of days. Then the only thing you will need to provide is your dataset in COCO format, and everything else, including compilation, will be taken care of by this tool.

Also take a look at a similar thread that reported issues; it seems to have been resolved by using the example in benchmark_custom.py: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1121823/tda4vm-edgeai-benchmark-yolov5-model-compilation-error/4164033#4164033

mathmanu avatar Aug 19 '22 05:08 mathmanu

@mathmanu

Confirmed input_optimization is set to False, and the rest of the configuration matches aside from the paths and the output_feature_16bit_names_list:

    'imagedet-7': dict(
        task_type='detection',
        calibration_dataset=imagedet_calib_dataset,
        input_dataset=imagedet_val_dataset,
        preprocess=preproc_transforms.get_transform_onnx(640, 640,  resize_with_pad=True, backend='cv2', pad_color=[114,114,114]),
        session=sessions.ONNXRTSession(**utils.dict_update(onnx_session_cfg, input_optimization=False, input_mean=(0.0, 0.0, 0.0), input_scale=(0.003921568627, 0.003921568627, 0.003921568627)),
            runtime_options=utils.dict_update(settings.runtime_options_onnx_np2(),
                                {'object_detection:meta_arch_type': 6,
                                 'object_detection:meta_layers_names_list':f'../edgeai-yolov5/weights/test_640s_ti_lite/test_640s_ti_lite.prototxt',
                                 'advanced_options:output_feature_16bit_names_list':'onnx::Reshape_291, onnx::Reshape_347, onnx::Reshape_403'
                                 }),
            model_path=f'../edgeai-yolov5/weights/test_640s_ti_lite/test_640s_ti_lite.onnx'),
        postprocess=postproc_transforms.get_transform_detection_yolov5_onnx(squeeze_axis=None, normalized_detections=False, resize_with_pad=True, formatter=postprocess.DetectionBoxSL2BoxLS()), 
        
        metric=dict(label_offset_pred=datasets.coco_det_label_offset_80to90(label_offset=1)),
        model_info=dict(metric_reference={'accuracy_ap[.5:.95]%':37.4})
    ),

Sounds good; hopefully that removes any bugs in the process of compiling a model from the edgeai-yolov5 repository.

I reviewed the linked thread. The only difference I see, and maybe you noticed this as well if you compared the .prototxts, is that my .prototxt has only 3 yolo_param blocks/layers, whereas the one in that thread (and others I have seen) has 4 when training on YOLOv5s6 from TI. Additionally, their "input" attributes are mapped to integer values, whereas mine contain an "onnx::Reshape_" prefix. I did not deviate or make any significant customizations in the training process, so I'm not sure why these differences are appearing.
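
For reference, a small snippet like the one below can list the tensor names feeding the Reshape nodes in the exported graph, which is where entries such as 'onnx::Reshape_291' in output_feature_16bit_names_list come from (the path is the one from this thread and the op_type filter is only illustrative):

    # Inspect the exported graph to see the actual tensor names; adjust the path as needed.
    import onnx

    model = onnx.load('../edgeai-yolov5/weights/test_640s_ti_lite/test_640s_ti_lite.onnx')
    for node in model.graph.node:
        if node.op_type == 'Reshape':
            print(node.name, 'inputs:', list(node.input), 'outputs:', list(node.output))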

dpetersonVT23 avatar Aug 19 '22 13:08 dpetersonVT23

Hi, can you try removing your onnx package and installing version 1.8.1? Then export the ONNX model once again and try. The reason I am asking is that we just integrated edgeai-yolov5 into edgeai-modelmaker and it worked without issue.

Or you can wait for a day and we shall update edgeai-modelmaker tomorrow with yolov5 support.

mathmanu avatar Aug 22 '22 08:08 mathmanu

I'm assuming you are referring to the export process from .pt to .onnx and .prototxt in the edgeai-yolov5 repository. I will attempt to downgrade the onnx package from 1.11.0 to 1.8.1 and export, but the requirements.txt in that repository says onnx>=1.9.0. My separate conda environment for the edgeai-benchmark repository already had 1.8.1 installed. If this does not solve the issue, I will wait for your release in the near future and go from there, thank you!

dpetersonVT23 avatar Aug 22 '22 12:08 dpetersonVT23

@mathmanu I received this error when trying to export to .onnx and .prototxt in edgeai-yolov5 repo using onnx==1.8.1:

ImportError: /home/user/miniconda3/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /lib/x86_64-linux-gnu/libprotobuf.so.23)

I followed suggested fixes online and nothing worked. The script worked once I ran it with onnx 1.9.0 or above, consistent with the requirements. Let me know if you were referring to a different script for exporting the ONNX model.

dpetersonVT23 avatar Aug 22 '22 12:08 dpetersonVT23

Can you try using a lower Python version? (You can create an environment in miniconda.) Try Python 3.6.

mathmanu avatar Aug 22 '22 13:08 mathmanu

That export worked. The .prototxt no longer has the "onnx::Reshape_" prefixes I mentioned; however, there are still only 3 yolo_param blocks, though this may be normal since I am training a smaller model with only 1 class. When I run the custom benchmark script to compile and get the artifacts, I get an error about the provider type. I am not sure why I do not have the TIDL provider but do have the CPU one; let me know your thoughts.

    UserWarning: Specified provider 'TIDLExecutionProvider' is not in available provider names. Available providers: 'CPUExecutionProvider'
      "Available providers: '{}'".format(name, ", ".join(available_provider_names)))

    Unknown Provider Type: TIDLExecutionProvider
    Traceback (most recent call last):
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/pipeline_runner.py", line 135, in _run_pipeline
        accuracy_result = accuracy_pipeline(description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 104, in __call__
        param_result = self._run(description=description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 146, in _run
        output_list = self._infer_frames(description)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/pipelines/accuracy_pipeline.py", line 194, in _infer_frames
        is_ok = session.start_infer()
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/sessions/onnxrt_session.py", line 80, in start_infer
        self.interpreter = self._create_interpreter(is_import=False)
      File "/home/mm282681/Documents/yolo/localv5ti/edgeai-benchmark/jai_benchmark/sessions/onnxrt_session.py", line 132, in _create_interpreter
        provider_options=[runtime_options, {}], sess_options=sess_options)
      File "/home/mm282681/.local/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
        self._create_inference_session(providers, provider_options, disabled_optimizers)
      File "/home/mm282681/.local/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 379, in _create_inference_session
        sess.initialize_session(providers, provider_options, disabled_optimizers)
    RuntimeError: Unknown Provider Type: TIDLExecutionProvider

dpetersonVT23 avatar Aug 22 '22 13:08 dpetersonVT23

This is good progress. We need to specify Python and onnx versions in our requirements. CC: @debapriyamaji

The error RuntimeError: Unknown Provider Type: TIDLExecutionProvider may mean that the correct onnxruntime for TIDL is not installed: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/setup.sh#L69

Or it may mean that the tidl_tools folder is not found: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_setup_env.sh#L52 https://github.com/TexasInstruments/edgeai-benchmark/blob/master/run_custom_pc.sh#L36 https://github.com/TexasInstruments/edgeai-benchmark/blob/master/setup.sh#L62 https://github.com/TexasInstruments/edgeai-benchmark/blob/master/setup.sh#L71
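
A quick way to check which of the two it is: query the installed onnxruntime directly.

    # With the TI-provided onnxruntime wheel installed, 'TIDLExecutionProvider' (and
    # 'TIDLCompilationProvider') should appear here; a stock pip onnxruntime only
    # reports 'CPUExecutionProvider'.
    import onnxruntime as rt
    print(rt.__version__)
    print(rt.get_available_providers())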

mathmanu avatar Aug 22 '22 14:08 mathmanu

Awesome to hear that has been figured out.

I'll rerun the setup script and then run the custom benchmark script again and let you know if that fixes the issue.

dpetersonVT23 avatar Aug 22 '22 14:08 dpetersonVT23

@mathmanu Ran the setup script again; the same error occurred when running run_custom_pc.sh. I manually removed the tidl_tools directory and the tidl_tools.tar.gz file. After running setup.sh, I manually ran lines 69 and 71 from setup.sh to confirm both were installed correctly as well.

Running run_setup_env.sh with the pc argument prints the correct path for the tidl_tools folder.

If it helps at all, I am working on a Linux machine running Ubuntu 22.04, but I do not think this is a contributing factor to the error.

The tidl_tools folder has the following contents:

    ├── device_config.cfg
    ├── gcc-arm-9.2-2019.12-x86_64-aarch64-none-linux-gnu
    ├── itidl_rt.h
    ├── libtidl_onnxrt_EP.so
    ├── libtidl_tfl_delegate.so
    ├── libvx_tidl_rt.so
    ├── libvx_tidl_rt.so.1.0
    ├── PC_dsp_test_dl_algo.out
    ├── ti_cnnperfsim.out
    ├── tidl_graphVisualiser.out
    ├── tidl_graphVisualiser_runtimes.out
    ├── tidl_model_import_onnx.so
    ├── tidl_model_import_relay.so
    └── tidl_model_import_tflite.so

dpetersonVT23 avatar Aug 22 '22 15:08 dpetersonVT23

Yolov5 support has now been added in edgeai-modelmaker: https://github.com/TexasInstruments/edgeai-modelmaker You can change the model to be trained in the config file: https://github.com/TexasInstruments/edgeai-modelmaker/blob/master/config_detection.yaml#L44

We still need to enhance the yolov5 support - for example, changing the learning rate is not yet enabled - but we wanted to release a version quickly since you have been waiting. Hopefully we can do the pending items and push a complete version tomorrow.

Please try and let us know. Be sure to create a fresh python 3.6 environment for this.

mathmanu avatar Aug 23 '22 14:08 mathmanu

I encountered an error when trying to run the detection example.

I have CUDA 11.3. After running setup_all.sh and running the detection example I encountered this error:

RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.3 and torchvision has CUDA Version=11.7. Please reinstall the torchvision that matches your PyTorch install.

PyTorch seems to be right, but I am not sure why torchvision was installed with CUDA 11.7; it seems to be related to edgeai-torchvision. I removed the edgeai-torchvision directory and pip installed the correct torchvision for CUDA 11.3. The script was then missing modules from edgeai-torchvision. I will have to spend time figuring out a workaround for this.

In the meantime, working on converting VoTT export to COCO format.

dpetersonVT23 avatar Aug 23 '22 18:08 dpetersonVT23

If you have multiple CUDA versions (for example both CUDA 11.3 and CUDA 11.7) installed, then it is possible that the setup of edgeai-torchvision can take the wrong CUDA version. This can be corrected by setting LD_LIBRARY_PATH to the correct CUDA version.
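
After rebuilding edgeai-torchvision, a quick sanity check that the two packages agree on the CUDA version (simple diagnostic, assuming both packages import cleanly):

    # Both builds should report the same CUDA version (11.3 in this case).
    import torch
    import torchvision
    print('torch      ', torch.__version__, '| CUDA', torch.version.cuda)
    print('torchvision', torchvision.__version__)   # build/suffix should match the same CUDA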

mathmanu avatar Aug 24 '22 05:08 mathmanu

Also, we have been using gcc version 7.x (specifically 7.5). We have noticed issues when installing edgeai-torchvision when the gcc version is 5.x.

You are using Ubuntu 22.04, which could have a different gcc version. If the above doesn't solve the issue, you can try again after installing gcc-7 and g++-7.

It is easy to have multiple gcc versions and switch between them using update-alternatives.

mathmanu avatar Aug 24 '22 06:08 mathmanu

If you still have issues, you can use the docker build scripts that we have given here (https://github.com/TexasInstruments/edgeai-modelmaker) to bring up a docker container and use modelmaker inside the container.

mathmanu avatar Aug 24 '22 08:08 mathmanu

Used the docker build scripts, ran the setup_all script, and ran the detection and classification examples. I received this error; it seems to be related to downloading the dataset. I received the same error running both examples.

    argv: ['./scripts/run_modelmaker.py', 'config_detection.yaml']
    Model:yolox_s_lite_mmdet TargetDevice:TDA4VM FPS(Estimate):107
    downloading from http://software-dl.ti.com/jacinto7/esd/modelzoo/latest/datasets/tiscapes2017_driving.zip to /home/edgeai/code/edgeai-modelmaker/data/projects/tiscapes2017_driving/other/download/tiscapes2017_driving.zip
    HTTP Error 403: Forbidden
    Traceback (most recent call last):
      File "./scripts/run_modelmaker.py", line 127, in <module>
        main(config)
      File "./scripts/run_modelmaker.py", line 66, in main
        run_params_file = model_runner.prepare()
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/runner.py", line 96, in prepare
        self.dataset_handling.run()
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/ai_modules/vision/datasets/__init__.py", line 121, in run
        self.params.dataset.extract_path)
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/utils/download_utils.py", line 183, in download_file
        progressbar_creator=progressbar_creator)
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/utils/download_utils.py", line 138, in download_and_extract
        extract_success = extract_files(dataset_url, extract_root)
      File "/home/edgeai/code/edgeai-modelmaker/edgeai_modelmaker/utils/download_utils.py", line 70, in extract_files
        if download_file.endswith('.tar'):
    AttributeError: 'NoneType' object has no attribute 'endswith'

dpetersonVT23 avatar Aug 24 '22 13:08 dpetersonVT23

HTTP Error 403: Forbidden

Network issue?

mathmanu avatar Aug 24 '22 13:08 mathmanu