
fix dataset problems release 2.7

Open yiakwy-xpu-ml-framework-team opened this issue 11 months ago • 6 comments

PR types

Bug fixes

PR changes

Dataset, preprocessing:

  • remove duplicated data loads in the simple dataset
  • handle the detection task for the CTW dataset, where annotation polyline data contains more than 4 points (otherwise the dataset loops forever)
  • handle inChannels scalars
  • speed up data loading
  • improve model loading

Description

Improve code stability and fix bugs when training with the cn_PP_OCR_v4 detection model.

I ran into an infinite dataset-loading loop when reproducing training of an OCR detection model in a local environment. My team is using this model to generate high-quality data for a multi-modal LLM. After this fix, the detection model can be trained very quickly for a few epochs:

Screenshot 2024-02-29 15 09 45

SimpleDataset forever loop:

For datasets such as ctw1500, each annotation contains more than 4 points so that it can locate arbitrarily shaped text in the image. The points may have the shape [2 /* number of annotations in this image */, 14 /* number of points */, 2 /* 2D coordinates */].
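As a rough illustration of that layout (not the code in this PR), the sketch below builds a nested list with two text instances of 14 points each and keeps every polygon that has at least 4 points, rather than insisting on exactly 4. The helper name `keep_valid_polygons` and the `min_points` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def keep_valid_polygons(
    polys: Sequence[Sequence[Sequence[float]]], min_points: int = 4
) -> List[List[Point]]:
    """Keep polygons with at least `min_points` vertices (hypothetical helper).

    Polygons with more than 4 points are kept as-is instead of being
    rejected, which is what curved-text annotations need.
    """
    kept: List[List[Point]] = []
    for poly in polys:
        pts = [(float(x), float(y)) for x, y in poly]
        if len(pts) >= min_points:
            kept.append(pts)
    return kept


# Two text instances, 14 points each, 2D coordinates (made-up values).
example = [[(float(i), float(i % 2)) for i in range(14)] for _ in range(2)]
kept = keep_valid_polygons(example)
shape = (len(kept), len(kept[0]), len(kept[0][0]))
```

Here `shape` mirrors the [2, 14, 2] layout described above, and both instances survive the filter.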

The utility functions should also be adapted to process such annotation data instead of discarding it or raising an exception.
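To see why discarding every sample can turn into a forever loop, here is a minimal sketch, assuming the dataset retries with a random index whenever a transform fails or returns None. The class `BoundedRetryDataset` and its `max_retries` parameter are hypothetical and not part of PaddleOCR; the point is only that a bounded retry surfaces the problem as an error instead of hanging.

```python
import random
from typing import Any, Callable, Optional, Sequence


class BoundedRetryDataset:
    """Dataset wrapper that retries a failed sample a limited number of times."""

    def __init__(
        self,
        samples: Sequence[Any],
        transform: Callable[[Any], Optional[Any]],
        max_retries: int = 10,
        seed: int = 0,
    ) -> None:
        self.samples = samples
        self.transform = transform
        self.max_retries = max_retries
        self._rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> Any:
        for _ in range(self.max_retries + 1):
            try:
                out = self.transform(self.samples[idx])
            except Exception:
                out = None  # treat a failing transform like a rejected sample
            if out is not None:
                return out
            # Fall back to another random sample, as an unbounded loop would.
            idx = self._rng.randrange(len(self.samples))
        raise RuntimeError(
            f"no usable sample after {self.max_retries + 1} attempts; "
            "check whether the transforms reject every annotation"
        )
```

If the transforms reject every annotation (for example, all polygons with more than 4 points), indexing raises `RuntimeError` after `max_retries + 1` attempts rather than looping forever.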

Note that by default the "CopyPaste" imgaug method is chosen, which means two arbitrary images are selected and paired. However, get_ext_data always reads a new image from the dataset, which is very inefficient.
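One possible way to avoid re-reading images for the paired sample is a small in-memory cache keyed by dataset index. The sketch below is only an illustration under that assumption; `SampleCache`, `fake_loader`, and `capacity` are hypothetical names, and this is not necessarily how the PR addresses it.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class SampleCache:
    """Tiny LRU cache keyed by dataset index (hypothetical helper)."""

    def __init__(self, loader: Callable[[int], Any], capacity: int = 64) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._loader = loader
        self._capacity = capacity
        self._items: "OrderedDict[int, Any]" = OrderedDict()

    def get(self, idx: int) -> Any:
        if idx in self._items:
            self._items.move_to_end(idx)  # mark as most recently used
            return self._items[idx]
        value = self._loader(idx)  # the expensive read/decode happens only here
        self._items[idx] = value
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return value


load_calls: List[int] = []


def fake_loader(idx: int) -> dict:
    """Stand-in for reading and decoding an image from disk."""
    load_calls.append(idx)
    return {"idx": idx, "image": bytes(4)}


cache = SampleCache(fake_loader, capacity=2)
cache.get(0)
cache.get(0)  # served from the cache, no second load
cache.get(1)
```

A real pipeline would have to copy the cached sample before augmenting it, since in-place transforms would otherwise corrupt later reads.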

Check-list

  • [ ] This PR is pushed to the dygraph branch or cherry-picked from the dygraph branch. Otherwise, please push your changes to the dygraph branch.
  • [x] This PR fully describes what it does so that reviewers can work faster.
  • [x] This PR is covered by existing tests or has been verified locally.

Thanks for your contribution!

paddle-bot[bot] avatar Feb 29 '24 06:02 paddle-bot[bot]

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Feb 29 '24 06:02 CLAassistant

By the way, I just found this line while scanning the dataset file:

resized_image = resized_image.astype('float32')

The image can be handled effectively as uint8, which allows an 8-bit IO workload and better compute-IO overlap.
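A back-of-the-envelope calculation of the difference, assuming a 640x640 3-channel image (the size is only an example, not taken from the config):

```python
# Bytes per element for the two dtypes discussed above.
ITEMSIZE = {"uint8": 1, "float32": 4}


def image_nbytes(height: int, width: int, channels: int, dtype: str) -> int:
    """Size in bytes of a dense image buffer of the given shape and dtype."""
    return height * width * channels * ITEMSIZE[dtype]


u8_bytes = image_nbytes(640, 640, 3, "uint8")     # 1_228_800 bytes
f32_bytes = image_nbytes(640, 640, 3, "float32")  # 4_915_200 bytes
ratio = f32_bytes // u8_bytes                     # 4
```

Keeping the buffer as uint8 until it reaches the device therefore moves a quarter of the bytes through the loader; the cast to float32 can happen afterwards.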

It seems that we haven't used DALI to accelerate IO through device prefetching and on-chip preprocessing.

Can we land this one? @tink2123

jzhang533 avatar Mar 29 '24 05:03 jzhang533

@yiakwy-xpu-ml-framework-team you'll need to create a PR targeting the dygraph branch. We decided not to merge this into release/2.7.

jzhang533 avatar Mar 29 '24 05:03 jzhang533

> @yiakwy-xpu-ml-framework-team you'll need to create a PR targeting the dygraph branch. We decided not to merge this into release/2.7.

I guess I can cherry-pick this commit onto the dygraph branch. I will do it soon.