feat(ocr): added support for RapidOCR engine

Open Swaymaw opened this issue 1 year ago • 5 comments

  • Added the RapidOCR model as an OCR engine option.
  • Added options for configuring the RapidOCR model during document conversion via pipeline options.
  • Updated documentation, added tests, and updated dependencies (extras) to reflect the added engine support.
  • Updated examples to demonstrate the use of RapidOcrOptions.

This change lets users work with the RapidOCR-OnnxRuntime engine, which provides higher accuracy and performance in use cases that involve complex PDF files.
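As a rough illustration of the options described above, the following sketch shows how RapidOCR might be selected through the pipeline options. It assumes the import paths `docling.datamodel.pipeline_options`, `docling.datamodel.base_models`, and `docling.document_converter`, and that the RapidOCR extra is installed; the helper name `build_rapidocr_converter` is hypothetical and not part of this PR.

```python
def build_rapidocr_converter():
    """Return a DocumentConverter that runs OCR on PDFs with RapidOCR.

    Imports are done lazily so this module can be loaded even when the
    optional docling/RapidOCR dependencies are not installed.
    The import paths below are assumptions about docling's layout.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        RapidOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Enable OCR and pick RapidOCR with its default settings.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = RapidOcrOptions()

    # Apply these pipeline options to PDF inputs only.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

With the dependencies installed, something like `build_rapidocr_converter().convert("some.pdf")` would then run the conversion; the file name here is only a placeholder.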

Checklist:

  • [x] Commit Message Formatting: Commit titles and messages follow the Conventional Commits guidelines.
  • [x] Documentation has been updated, if necessary.
  • [x] Examples have been added, if necessary.
  • [x] Tests have been added, if necessary.

Swaymaw avatar Nov 22 '24 07:11 Swaymaw

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:
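For readers who want to check a title locally before pushing, here is a minimal sketch that applies the same regular expression with Python's `re` module. The function name `is_conventional_title` is made up for illustration, and the sketch only mirrors the pattern shown in the rule above, not Mergify's actual implementation.

```python
import re

# Same pattern as the merge-protection rule: a known type, an optional
# "(scope)", then a colon.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the PR title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None
```

The title of this PR, `feat(ocr): added support for RapidOCR engine`, passes this check, while a title without a type prefix would not.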

mergify[bot] avatar Nov 22 '24 07:11 mergify[bot]

@Swaymaw would you suggest we need both PaddleOCR and RapidOCR in Docling? Or one of the two is enough?

dolfim-ibm avatar Nov 25 '24 07:11 dolfim-ibm

Please see the test results, can you please address those?

dolfim-ibm avatar Nov 25 '24 08:11 dolfim-ibm

> @Swaymaw would you suggest we need both PaddleOCR and RapidOCR in Docling? Or one of the two is enough?

I would say we can stick with RapidOCR alone: it is much faster than PaddleOCR at the same accuracy, and it is much simpler to install and work with. RapidOCR also makes it easier to train and run inference with custom detection, classification, and recognition model paths, which will improve the framework's usability with use-case-specific models.

Swaymaw avatar Nov 25 '24 09:11 Swaymaw

OK, let's focus on getting this PR running, then. There are still a few installation issues in CI for onnx.

dolfim-ibm avatar Nov 25 '24 16:11 dolfim-ibm

@Swaymaw Thanks for the configuration option enhancements; this matches what I had in mind.

However, to better align with a global configuration system currently in development in docling (see here) without breaking this config interface down the line, we will take the liberty of temporarily hiding all device-related configuration options in RapidOcrOptions from users and making AUTO the implicit default. This way we don't need to delay merging this PR, and we will revisit how to expose the configuration options in the short term.

cau-git avatar Nov 27 '24 10:11 cau-git