docling
feat(ocr): added support for RapidOCR engine
- Added RapidOCR as an OCR engine option.
- Added options for configuring the RapidOCR model during document conversion via pipeline options.
- Updated the documentation, added tests, and updated the dependencies (extras) to reflect the new engine.
- Updated the examples to demonstrate the use of RapidOcrOptions.
This change lets users work with the RapidOCR-OnnxRuntime engine, which provides higher accuracy and performance in use cases that involve complex PDF files.
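For reviewers who want to try the new engine, here is a minimal sketch of selecting RapidOCR through the pipeline options. It assumes docling is installed together with the RapidOCR dependencies; the helper name `build_rapidocr_converter` is made up for this illustration and is not part of the PR.

```python
def build_rapidocr_converter():
    """Return a DocumentConverter whose PDF pipeline uses RapidOCR.

    Hypothetical helper for illustration only; not part of docling.
    """
    # Imported lazily so the helper can be defined even where docling
    # (or the RapidOCR dependencies) is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        RapidOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Select RapidOCR with its default settings as the OCR engine.
    pipeline_options.ocr_options = RapidOcrOptions()

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

Calling `build_rapidocr_converter().convert(path)` on a PDF should then run the conversion with RapidOCR instead of the default OCR engine.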
Checklist:
- [x] Commit Message Formatting: Commit titles and messages follow the Conventional Commits guidelines.
- [x] Documentation has been updated, if necessary.
- [x] Examples have been added, if necessary.
- [x] Tests have been added, if necessary.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:
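To check a title locally before pushing, the rule above can be reproduced with Python's `re` module. This is a sketch that assumes the `~=` condition behaves like a regular-expression search; the function name `title_is_conventional` is made up for the example.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:"
)


def title_is_conventional(title: str) -> bool:
    """Return True if the title starts with an allowed type, an optional
    non-empty scope in parentheses, and a colon."""
    return TITLE_RULE.search(title) is not None
```

For this PR, `title_is_conventional("feat(ocr): added support for RapidOCR engine")` passes, while a title without a type prefix, or with a capitalized type such as `Feat:`, does not.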
@Swaymaw would you suggest we need both PaddleOCR and RapidOCR in Docling? Or one of the two is enough?
Please see the test results. Can you address those?
> @Swaymaw would you suggest we need both PaddleOCR and RapidOCR in Docling? Or one of the two is enough?
I would say we can stick with RapidOCR alone: it is much faster than PaddleOCR at the same accuracy, and it is much simpler to install and work with. RapidOCR also makes it easier to train and run inference with custom detection, classification, and recognition model paths, which will improve the framework's usability with use-case-specific models.
Ok, let's then focus on getting this PR running. There are still a few installation issues in CI for onnx.
@Swaymaw Thanks for the configuration options enhancements, this is matching what I had in mind.
However, to better align with an in-development global configuration system in docling (see here) without breaking this config interface down the line, we will take the liberty of temporarily hiding all the device-related configuration options from users in RapidOcrOptions and making AUTO the implicit default. This way, we don't need to delay the merge of this PR, and we will revisit how to expose these configuration options in the short term.
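To make the idea concrete, here is a rough sketch of what "hidden device option with AUTO as the implicit default" could look like. Everything in it is hypothetical (`Device`, `SketchOcrOptions`, `resolve_device`, and the field names and defaults); it is not the actual RapidOcrOptions implementation or the planned global configuration system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Device(str, Enum):
    """Hypothetical device choices, for illustration only."""
    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"


@dataclass
class SketchOcrOptions:
    """Stand-in for an OCR options class; not docling's real class."""
    text_score: float = 0.5  # illustrative user-facing option
    # init=False keeps the device out of the constructor, so users cannot
    # set it yet; it is fixed to AUTO until the option is exposed again.
    _device: Device = field(default=Device.AUTO, init=False, repr=False)


def resolve_device(options: SketchOcrOptions, cuda_available: bool) -> Device:
    """Map AUTO to a concrete device; pass explicit choices through."""
    if options._device is Device.AUTO:
        return Device.CUDA if cuda_available else Device.CPU
    return options._device
```

With this shape, the public constructor stays stable: the device can later be exposed (or replaced by a global setting) without changing how existing callers build their options.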