Formula patch bridge from Docling + RapidOCR and ocrmac additions + GPU accelerator patch + VLM backend
Why are these changes needed?
Bridges the --enrich_formula option from Docling for the formula patch. I also added the two new OCR engines that were recently added to Docling, ocrmac and RapidOCR, since they are configured through the same pipeline_options.
Update: Added the GPU accelerator patch, as it was requested by a couple of folks in discussions and issues. Imported the necessary pipeline options from Docling and wrote a test for it.
Using the GPU accelerator can improve speed by up to 50% on bare metal.
Update, part 2: Added the VLM backend functionality with the necessary pipeline options. Added all the VLM models.
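For reviewers, the bridging idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the parameter names (enrich_formula, ocr_engine, accelerator_device), the BridgedOptions class, and the bridge_params helper are hypothetical stand-ins, and the real transform passes Docling's own pipeline options objects instead.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OcrEngine(str, Enum):
    # Only the two engines added in this PR are listed here; a real
    # transform would also list the engines it already supported.
    OCRMAC = "ocrmac"
    RAPIDOCR = "rapidocr"


class Device(str, Enum):
    # Hypothetical device names used for illustration only.
    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


@dataclass
class BridgedOptions:
    """Hypothetical stand-in for the pipeline options handed to Docling."""
    enrich_formula: bool = False
    ocr_engine: Optional[OcrEngine] = None
    device: Device = Device.AUTO


def bridge_params(params: dict) -> BridgedOptions:
    """Map flat transform parameters onto one options object.

    Unknown engine or device names raise ValueError, so a typo fails
    early instead of silently falling back to a default.
    """
    engine = params.get("ocr_engine")
    return BridgedOptions(
        enrich_formula=bool(params.get("enrich_formula", False)),
        ocr_engine=OcrEngine(engine) if engine is not None else None,
        device=Device(params.get("accelerator_device", "auto")),
    )


if __name__ == "__main__":
    print(bridge_params({"enrich_formula": True, "ocr_engine": "rapidocr"}))
```

Keeping every option on a single object mirrors why the OCR additions rode along with the formula patch: they all flow through the same pipeline_options.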
Related issue number (if any).
In reference to the issue https://github.com/data-prep-kit/data-prep-kit/issues/1391
GPU accelerator issue: https://github.com/data-prep-kit/data-prep-kit/issues/1347
VLM backend issue: https://github.com/data-prep-kit/data-prep-kit/issues/1145
Thanks for your PR, @ShiroYasha18. To allow a more efficient review, can you please split this PR by referenced issue (i.e., one PR per issue listed)? Thanks.
@ShiroYasha18 I remember mentioning somewhere (maybe in the PRs that have now been closed) that you need to run make generate-expected after any change in docling2parquet, to regenerate the expected output files so that the failing CI/CD test-src test passes. The newly generated output files will be part of your submitted PR.
@swith005 @shahrokhDaijavad Thanks for the feedback! On it. This PR will be divided into 3 new PRs:
- VLM backend addition
- GPU support addition
- Formula patch + New OCR additions