Formula patch bridge from docling + Rapidocr and ocrmac additions + GPU accelerator patch + VLM backend

Open ShiroYasha18 opened this issue 4 months ago • 3 comments

Why are these changes needed?

Bridges the --enrich_formula option from Docling for the formula patch. I also added the two new OCR engines that were recently added to Docling, ocrmac and RapidOCR, since they are configured through the same pipeline_options.

Update: Added the GPU accelerator patch, since a couple of people asked for it in discussions and issues. Imported the necessary pipeline options from Docling and wrote a test for it.

Using the GPU accelerator can boost speed by up to 50% on bare metal.

Update, part 2: Added the VLM backend functionality with the necessary pipeline options. Added all the VLM models.
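For readers following along, here is a rough, standard-library-only sketch of what "bridging" flags onto pipeline options can look like. It is not the code in this PR and does not use the Docling API: the PipelineConfig dataclass, the build_config helper, the defaults, and every flag name other than --enrich_formula are hypothetical stand-ins chosen for illustration.

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical value sets; the real engines/devices are defined by Docling.
OCR_ENGINES = ("easyocr", "rapidocr", "ocrmac")
DEVICES = ("auto", "cpu", "cuda", "mps")


@dataclass
class PipelineConfig:
    """Hypothetical stand-in for a pipeline options object (not a Docling class)."""
    enrich_formula: bool = False
    ocr_engine: str = "easyocr"        # assumed default, for illustration only
    accelerator_device: str = "auto"   # assumed default, for illustration only
    use_vlm: bool = False


def build_config(argv: Optional[List[str]] = None) -> PipelineConfig:
    """Parse command-line style flags and copy them onto a PipelineConfig."""
    parser = argparse.ArgumentParser(description="Illustrative flag bridge")
    parser.add_argument("--enrich_formula", action="store_true")
    parser.add_argument("--ocr_engine", choices=OCR_ENGINES, default="easyocr")
    parser.add_argument("--accelerator_device", choices=DEVICES, default="auto")
    parser.add_argument("--use_vlm", action="store_true")
    args = parser.parse_args(argv)
    # The argparse destinations match the dataclass field names one-to-one.
    return PipelineConfig(**vars(args))


if __name__ == "__main__":
    print(build_config(["--enrich_formula", "--ocr_engine", "rapidocr",
                        "--accelerator_device", "cuda"]))
```

Keeping one options object per run means the formula, OCR, accelerator, and VLM settings can each be tested in isolation, which also fits splitting the work into separate PRs.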

Related issue number (if any).

In reference to issue: https://github.com/data-prep-kit/data-prep-kit/issues/1391
GPU accelerator issue: https://github.com/data-prep-kit/data-prep-kit/issues/1347
VLM backend issue: https://github.com/data-prep-kit/data-prep-kit/issues/1145

ShiroYasha18 avatar Aug 07 '25 18:08 ShiroYasha18

Thanks for your PR, @ShiroYasha18. To make the review more efficient, can you please split this PR by referenced issue (i.e., one PR per issue listed)? Thanks.

swith005 avatar Aug 11 '25 18:08 swith005

@ShiroYasha18 I remember mentioning somewhere (maybe in the PRs that have since been closed) that you need to run make generate-expected after any change in docling2parquet to regenerate the expected output files, so that the currently failing CI/CD test-src test passes. The newly generated output files should be included in your submitted PR.

shahrokhDaijavad avatar Aug 11 '25 20:08 shahrokhDaijavad

@swith005 @shahrokhDaijavad Thanks for the feedback! On it. This PR will be split into three new PRs:

  1. VLM backend addition
  2. GPU support addition
  3. Formula patch + New OCR additions

ShiroYasha18 avatar Aug 12 '25 18:08 ShiroYasha18