kotaemon feat: integrate got-ocr2.0 as image reader

Description

Integrate got-ocr2.0 OCR as an image reader
Add a new extension manager for easily switching between the supported loaders
Also, thanks to @cin-jimmy for his suggestion on github stale (issue)
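
For readers unfamiliar with the idea, here is a rough sketch of what an extension manager of this kind could look like: a registry that maps each file extension to several named loaders and remembers which one is selected. This is not the code in this PR. The class name `ExtensionManager`, its methods, and the loader names used below are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


class ExtensionManager:
    """Hypothetical registry: file extension -> named loader factories."""

    def __init__(self) -> None:
        # e.g. {".png": {"default": factory_a, "got-ocr": factory_b}}
        self._loaders: Dict[str, Dict[str, Callable[[], object]]] = {}
        # Currently selected loader name for each extension.
        self._selected: Dict[str, str] = {}

    def register(self, ext: str, name: str, factory: Callable[[], object]) -> None:
        """Register a loader factory under `name` for the given extension."""
        ext = ext.lower()
        self._loaders.setdefault(ext, {})[name] = factory
        # The first loader registered for an extension becomes its default.
        self._selected.setdefault(ext, name)

    def choices(self, ext: str) -> List[str]:
        """List loader names available for an extension (e.g. for a settings UI)."""
        return list(self._loaders.get(ext.lower(), {}))

    def select(self, ext: str, name: str) -> None:
        """Switch the active loader for an extension."""
        ext = ext.lower()
        if name not in self._loaders.get(ext, {}):
            raise KeyError(f"no loader named {name!r} registered for {ext!r}")
        self._selected[ext] = name

    def get_loader(self, ext: str) -> object:
        """Instantiate the currently selected loader for an extension."""
        ext = ext.lower()
        if ext not in self._selected:
            raise KeyError(f"no loader registered for {ext!r}")
        return self._loaders[ext][self._selected[ext]]()


# Usage with placeholder factories (strings stand in for real reader objects).
manager = ExtensionManager()
manager.register(".png", "default", lambda: "default-reader")
manager.register(".png", "got-ocr", lambda: "got-ocr-reader")
manager.select(".png", "got-ocr")
```

Keeping the selection in a single place like this means any settings screen only has to call `select`, and ingestion only has to call `get_loader`.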

Type of change

[x] New features (non-breaking change).
[ ] Bug fix (non-breaking change).
[ ] Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

[x] I have performed a self-review of my code.
[ ] I have added thorough tests if it is a core feature.
[ ] There is a reference to the original bug report and related work.
[ ] I have commented on my code, particularly in hard-to-understand areas.
[ ] The feature is well documented.

Oct 02 '24 15:10 phv2312

@phv2312, can you add a docker-compose file (one that allows choosing the Docker image for the OCR service)? I think it will help people test more easily.

Oct 04 '24 08:10 cin-niko

Hi @taprosoft @cin-niko. Sorry for the lack of updates for a long time. Could you help review this PR again?

Oct 26 '24 04:10 phv2312

Hi @cin-niko and @taprosoft. I have updated the PR according to niko's comments and rebased onto the latest master. Could you help check this PR again?

Dec 15 '24 11:12 phv2312

@phv2312 Overall it looks good, but it seems that setting the loader per extension doesn't work. For example:

Set pdf loader in Settings -> Retrieval Settings -> File loader: Works
Set pdf loader in Settings -> Loader settings -> Loader .pdf: Doesn't work

Dec 16 '24 05:12 cin-niko

@phv2312 Sorry for the late comment. Overall the logic is fine, but the current settings UI is a bit cluttered. I will push a small change to improve this prior to merging.

Dec 16 '24 06:12 taprosoft

OCR needs to work for PDFs as well. When I scan pages with a scanner, it puts the scanned pages into a multi-page PDF. I need to be able to upload that PDF, have OCR extract the text, and then use the extracted text for reasoning.

Jun 04 '25 17:06 heapsoftware