
Feature Request: Allow enabling OCR extraction from PDF

Open GryBsh opened this issue 2 years ago • 5 comments

I'd like to be able to opt in to OCR for PDF documents. I understand that Tesseract doesn't support this, but Form Recognizer does.

GryBsh avatar Oct 13 '23 11:10 GryBsh

Hi @GryBsh, the solution supports Azure Form Recognizer, now known as Azure AI Document Intelligence.

dluc avatar Dec 29 '23 04:12 dluc

I also took this issue to our MS account team and I got the same answer. Let me tell you what I told them: read your own code: https://github.com/microsoft/kernel-memory/blob/0b8e4cc5592000096f39d80fce1302d24e9e9b39/service/Core/DataFormats/Pdf/PdfDecoder.cs

No, the solution does NOT support OCR of any kind on PDFs. It assumes the PDFs have already been OCR'd well. So I don't think that "completed" tag is very accurate.

GryBsh avatar Dec 30 '23 11:12 GryBsh

Bump

Matt-Scheetz avatar Feb 22 '24 22:02 Matt-Scheetz

@Matt-Scheetz - You can examine this project to see how to integrate Tesseract into kernel-memory: https://github.com/microsoft/chat-copilot

Otherwise, Azure Form Recognizer is supported if you add the configuration: https://github.com/microsoft/kernel-memory/blob/main/service/Service/appsettings.json#L338

crickman avatar Feb 26 '24 16:02 crickman

@GryBsh sorry about the misunderstanding. What I meant to say is that KM has integrated Azure Form Recognizer as an optional OCR solution, but the integration is used only for images. For PDFs, KM always uses UglyToad.PdfPig, which is free and was added earlier, if I remember correctly.

In order to use Azure Form Recognizer for PDFs, we'll need to make "PDF extraction" configurable, allowing users to choose between Azure Doc Intelligence, UglyToad.PdfPig, or any other injectable class. It would be a nice feature to have, though we currently don't have a timeline for it. If someone is willing to work on it and send a PR, it would definitely be welcome.

dluc avatar Feb 26 '24 20:02 dluc
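
The "configurable PDF extraction" idea from the last comment could look roughly like the sketch below. It is written in Python purely for illustration (Kernel Memory itself is C#), and every name in it (`PdfTextExtractor`, `EmbeddedTextExtractor`, `OcrExtractor`, `build_extractor`, the `pdf_extractor` config key) is hypothetical and not part of Kernel Memory's API. The two extractors are stubs: one stands in for a parser that reads an existing text layer, the other for an OCR-backed service supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class PdfTextExtractor(Protocol):
    """Hypothetical interface: anything that turns PDF bytes into text."""

    def extract_text(self, pdf_bytes: bytes) -> str: ...


@dataclass
class EmbeddedTextExtractor:
    """Stub for a parser that only reads an existing text layer (no OCR)."""

    def extract_text(self, pdf_bytes: bytes) -> str:
        # A real implementation would parse the PDF structure;
        # this stub just decodes the bytes so the sketch stays runnable.
        return pdf_bytes.decode("utf-8", errors="ignore")


@dataclass
class OcrExtractor:
    """Stub for an OCR-backed extractor; `ocr` is an injected callable."""

    ocr: Callable[[bytes], str]

    def extract_text(self, pdf_bytes: bytes) -> str:
        return self.ocr(pdf_bytes)


def build_extractor(
    config: Dict[str, str], registry: Dict[str, PdfTextExtractor]
) -> PdfTextExtractor:
    """Pick the extractor named in the config, defaulting to the text-layer one."""
    name = config.get("pdf_extractor", "embedded_text")
    if name not in registry:
        raise ValueError(f"Unknown PDF extractor: {name!r}")
    return registry[name]


# Assumed wiring: the OCR callable here is a placeholder, not a real service call.
registry: Dict[str, PdfTextExtractor] = {
    "embedded_text": EmbeddedTextExtractor(),
    "ocr": OcrExtractor(ocr=lambda data: "text recognized by OCR"),
}

extractor = build_extractor({"pdf_extractor": "ocr"}, registry)
print(extractor.extract_text(b"scanned page bytes"))
```

With this shape, OCR for PDFs becomes an opt-in configuration choice, and the text-layer parser remains the default when the key is omitted.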