semantic-kernel Support PDF Document Connector

Add support for reading text from an already OCR'ed PDF file. This would expand the capability of DocumentSkills to read PDF files.

Happy to work on a PR :).

Mar 27 '23 19:03 anhhnguyen206

We'd love to have you add a PR for this!

Mar 28 '23 18:03 alexchaomander

@TaoChenOSU This is related to what you did in the Copilot Chat sample...

May 12 '23 06:05 glahaye

@anhhnguyen206, we have the ability to read text from a PDF in the copilot sample (https://github.com/microsoft/semantic-kernel/tree/main/samples/apps/copilot-chat-app/importdocument). Check that out and please open a PR if you want additional functionality.

May 16 '23 20:05 evchaki

The link provided in the above discussion no longer exists. Do you have an updated link?

Dec 23 '23 02:12 VenkateshSrini

@VenkateshSrini Here's an updated link: https://github.com/microsoft/chat-copilot/tree/main/tools/importdocument

Dec 23 '23 23:12 glahaye

Thanks. Can you please let me know how the PDF file is parsed and its content extracted? My use case is that I have to extract the content and then perform a search on it. The file cannot be sent outside for parsing.

Dec 24 '23 02:12 VenkateshSrini

@VenkateshSrini I am not sure what you mean. Can you elaborate?

Dec 24 '23 03:12 glahaye

My objective is to perform natural language search over multiple PDF documents that the user uploads through a chat interface. The user uploads the PDF documents to my API. The requirements are:

1. Parse the PDF and extract the content. The PDF files may contain images that include text, so I must extract all of the content in the PDF file.
2. The PDF files cannot cross the boundaries of my API server. That means the PDF files cannot be exported to other services on Azure or AWS to extract the content.
3. I cannot upload the PDF to inference servers (exposed LLM APIs) to parse the content and return the result.

What do I propose to do?

1. Get the content of the PDF file as pages. This is where I need to understand how the PDF is parsed and its contents are extracted: is it done locally in the code, or is the file uploaded to some service?
2. Once the content is extracted, generate an embedding for each page. This I know how to do.
3. Generate an embedding for the provided text. This I know how to do.
4. Do a vector search. For this I plan to use MongoDB vector search, since MongoDB is my storage for the content extracted from the PDF.
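
For illustration, the embedding and search steps of the plan above might look roughly like the following Python sketch. It runs entirely in-process, so nothing leaves the server. Everything in it is an assumption rather than an existing API: `embed` is a toy word-count stand-in for a real embedding model, the in-memory index stands in for a MongoDB collection queried with vector search, the page texts are assumed to have been extracted locally already, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(pages: list[str]) -> list[tuple[int, str, Counter]]:
    """Embed each page once; page numbers start at 1."""
    return [(number, text, embed(text)) for number, text in enumerate(pages, start=1)]


def search(index: list[tuple[int, str, Counter]], query: str, top_k: int = 3) -> list[tuple[int, float]]:
    """Return (page_number, score) pairs for the top_k most similar pages."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), number) for number, _, vec in index]
    # Stable sort: pages with equal scores keep their original page order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(number, score) for score, number in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical page texts, assumed to come from a local PDF parser.
    pages = [
        "Invoice total due in March",
        "The cat sat on the mat",
        "Shipping address and invoice number",
    ]
    index = build_index(pages)
    for page_number, score in search(index, "invoice total", top_k=2):
        print(page_number, round(score, 3))
```

In a real setup the per-page vectors would come from an embedding model and be stored alongside the page text, with the nearest-neighbour lookup delegated to the database instead of the linear scan used here.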

Dec 24 '23 04:12 VenkateshSrini

@VenkateshSrini You can use PdfPig to extract the text from the PDF.

Jan 05 '24 00:01 glahaye