Support PDF Document Connector

Open anhhnguyen206 opened this issue 2 years ago • 2 comments

Add support for reading text from an already OCR'ed PDF file. This would expand the capability of DocumentSkills to read PDF files.

Happy to work on a PR :).

anhhnguyen206 avatar Mar 27 '23 19:03 anhhnguyen206

We'd love to have you add a PR for this!

alexchaomander avatar Mar 28 '23 18:03 alexchaomander

@TaoChenOSU This is related to what you did in the Copilot Chat sample...

glahaye avatar May 12 '23 06:05 glahaye

@anhhnguyen206, we have the ability to read text from a PDF in the Copilot Chat sample (https://github.com/microsoft/semantic-kernel/tree/main/samples/apps/copilot-chat-app/importdocument). Check that out, and please open a PR if you want additional functionality.

evchaki avatar May 16 '23 20:05 evchaki

The link provided in the above discussion no longer exists. Do you have an updated link?

VenkateshSrini avatar Dec 23 '23 02:12 VenkateshSrini

@VenkateshSrini Here's an updated link: https://github.com/microsoft/chat-copilot/tree/main/tools/importdocument

glahaye avatar Dec 23 '23 23:12 glahaye

Thanks. Can you please let me know how the PDF file is parsed and its content extracted? My use case is that I have to extract the content and then run a search over it. The file cannot be sent to an outside service for parsing.

VenkateshSrini avatar Dec 24 '23 02:12 VenkateshSrini

@VenkateshSrini I am not sure what you mean. Can you elaborate?

glahaye avatar Dec 24 '23 03:12 glahaye

My objective is to run natural-language search, through a chat interface, over multiple PDF documents uploaded by the user. The user uploads the PDF documents to my API. The requirements are:

  1. Parse the PDF and extract the content. The PDF files may contain images that include text, so I must extract all the content in the PDF file.
  2. The PDF files cannot cross the boundaries of my API server. That means they cannot be sent to other services, such as Azure or AWS, to extract the content.
  3. I cannot upload the PDF to inference servers (an exposed LLM API) to parse the content and return the result.

What do I propose to do?

  • Get the content of the PDF file as pages. This is where I need to understand how the PDF is parsed and its content extracted: is it done locally in the code, or is the file uploaded to some service?
  • Once the content is extracted, generate an embedding for each page. This I know how to do.
  • Generate an embedding for the text the user provides. This I also know how to do.
  • Do a vector search. I plan to use MongoDB vector search, since MongoDB is my storage for the content extracted from the PDFs.
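
The last three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page text is assumed to have been extracted locally already, `toy_embed` is a hypothetical bag-of-words placeholder for a real, locally hosted embedding model, and the in-memory list stands in for a vector store such as MongoDB vector search. All function names here are made up for the example.

```python
import math
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def build_vocab(pages: List[str]) -> Dict[str, int]:
    """Map every distinct token in the corpus to a vector position."""
    tokens = sorted({t for page in pages for t in tokenize(page)})
    return {t: i for i, t in enumerate(tokens)}


def toy_embed(text: str, vocab: Dict[str, int]) -> List[float]:
    """Hypothetical placeholder embedding: unit-length bag-of-words counts.

    A real system would call a locally hosted embedding model here.
    Tokens outside the vocabulary are ignored.
    """
    vec = [0.0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; inputs are already unit length, so a dot product."""
    return sum(x * y for x, y in zip(a, b))


def build_index(pages: List[str], vocab: Dict[str, int]) -> List[Tuple[int, List[float]]]:
    """One (page_number, embedding) entry per page, numbered from 1."""
    return [(n, toy_embed(p, vocab)) for n, p in enumerate(pages, start=1)]


def search(index, query: str, vocab: Dict[str, int], top_k: int = 3):
    """Embed the query and return the top_k (page_number, score) pairs."""
    q = toy_embed(query, vocab)
    scored = [(n, cosine(q, v)) for n, v in index]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Page texts as they might come out of a local PDF text extractor.
pages = [
    "invoice total and payment terms for the march order",
    "safety instructions for installing the water pump",
    "warranty coverage and return policy",
]
vocab = build_vocab(pages)
index = build_index(pages, vocab)
results = search(index, "how do I install the pump", vocab, top_k=1)
print(results[0][0])  # page number of the best match
```

Nothing in this sketch leaves the process, which is the property the requirements ask for; whether that holds in a real deployment depends on where the embedding model and the vector store actually run.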

VenkateshSrini avatar Dec 24 '23 04:12 VenkateshSrini

@VenkateshSrini You can use PdfPig to extract the text from the PDF.

glahaye avatar Jan 05 '24 00:01 glahaye