Support PDF Document Connector
Add support for reading text from an already OCR'ed PDF file. This would expand the capability of DocumentSkills to read PDF files.
Happy to work on a PR :).
We'd love to have you add a PR for this!
@TaoChenOSU This is related to what you did in the Copilot Chat sample...
@anhhnguyen206, we have the ability to read text from a PDF in the Copilot Chat sample (https://github.com/microsoft/semantic-kernel/tree/main/samples/apps/copilot-chat-app/importdocument). Check that out and please open a PR if you want additional functionality.
The link provided in the discussion above no longer exists. Do you have an updated link?
@VenkateshSrini Here's an updated link: https://github.com/microsoft/chat-copilot/tree/main/tools/importdocument
Thanks. Can you please let me know how the PDF file is parsed and its content extracted? My use case is that I have to extract the content and then perform a search on it. The file cannot be sent outside for parsing.
@VenkateshSrini I am not sure what you mean. Can you elaborate?
My objective is to perform a natural language search on multiple PDF documents uploaded by the user through a chat interface. The user uploads the PDF documents to my API. The requirements are:
- Parse the PDF and extract the content. The PDF files may contain images that have some text in them, so I must extract all the content in the PDF file.
- The PDF files cannot cross the boundaries of my API server. That means the PDF files cannot be exported to some other services of Azure or AWS to extract the content.
- I cannot upload the PDF to inference servers (LLM API exposure) to parse the content and return the result.
What do I propose to do?
- Get the content of the PDF file as pages. This is where I need to understand how the PDF is parsed and its contents extracted: is it done locally in the code, or is the file uploaded to some service?
- Once extracted, I can generate an embedding for each page. This I know how to do.
- Generate an embedding for the provided text. This I know how to do.
- Do a vector search. I plan to use MongoDB vector search, as MongoDB is my storage for the content extracted from the PDFs.
@VenkateshSrini You can use PdfPig to extract the text from the PDF.
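
Once the text of each page has been extracted locally, the embedding and search steps from the plan above might look roughly like the sketch below. It is only an illustration under stated assumptions: `embed` is a stand-in hashed bag-of-words embedder (a real setup would call a locally hosted embedding model), the in-memory list stands in for the vector store, and the names `build_index` and `search` are hypothetical. Text inside images would still need a local OCR step before this point.

```python
import hashlib
import math
from collections import Counter

DIM = 256  # assumed size of the stand-in embedding vectors


def embed(text: str) -> list[float]:
    """Stand-in embedder: hashed bag of words, L2-normalised.

    Assumption: in a real system this is replaced by a locally
    hosted embedding model so no content leaves the server.
    """
    vec = [0.0] * DIM
    for word, count in Counter(text.lower().split()).items():
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are already unit length (or all zero)."""
    return sum(x * y for x, y in zip(a, b))


def build_index(pages: list[str]) -> list[tuple[int, str, list[float]]]:
    """Embed each page; page numbers start at 1."""
    return [(n, text, embed(text)) for n, text in enumerate(pages, start=1)]


def search(index, query: str, top_k: int = 3):
    """Return up to top_k (score, page_number, text) tuples, best first."""
    q = embed(query)
    scored = [(cosine(q, vec), n, text) for n, text, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical page texts, as they might come out of a PDF parser.
    pages = [
        "invoice total amount due",
        "weather forecast rain tomorrow",
    ]
    index = build_index(pages)
    for score, page_no, text in search(index, "rain forecast", top_k=1):
        print(page_no, round(score, 3), text)
```

Everything here runs in-process, which matches the constraint that files must not leave the API server. Swapping the list for a database-backed vector index changes only `build_index` and `search`.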