
Generalize RAG + PDF Chat feature

Open mishig25 opened this issue 1 year ago • 26 comments

TLDR: implement PDF-chat feature

Update 1 here

Closes #609

When user uploads a PDF:

  1. Parse the PDF text, create embeddings, and save the embeddings in the files bucket (which is also used for saving images for multimodal models)
  2. On subsequent messages in that conversation, use the PDF embeddings for RAG
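
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `chunkText`, `embedDocument`, `retrieveContext`, and the `EmbedFn` type are hypothetical names, the embedder is passed in as a plain function, and it is kept synchronous for brevity (a real embedding call would be asynchronous).

```typescript
// Hypothetical sketch: none of these names come from the chat-ui codebase.
type Embedding = number[];
// Synchronous for brevity; a real embedding call would be asynchronous.
type EmbedFn = (texts: string[]) => Embedding[];

interface EmbeddedChunk {
  text: string;
  embedding: Embedding;
}

// Split the extracted PDF text into chunks of at most `maxWords` words.
function chunkText(text: string, maxWords: number): string[] {
  if (maxWords < 1) throw new RangeError("maxWords must be at least 1");
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    chunks.push(words.slice(i, i + maxWords).join(" "));
  }
  return chunks;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Step 1 (on upload): chunk the text and embed every chunk once.
function embedDocument(text: string, embed: EmbedFn, maxWords = 200): EmbeddedChunk[] {
  const chunks = chunkText(text, maxWords);
  const embeddings = embed(chunks);
  return chunks.map((chunk, i) => ({ text: chunk, embedding: embeddings[i] }));
}

// Step 2 (on each message): return the `topK` chunks most similar to the query.
function retrieveContext(
  query: string,
  chunks: EmbeddedChunk[],
  embed: EmbedFn,
  topK = 3
): string[] {
  const [queryEmbedding] = embed([query]);
  return chunks
    .map((c) => ({ text: c.text, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((c) => c.text);
}
```

The retrieved chunks would then be added to the prompt as context; how that is done is outside the scope of this sketch.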

Limitations

  1. Only the first 20 pages of the PDF are parsed (this limit can be increased or decreased)
  2. A conversation can currently have only one uploaded PDF. When a user uploads a PDF, it overwrites the existing PDF, if any
  3. When the user enables websearch, websearch RAG is used and PDF RAG is not.
  4. Just like websearch RAG, when PDF RAG is enabled, every message of that conversation will use PDF RAG. (In a subsequent PR, we need to use prompting and other techniques so that the tool is used only when it makes sense)
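
To make these limitations concrete, here is a small sketch of the selection rules they imply. All names (`ConversationState`, `selectRagSource`, `attachPdf`, `pagesToParse`, `MAX_PDF_PAGES`) are hypothetical and chosen for illustration only; the PR's real code may be organized differently.

```typescript
// Hypothetical sketch of the rules listed above; not the PR's actual code.
type RagSource = "websearch" | "pdf" | "none";

interface ConversationState {
  webSearchEnabled: boolean;
  pdfName?: string; // at most one PDF per conversation
}

// Limitation 1: only the first 20 pages are parsed (the limit is adjustable).
const MAX_PDF_PAGES = 20;

function pagesToParse(pages: string[]): string[] {
  return pages.slice(0, MAX_PDF_PAGES);
}

// Limitation 2: uploading a new PDF overwrites the previous one.
function attachPdf(state: ConversationState, pdfName: string): ConversationState {
  return { ...state, pdfName };
}

// Limitations 3 and 4: websearch takes precedence over PDF RAG, and once a
// PDF is attached every message in the conversation uses PDF RAG.
function selectRagSource(state: ConversationState): RagSource {
  if (state.webSearchEnabled) return "websearch";
  if (state.pdfName !== undefined) return "pdf";
  return "none";
}
```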

Testing locally

Install the new pdf-parse dependency with npm ci:

npm ci 
npm run dev -- --open

Screen recording

Testing by uploading Mamba paper

https://github.com/huggingface/chat-ui/assets/11827707/acae74fc-e70e-485f-9206-e8d539d5d90a

mishig25 avatar Dec 18 '23 14:12 mishig25

:fire: So cool, will test it locally later, but just from looking at the demo, do you think there's an easy way to show an indicator of when a PDF is already uploaded and will be sent with the message ?

nsarrazin avatar Dec 18 '23 15:12 nsarrazin

but just from looking at the demo, do you think there's an easy way to show an indicator of when a PDF is already uploaded and will be sent with the message ?

At the moment, there is a websearch-like box that indicates PDF RAG was used

image

mishig25 avatar Dec 18 '23 15:12 mishig25

I meant more when the file is loaded and before the conversation is started, like for images: image

I guess it would look a bit different since you can only have one PDF per conversation, but it would be nice to have an indication that a PDF will be used to answer the query :eyes:

nsarrazin avatar Dec 18 '23 15:12 nsarrazin

I see. Let me think about it

mishig25 avatar Dec 18 '23 15:12 mishig25

Small note, if I drag and drop a non-pdf file I get the following weird output Screenshot from 2023-12-21 13-40-59

nsarrazin avatar Dec 21 '23 14:12 nsarrazin

Updates:

  1. Fixed the non-image file upload bug here
  2. Provided better UI/UX for the uploaded file (see attached video). Specifically: (a) the name of the uploaded PDF appears with a PDF icon; (b) the file name and icon show a "blinking" animation while the PDF is being uploaded & embeddings are being created; (c) on hover, an x btn appears that lets you delete the uploaded PDF file
  3. Added an env var (config) for enabling the pdf-chat feature, as here

https://github.com/huggingface/chat-ui/assets/11827707/723cd355-edea-4204-98db-0632546b3cf4

What I'm working on now:

  1. Address nathan's comments here & here
  2. Test creating embeddings on a large PDF file with TEI https://github.com/huggingface/chat-ui/pull/646

mishig25 avatar Jan 09 '24 12:01 mishig25

Thanks for working on this! One question I had is: what was your thought process for adding a new upload button for PDFs, versus using the existing drag-and-drop functionality for images?

wdhorton avatar Jan 09 '24 22:01 wdhorton

@wdhorton currently the UI might still evolve. For now, the reason for reusing the same upload btn (instead of adding a new upload btn) is: having two different upload btns makes the UI look cluttered, especially on smaller screens

mishig25 avatar Jan 10 '24 09:01 mishig25

I have a few questions:

  • Why are you limiting this feature to PDF and not csv, txt, etc.?
  • I'm not sure you have to use embeddings for PDF (or at least make it optional rather than mandatory)
  • Why use Mongo as the vectorDB and calculate the vector similarity on the client side, instead of using a real vectorDB that can do it way more efficiently and faster than running it in JS? This could make the code cleaner and remove any limitation on content size
  • Is storing all the embeddings a good idea? This can make the DB blow up relatively fast.

itaybar avatar Jan 10 '24 11:01 itaybar

@itaybar, thanks a lot for your questions

Why are you limiting this feature to PDF and not csv, txt, etc.?

Yes, we will add support for other text files. Once this PR is done, supporting other text files would be trivial. (Might even include it as part of this PR)

I'm not sure you have to use embeddings for PDF (or at least make it optional rather than mandatory)

Could you elaborate on that? And what would the alternatives be?

Why use Mongo as the vectorDB and calculate the vector similarity on the client side, instead of using a real vectorDB that can do it way more efficiently and faster than running it in JS? This could make the code cleaner and remove any limitation on content size. Is storing all the embeddings a good idea? This can make the DB blow up relatively fast.

Indeed this is a good point. I/we will add support for vectorDB (likely part of this PR)

mishig25 avatar Jan 12 '24 10:01 mishig25

Update: this PR is getting big. Unfortunately, there is no other option (I think). The specific points are:

  1. In #689, I've generalized RAG. This means two things. On the server side, RAG applications have to implement a RAG interface for consistency & better organization of the codebase (you can check out the directory src/lib/server/rag). On the frontend side, the OpenWebSearchResults svelte component is generalized into an OpenRAGResults svelte component that will show up on RAG-augmented messages (as suggested here).
  2. Besides creating pdf embeddings through TEI for pdf-chat, we would need vectorDB support for multiple reasons:
    1. without vectorDB, the pdf-chat session will lose the pdf embeddings (for instance, when you close your browser & re-open the same chat-ui conversation, the pdf embeddings will no longer be available)
    2. storing PDF embeddings on mongo gridFS would slow down performance (as questioned here) & a large number of embeddings can cause a lot of latency on the server, since findSimilarSentences runs locally on the server. Therefore, we would need support for vectorDB.
    3. We would need vectorDB for other features as well. For instance, we would need vectorDB + pdf RAG for #639. There was also an internal slack discussion here.
    4. VectorDB support is a more general feature that can have multiple applications. For instance, PDF-chat is just an instance (special case) of vectorDB chat, since in PDF-chat one uses the PDF to populate the vectorDB & afterwards it becomes just chat with vectorDB
    5. FYI, I'm checking openai/chatgpt-retrieval-plugin to see if we would need to follow a commonly used API for vectorDBs
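
To illustrate how a generalized RAG interface and vectorDB support could fit together, here is a rough sketch. Everything in it is an assumption made for illustration: the `RagProvider`, `VectorStore`, `InMemoryVectorStore`, and `PdfRag` names are hypothetical, the actual interface in src/lib/server/rag may look different, and the in-memory store only stands in for a real vectorDB. Embeddings are keyed by conversation id so they could be looked up again when a conversation is re-opened.

```typescript
// Hypothetical interfaces; the actual ones in src/lib/server/rag may differ.
interface RagContext {
  source: string; // e.g. "pdf" or "websearch"
  passages: string[];
}

interface RagProvider {
  name: string;
  retrieve(conversationId: string, queryVector: number[]): RagContext;
}

interface VectorRecord {
  id: string;
  text: string;
  vector: number[];
}

interface VectorStore {
  upsert(namespace: string, records: VectorRecord[]): void;
  query(namespace: string, vector: number[], topK: number): VectorRecord[];
}

// In-memory stand-in for a real vectorDB. It scores by dot product, which
// matches cosine similarity only if all vectors are unit-normalized
// (assumed here for brevity).
class InMemoryVectorStore implements VectorStore {
  private namespaces = new Map<string, Map<string, VectorRecord>>();

  upsert(namespace: string, records: VectorRecord[]): void {
    const bucket = this.namespaces.get(namespace) ?? new Map<string, VectorRecord>();
    for (const record of records) bucket.set(record.id, record);
    this.namespaces.set(namespace, bucket);
  }

  query(namespace: string, vector: number[], topK: number): VectorRecord[] {
    const bucket = this.namespaces.get(namespace);
    if (!bucket) return [];
    const dot = (a: number[], b: number[]): number =>
      a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
    return Array.from(bucket.values())
      .sort((x, y) => dot(y.vector, vector) - dot(x.vector, vector))
      .slice(0, topK);
  }
}

// PDF chat as a special case of "chat with a vectorDB": the PDF only
// populates the store, and retrieval is store-agnostic afterwards.
class PdfRag implements RagProvider {
  name = "pdf";
  private store: VectorStore;
  private topK: number;

  constructor(store: VectorStore, topK = 3) {
    this.store = store;
    this.topK = topK;
  }

  retrieve(conversationId: string, queryVector: number[]): RagContext {
    const hits = this.store.query(conversationId, queryVector, this.topK);
    return { source: this.name, passages: hits.map((h) => h.text) };
  }
}
```

With this shape, swapping the in-memory store for an external vectorDB would only require another `VectorStore` implementation.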

Should I open a PR for vectorDB support against this branch? wdyt @nsarrazin @gary149

mishig25 avatar Jan 12 '24 10:01 mishig25

Thanks for the quick response guys! About the optional embedding for the PDF: correct me if I'm wrong, but you can just read the text from the PDF without using embeddings for the images, etc., and thereby remove the hard limits on the file's content

itaybar avatar Jan 12 '24 10:01 itaybar

@mishig25 What is your estimate for when this will be merged?

itaybar avatar Jan 16 '24 14:01 itaybar

Can't give an exact date. But will do my best to merge it soon :)

mishig25 avatar Jan 16 '24 16:01 mishig25

Merge. hire 10 more people to help.

johndpope avatar Feb 03 '24 21:02 johndpope

Merge

Will merge soon

hire 10 more people to help.

Hiring 10 more people rarely results in 2x productivity (let alone 10x)

mishig25 avatar Feb 04 '24 10:02 mishig25

It's a tough one to implement. I appreciate the work on this one.

I had a couple of thoughts on this I'd like to share.

First, I thought of a plugin-like system in which each plugin is registered based on file type and has one handler for processing/storing and one handler for retrieval. That would allow the community to scale without choking the HF developers, who I'm beyond impressed are able to deliver such a variety of products.
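
As a rough illustration of that plugin idea, a registry keyed by file type might look like the sketch below. All names here (`FilePlugin`, `PluginRegistry`, `plainTextPlugin`) are hypothetical and not an existing chat-ui API; the example plugin uses naive keyword matching purely to keep the sketch self-contained.

```typescript
// Hypothetical plugin contract; not an existing chat-ui API.
interface FilePlugin {
  mimeTypes: string[];
  // Turn raw file content into stored units (e.g. chunks or embeddings).
  process(content: string): string[];
  // Return the stored units that are relevant to the query.
  retrieve(query: string, stored: string[]): string[];
}

class PluginRegistry {
  private plugins = new Map<string, FilePlugin>();

  // Register the plugin under every file type it declares.
  register(plugin: FilePlugin): void {
    for (const mime of plugin.mimeTypes) this.plugins.set(mime, plugin);
  }

  // Look up the plugin for a file type, if any was registered.
  resolve(mimeType: string): FilePlugin | undefined {
    return this.plugins.get(mimeType);
  }
}

// Example plugin: stores non-empty lines and retrieves the lines that
// contain any query term (case-insensitive).
const plainTextPlugin: FilePlugin = {
  mimeTypes: ["text/plain", "text/csv"],
  process: (content) =>
    content
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.length > 0),
  retrieve: (query, stored) => {
    const terms = query
      .toLowerCase()
      .split(/\s+/)
      .filter((t) => t.length > 0);
    return stored.filter((line) => terms.some((t) => line.toLowerCase().includes(t)));
  },
};

const registry = new PluginRegistry();
registry.register(plainTextPlugin);
```

A PDF plugin would register under its own file type and could delegate storage and retrieval to whichever backend the deployer prefers.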

Alternatively, it could also be a third-party API, much like how OpenAI function calling works, so that the responsibility is NOT on you but on the end user who chooses to deploy. It's great for developers but would probably be a pain for those who just want to feel the power of deploying and have no use beyond that (inspired by the shutdown story of banana.dev).

I believe these would inherently fit better as the nature of open-source deployments is to customize. As such, there is an infinite number of use cases and solutions... PDFs, images, audio files, web search as input; ChromaDB, Pinecone, Qdrant, PGVector, Meilisearch as storage/retrieval to name a few.

flexchar avatar Feb 06 '24 15:02 flexchar

Hello. I have cloned the "chatPDF" branch to use the PDF upload feature. It works locally on my machine when I run npm run dev. However, when I do the same thing on my apache2 server, it shows an error while uploading the PDF. I am attaching the screenshot. image

I have tried copying the exact .env.local file from my local directory to my server's directory but it still shows the same error. Should I open a new issue about it? What else can I provide to help you understand the error?

zubu007 avatar Feb 12 '24 13:02 zubu007

Can you ideally skip apache2? Or check its error logs to see if anything is being logged? It could be that some headers are lost, or that the file is too big and gets blocked. What do your network devtools say about this upload response?

@zubu007

flexchar avatar Feb 12 '24 17:02 flexchar

The console shows a 403 Forbidden error, meaning it has something to do with authentication. I am using custom models with a separate endpoint rather than the default. However, the chat function with the custom model works as expected. When the PDF is being uploaded, it throws the 403. My question is: is the PDF fetch function using a different authentication?

Let me add further console logs in both my local machine and server to see the difference. I will let you know the results here.

zubu007 avatar Feb 16 '24 10:02 zubu007

I have the same problem as https://github.com/huggingface/chat-ui/issues/693. Looking at the prompt and parameters, it retrieves correctly, but the model answers as if it had received only the question.

windprak avatar Feb 19 '24 10:02 windprak

Let me state the problem I am facing simply. Using chatPDF with the default HF model ("name": "mistralai/Mistral-7B-Instruct-v0.1"), it works. But I am trying to add a model to the .env.local file as an openai endpoint and use that model for chatPDF. The chat function works with both models (the default and ours), but pdf-chat only works with the default one, with the same prompt, same PDF, and same parameters. I am adding pictures for better understanding.

For the HF default model, this was the prompt Screenshot 2024-03-04 113803 And this was the response Screenshot 2024-03-04 113718

For our own endpoint, Screenshot 2024-03-04 113819 And this was the response Screenshot 2024-03-04 113726

Is there something I am missing? It must be something simple, because when changing the model in the app, one works and the other doesn't. If you need more information about the error, let me know.

zubu007 avatar Mar 04 '24 10:03 zubu007

Hey, what's the current status of the PR?

lmaosweqf1 avatar Mar 27 '24 09:03 lmaosweqf1

Also very interested in this feature. Is help needed for this? I wasn't sure what the blocker is or how to help.

C-Loftus avatar Apr 05 '24 20:04 C-Loftus

@mishig25 Thanks for working on this! I'm very interested in this feature as well, so I'm curious whether there are any updates on it.

mansur-abdirimov avatar Jun 08 '24 16:06 mansur-abdirimov

Hi @mishig25 and all, from my side I would also offer to help in my free time if I can, as this is quite an important feature. Reading the above, I think what is needed is an overview of the required steps, or someone who coordinates tasks. E.g. is something missing in PR https://github.com/huggingface/chat-ui/pull/745 ? Or is a redesign of the entire RAG logic, including externalization of the API, required? Or is a meeting perhaps needed to agree on an overarching architecture, etc.?

ruizcrp avatar Jun 12 '24 14:06 ruizcrp