gpt4-pdf-chatbot-langchain
How to upload different PDFs
I am relatively new to all of this. I was able to ingest and run the program successfully and test the chatbot with the law PDF. I am now trying to test with other PDF documents: I pulled a science PDF (22 pages), created a subfolder within docs called science, and put it in there. I then ran the ingest script, but it seems to only ingest the law PDF from the repo. I deleted all the other PDFs within the docs folder and it then seemed to ingest only the science doc, but when I go to test in the UI it still acts as if it is the law doc.
Am I missing something? Do I need to change the namespace for each PDF? Which files should I be looking at for ingesting and interacting with new PDFs, and how can I differentiate between folders? I noticed that both a 'law' and a 'finance' folder are in the repo, so I am assuming this isn't too complicated.
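If it helps anyone answer: the file I suspect controls this is config/pinecone.ts, which in my checkout looks roughly like the sketch below (the default namespace value and env var names may differ in other versions), plus scripts/ingest-data.ts, which does the actual loading.

// config/pinecone.ts (roughly, as in my checkout -- treat as a sketch)
if (!process.env.PINECONE_INDEX_NAME) {
  throw new Error('Missing Pinecone index name in .env file');
}

const PINECONE_INDEX_NAME = process.env.PINECONE_INDEX_NAME ?? '';

// Ingestion writes into this namespace and the chat UI queries it,
// so every document set ingested under the same value ends up mixed together.
const PINECONE_NAME_SPACE = 'pdf-test';

export { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE };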
I have the same issue/question as well
The latest commit reads all the PDF files inside the docs directory, but not the subfolders inside it. You can find the code in the ingest-data.ts file.
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { CustomPDFLoader } from '@/utils/customPDFLoader';

/* Name of directory to retrieve your files from */
const filePath = 'docs';

export const run = async () => {
  try {
    /* load raw docs from all the files in the directory */
    const directoryLoader = new DirectoryLoader(filePath, {
      '.pdf': (path) => new CustomPDFLoader(path),
    });

    // const loader = new PDFLoader(filePath);
    const rawDocs = await directoryLoader.load();
    // ...the rest of run() splits, embeds, and upserts the docs to Pinecone
  } catch (error) {
    throw new Error('Failed to ingest your data');
  }
};
There is a PR to improve the code.
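Until that lands, one workaround (a sketch, not the repo's official fix) is to pass the recursive flag that DirectoryLoader accepts as its third constructor argument in the langchain.js versions I've used -- check your installed version before relying on it:

// Sketch: make the loader descend into subfolders such as docs/science
// and docs/finance. The third argument toggles recursive traversal.
const directoryLoader = new DirectoryLoader(
  filePath,
  {
    '.pdf': (path) => new CustomPDFLoader(path),
  },
  true, // recursive: also load PDFs from subdirectories
);

const rawDocs = await directoryLoader.load();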
But how do you switch between the documents?
The code reads the subdirectories recursively.
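As for switching between the documents: what worked for me (a sketch, assuming you ingest each folder into its own Pinecone namespace, and noting that import paths vary across langchain.js versions) is to point the vector store at the namespace you want before building the chain:

import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { PineconeStore } from 'langchain/vectorstores/pinecone';
import { pinecone } from '@/utils/pinecone-client'; // the repo's Pinecone client helper in my checkout
import { PINECONE_INDEX_NAME } from '@/config/pinecone';

// Sketch: query an existing index but scope retrieval to one namespace.
// 'science' is just an illustrative namespace name -- use whatever value
// you ingested that document set under.
const index = pinecone.Index(PINECONE_INDEX_NAME);

const vectorStore = await PineconeStore.fromExistingIndex(
  new OpenAIEmbeddings(),
  {
    pineconeIndex: index,
    namespace: 'science', // swap this to switch document sets
    textKey: 'text',
  },
);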
I found that deleting the existing Pinecone vectors before re-ingesting solved most of the issues where it seemed to think the original docs were still available (code sketch at the end of this comment).
However, it seems to be having issues chatting about my new documents: apparently it reads each PDF page as a separate document and doesn't remember the page numbers, and it seems to get lost beyond about page 25 of an 80-page PDF, after which it can't find results and claims the information is 'not in context'.
I had also originally hoped I could just use symbolic links, but apparently it doesn't recognize those as valid, nor OS X aliases either.
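For anyone who wants that 'delete the old vectors first' step as code, this is roughly what I run before re-ingesting. It's a sketch against a recent @pinecone-database/pinecone client; older client versions expose the delete call under a different method name, and the index/namespace names are whatever you configured in .env and config/pinecone.ts:

import { Pinecone } from '@pinecone-database/pinecone';

// Sketch: wipe every vector in the namespace the chatbot ingests into,
// so stale chunks from previously ingested PDFs can't surface in answers.
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.Index(process.env.PINECONE_INDEX_NAME!);

await index.namespace('pdf-test').deleteAll(); // 'pdf-test' = your ingest namespace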
Hi, @GMAN6875! I'm Dosu, and I'm here to help the gpt4-pdf-chatbot-langchain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, you were having trouble uploading different PDFs to the program and were seeking guidance on how to differentiate between folders and how to properly ingest and interact with new PDFs. smtdev explained that the latest commit reads all the PDF files inside the docs directory but not its subfolders, shared the relevant code, and pointed to a PR that improves it. Additionally, DXXS mentioned that deleting the existing Pinecone vectors before re-ingesting solved most issues. However, there are still some remaining issues with the program reading each PDF page as a separate document and not remembering page numbers. They also noted that symbolic links and OS X aliases are not recognized as valid.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the gpt4-pdf-chatbot-langchain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.
Thank you for your understanding, and please don't hesitate to reach out if you have any further questions or concerns!
The bot should probably consult the developers to verify whether this is 'still relevant', as I haven't seen any mention of associated fixes yet.
Thank you for your response, @DXXS! We appreciate your input. Based on your comment, we will be closing this issue. If you have any further questions or concerns, please let us know.