
Support multiple PDFs and multiple topics

Open dalkommatt opened this issue 1 year ago • 14 comments

This PR allows users to add multiple subdirectories under docs and to include multiple files in each subdirectory. run ingest will automatically ingest every PDF in those directories and create namespaces that match the subdirectory names. The user can then switch between topics on the home page.
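For illustration, the expected layout is something like this (the topic folder names below are made up):

docs/
  solar-energy/
    panel-manual.pdf
    warranty.pdf
  tax-law/
    irs-guide.pdf

Running ingest would then create the namespaces solar-energy and tax-law, and each one shows up as a selectable topic on the home page.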

dalkommatt avatar Mar 25 '23 00:03 dalkommatt

Checked PRs to see if anyone had made this yet. Thanks a lot for making it! I haven't been able to run this, however, since run ingest returns an error:

error [TypeError: t.replaceAll is not a function]
(node:18260) UnhandledPromiseRejectionWarning (Use `node --trace-warnings ...` to show where the warning was created)
(node:18260) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 42)
(node:18260) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Let me know if I can give you more information to better reproduce. This error happened right after it printed Processing file and the contents of the file split into chunks (it did not progress to the next file in the directory).

Sofianel5 avatar Mar 25 '23 04:03 Sofianel5

I haven't been able to run this however since run ingest returns an error: error [TypeError: t.replaceAll is not a function] […]

Hey @Sofianel5, I just tried it on a different computer with a new index and everything worked as expected. I'm wondering what's going wrong for you: are you using pnpm? Did you put all of the PDFs in subfolders named with lowercase letters a-z and hyphens? Do your subfolder names match the namespaces in your topics?

dalkommatt avatar Mar 25 '23 06:03 dalkommatt

I believe the failure is in the await PineconeStore.fromDocuments(index, chunk, embeddings, 'text', namespace); call. The arguments are of type object, object, object, string, string respectively. I'm just using npm; does pnpm matter? My subfolder is in a valid format, and I added a definition for it to pinecone.ts, but that doesn't seem to be referenced in ingest-data.ts.

Sofianel5 avatar Mar 25 '23 07:03 Sofianel5

Of course, the issue is that my Node version was less than 15. Classic.
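For context: String.prototype.replaceAll only landed in Node 15, which is why the script failed with t.replaceAll is not a function on an older runtime. A small guard at the top of the ingest script makes that failure obvious (sketch only):

// String.prototype.replaceAll requires Node 15+; fail fast with a clear message on older runtimes
const [major] = process.versions.node.split('.').map(Number);
if (major < 15) {
  throw new Error(`Node ${process.versions.node} is too old for this script; please use Node 15 or newer.`);
}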

Sofianel5 avatar Mar 25 '23 07:03 Sofianel5

I added some code to handle adding new directories without re-uploading all the data from the other, unchanged directories, using Merkle trees. File-level change detection is still WIP; right now it re-uploads an entire folder if and only if a change is detected within that folder:

import fs from 'fs';
import path from 'path';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders';
import { PINECONE_INDEX_NAME } from '@/config/pinecone';
import { hashElement } from 'folder-hash';

/* Read the last recorded hash tree for a directory; fall back to an empty hash if none exists */
function getPreviousHash(fulldir: string): { hash: string } {
  try {
    const dirname = path.basename(fulldir);
    const data = fs.readFileSync(`./history/${dirname}.json`, 'utf8');
    return JSON.parse(data);
  } catch (err) {
    console.error(err);
    return { hash: '' };
  }
}

/* Persist the directory's hash tree so unchanged folders can be skipped on the next run */
function setPreviousHash(hash: { hash: string }, fulldir: string) {
  try {
    const data = JSON.stringify(hash);
    const dirname = path.basename(fulldir);
    fs.writeFileSync(`./history/${dirname}.json`, data);
  } catch (err) {
    console.error(err);
  }
}

/* Hash the directory's PDFs after a successful ingest and record the result locally */
function recordFinishedLocally(directory: string) {
  const options = {
    folders: { exclude: ['.*', 'node_modules', 'test_coverage'] },
    files: { include: ['*.pdf'] },
  };
  hashElement(directory, options)
    .then((hash: any) => {
      setPreviousHash(hash, directory);
    })
    .catch((error: any) => {
      console.error('hashing failed:', error);
    });
}

/* Compare the directory's current hash tree with the recorded one and only run the callback when something changed */
function checkDiff(directory: string, callback: Function) {
  const prevHash = getPreviousHash(directory);
  const options = {
    folders: { exclude: ['.*', 'node_modules', 'test_coverage'] },
    files: { include: ['*.pdf'] },
  };
  // Return the promise chain so callers can await the (possibly async) callback
  return hashElement(directory, options)
    .then((newHash: any) => {
      if (newHash.hash === prevHash.hash) {
        console.log('no changes detected');
        return false;
      }
      console.log('changes detected');
      return callback();
    })
    .catch((error: any) => {
      console.error('hashing failed:', error);
    });
}
 
export const run = async () => {
  try {
    /* Load all directories */
    const directories = fs
      .readdirSync('./docs')
      .filter((file) => {
        return fs.statSync(path.join('./docs', file)).isDirectory();
      })
      .map((dir) => `./docs/${dir}`); // Add prefix 'docs/' to directory names
    console.log('directories: ', directories);
    for (const directory of directories) {
      /* Only re-ingest the directory if its contents changed since the last run */
      await checkDiff(directory, async () => {
        const files = fs
          .readdirSync(directory)
          .filter((file) => path.extname(file) === '.pdf');

        for (const file of files) {
          console.log(`Processing file: ${file}`);

          /* Load raw docs from the pdf file */
          const filePath = path.join(directory, file);
          const loader = new PDFLoader(filePath);
          const rawDocs = await loader.load();

          // console.log(rawDocs);

          /* Split text into chunks */
          const textSplitter = new RecursiveCharacterTextSplitter({
            chunkSize: 1000,
            chunkOverlap: 200,
          });

          const docs = await textSplitter.splitDocuments(rawDocs);
          // console.log('split docs', docs);

          // console.log('creating vector store...');
          /*create and store the embeddings in the vectorStore*/
          const embeddings = new OpenAIEmbeddings();
          const index = pinecone.Index(PINECONE_INDEX_NAME); 
          const namespace = path.basename(directory); // use the directory name as the namespace 
          // console.log("creating vector store with namespace: ", namespace)
          //embed the PDF documents

          /* Pinecone recommends a limit of 100 vectors per upsert request to avoid errors */
          const chunkSize = 50;
          for (let i = 0; i < docs.length; i += chunkSize) {
            const chunk = docs.slice(i, i + chunkSize);
            // upsert this batch of chunks into the namespace for the current topic
            await PineconeStore.fromDocuments(
              index,
              chunk,
              embeddings,
              'text',
              namespace,
            );
          }

          console.log(`File ${file} processed`);
        }
        /* Record the directory's hash only after every file in it has been ingested */
        recordFinishedLocally(directory);
      });
    }
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('completed ingestion of all PDF files in all directories');
})();
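
For the file-level changes mentioned above (still WIP), one possible direction is to compare the per-file child hashes that folder-hash already returns, so only changed PDFs get re-embedded; a rough sketch (the helper name getChangedFiles is hypothetical):

// Compare the per-file hashes from two folder-hash trees and return the names of files that changed or were added
function getChangedFiles(prevTree: any, newTree: any): string[] {
  const prevFiles = new Map<string, string>();
  for (const child of prevTree.children ?? []) {
    prevFiles.set(child.name, child.hash);
  }
  const changed: string[] = [];
  for (const child of newTree.children ?? []) {
    if (prevFiles.get(child.name) !== child.hash) {
      changed.push(child.name);
    }
  }
  return changed;
}

checkDiff could then hand the changed file names to its callback instead of a yes/no answer, so only those PDFs are re-ingested into the namespace.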

Note: you'll also need to create a history folder at the top level of the repo.
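
If you'd rather not create it by hand, a small addition at the top of run() covers it (sketch):

// Ensure the hash history folder exists before any reads or writes
if (!fs.existsSync('./history')) {
  fs.mkdirSync('./history', { recursive: true });
}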

Sofianel5 avatar Mar 25 '23 08:03 Sofianel5

I added some code to handle adding new directories without re-uploading all the data from the other, unchanged directories, using Merkle trees. […]

Is the diff check working in your local environment? The main challenge with managing vector stores is updating and modifying files in a way that upserts cost-effectively without triggering a wholesale re-ingestion.

mayooear avatar Mar 26 '23 04:03 mayooear

This PR allows users to add multiple subdirectories in docs and to include multiple files in each subdirectory. […]

Thanks for this PR, in particular the namespace topics. I'm due to release a multiple recursive directory/file loader feature next week, using LangChain for the sake of simplicity and consistency with the current structure of the repo. If we can link that with the dynamic creation of namespaces you've proposed, that would be great.

Can you add some tests to your PR? Cheers
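
For example, a minimal Jest-style round-trip test of the hash bookkeeping helpers sketched above could look like this (sketch only; assumes the helpers are exported from the ingest script, and the file paths here are hypothetical):

import fs from 'fs';
import { getPreviousHash, setPreviousHash } from '../scripts/ingest-data';

describe('ingest hash history', () => {
  const dir = './docs/example-topic'; // hypothetical topic folder

  beforeAll(() => {
    // the helpers read and write ./history/<topic>.json
    fs.mkdirSync('./history', { recursive: true });
  });

  afterAll(() => {
    fs.rmSync('./history/example-topic.json', { force: true });
  });

  it('returns an empty hash when no history file exists', () => {
    expect(getPreviousHash('./docs/does-not-exist').hash).toBe('');
  });

  it('round-trips a recorded hash', () => {
    setPreviousHash({ hash: 'abc123' }, dir);
    expect(getPreviousHash(dir).hash).toBe('abc123');
  });
});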

mayooear avatar Mar 26 '23 04:03 mayooear

Should each directory have its own index?

sebastienfi avatar Mar 30 '23 05:03 sebastienfi

Should each directory have its own index?

Pinecone index? The files within directories can be assigned to different namespaces within an index.

mayooear avatar Apr 01 '23 13:04 mayooear

oh, this is awesome! any chance for this to get merged?

CyrusZei avatar Apr 02 '23 20:04 CyrusZei

Should each directory have its own index?

Pinecone index? The files within directories can be assigned to different namespaces within an index.

Even better.

sebastienfi avatar Apr 05 '23 07:04 sebastienfi

So sorry guys, I'm quite new to GitHub and made quite a lot of mistakes with it. I have not tested this and it's not approved by me. I'll look into whether it's possible to reverse GitHub approvals.

Chugarah avatar Apr 08 '23 14:04 Chugarah

Thanks for the PR. Could you also add a Chroma option in this PR?

0rangeAppl3 avatar May 11 '23 04:05 0rangeAppl3

@mayooear can you please give an update on this PR and whether it's possible to merge it? Thank you

magedhelmy1 avatar May 25 '23 07:05 magedhelmy1