gpt4-pdf-chatbot-langchain
Support multiple PDFs and multiple topics
This PR allows users to add multiple subdirectories in docs and to include multiple files in each subdirectory. Running run ingest will automatically ingest all directories and all PDF files in those directories, and will create namespaces that match the subdirectory names. The user can then switch between topics on the home page.
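For example, with hypothetical topic names, the expected layout would look something like this, where each subdirectory name becomes the Pinecone namespace for its PDFs:

docs/
  legal-contracts/
    contract-one.pdf
    contract-two.pdf
  research-papers/
    paper-one.pdf

Switching topics on the home page then switches between the legal-contracts and research-papers namespaces.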
I checked the PRs to see if anyone had made this yet. Thanks a lot for making it! I haven't been able to run it, however, since run ingest returns an error:

error [TypeError: t.replaceAll is not a function]
(node:18260) UnhandledPromiseRejectionWarning (Use `node --trace-warnings ...` to show where the warning was created)
(node:18260) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 42)
(node:18260) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Let me know if I can give you more information to better reproduce. The error happened right after it printed "Processing file" and the contents of the file split into chunks (but it did not progress to the next files in the directory).
Hey @Sofianel5, I just tried it on a different computer with a new index and everything worked as expected. I'm wondering what's going wrong for you. Are you using pnpm? Did you put all of the PDFs in subfolders named with lowercase letters a-z and hyphens? Do your subfolder names match your namespaces in the topics?
I believe it is in the await PineconeStore.fromDocuments(index, chunk, embeddings, 'text', namespace); call. The arguments are of type object, object, object, string, string, respectively. I'm just using npm; does pnpm matter? My subfolder is in a valid format, and I added a definition of it to pinecone.ts, but that doesn't seem to be referenced in ingest-data.ts.
Of course, the issue is that my node version was less than 15. Classic.
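For context: String.prototype.replaceAll was only added in Node 15, so on older runtimes any replaceAll call throws exactly this TypeError. If upgrading isn't an option, a global-regex replace is equivalent for literal string patterns (the text variable below is just illustrative):

const cleaned = text.replace(/\n/g, ' '); // same result as text.replaceAll('\n', ' ') on Node 15+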
I added some code to handle adding new directories without re-uploading all the data from the other, unchanged directories, using merkle trees. File-level changes are still a WIP; right now it will re-upload an entire folder if and only if a change is detected within that folder:
import fs from 'fs';
import path from 'path';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import { pinecone } from '@/utils/pinecone-client';
import { PDFLoader } from 'langchain/document_loaders';
import { PINECONE_INDEX_NAME } from '@/config/pinecone';
import { hashElement } from 'folder-hash';

/* Read the previously stored hash for a directory; fall back to an empty hash
   when no history file exists yet. */
function getPreviousHash(fulldir: string): { hash: string } {
  try {
    const dirname = path.basename(fulldir);
    const data = fs.readFileSync(`./history/${dirname}.json`, 'utf8');
    const parsed = JSON.parse(data);
    console.log('parsed', typeof parsed, parsed);
    return parsed;
  } catch (err) {
    console.error(err);
    return { hash: '' };
  }
}

/* Persist the latest hash for a directory so future runs can detect changes. */
function setPreviousHash(hash: object, fulldir: string) {
  try {
    const data = JSON.stringify(hash);
    const dirname = path.basename(fulldir);
    fs.writeFileSync(`./history/${dirname}.json`, data);
  } catch (err) {
    console.error(err);
  }
}

/* Re-hash a directory after a successful ingest and record the result locally. */
function recordFinishedLocally(directory: string) {
  const options = {
    folders: { exclude: ['.*', 'node_modules', 'test_coverage'] },
    files: { include: ['*.pdf'] },
  };
  hashElement(directory, options)
    .then((hash: any) => {
      setPreviousHash(hash, directory);
    })
    .catch((error: any) => {
      console.error('hashing failed:', error);
    });
}

/* Compare the directory's current merkle hash with the stored one and run the
   callback only when a change is detected. */
async function checkDiff(directory: string, callback: () => Promise<void>) {
  const prevHash = getPreviousHash(directory);
  const options = {
    folders: { exclude: ['.*', 'node_modules', 'test_coverage'] },
    files: { include: ['*.pdf'] },
  };
  let newHash: any;
  try {
    newHash = await hashElement(directory, options);
  } catch (error) {
    console.error('hashing failed:', error);
    return;
  }
  if (newHash.hash === prevHash.hash) {
    console.log('no changes detected');
    return;
  }
  console.log('changes detected');
  console.log('newHash', newHash.hash, 'prevHash', prevHash.hash);
  await callback();
}

export const run = async () => {
  try {
    /* Load all subdirectories of docs */
    const directories = fs
      .readdirSync('./docs')
      .filter((file) => {
        return fs.statSync(path.join('./docs', file)).isDirectory();
      })
      .map((dir) => `./docs/${dir}`); // Add prefix 'docs/' to directory names
    console.log('directories: ', directories);
    for (const directory of directories) {
      /* Only re-ingest the directory when its merkle hash has changed */
      await checkDiff(directory, async () => {
        /* Load all PDF files in the directory */
        const files = fs
          .readdirSync(directory)
          .filter((file) => path.extname(file) === '.pdf');
        for (const file of files) {
          console.log(`Processing file: ${file}`);
          /* Load raw docs from the pdf file */
          const filePath = path.join(directory, file);
          const loader = new PDFLoader(filePath);
          const rawDocs = await loader.load();
          /* Split text into chunks */
          const textSplitter = new RecursiveCharacterTextSplitter({
            chunkSize: 1000,
            chunkOverlap: 200,
          });
          const docs = await textSplitter.splitDocuments(rawDocs);
          /* Create and store the embeddings in the vector store */
          const embeddings = new OpenAIEmbeddings();
          const index = pinecone.Index(PINECONE_INDEX_NAME);
          const namespace = path.basename(directory); // use the directory name as the namespace
          /* Embed the PDF documents in batches; Pinecone recommends a limit of
             100 vectors per upsert request to avoid errors */
          const chunkSize = 50;
          for (let i = 0; i < docs.length; i += chunkSize) {
            const chunk = docs.slice(i, i + chunkSize);
            await PineconeStore.fromDocuments(
              index,
              chunk,
              embeddings,
              'text',
              namespace,
            );
          }
          console.log(`File ${file} processed`);
          recordFinishedLocally(directory);
        }
      });
    }
  } catch (error) {
    console.log('error', error);
    throw new Error('Failed to ingest your data');
  }
};

(async () => {
  await run();
  console.log('completed ingestion of all PDF files in all directories');
})();
Note: you'll also need to create a history folder in the top-level directory.
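If you'd prefer the script to handle that itself, a one-line sketch that could go at the start of run() (the recursive option makes it a no-op when the folder already exists):

fs.mkdirSync('./history', { recursive: true });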
Is the diff check working on your local environment? The main challenge with managing vectorstores is updating and modifying files in a way that cost-effectively upserts without triggering a wholesale ingestion.
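One possible direction for the file-level diff, sketched here under the assumption that folder-hash keeps per-file hashes in the result's children array (the changedFiles helper is hypothetical, not part of the PR): compare the stored tree against the fresh one and only re-upload PDFs whose hashes differ.

function changedFiles(prevTree: any, newTree: any): string[] {
  // Map each previously hashed file name to its hash.
  const prev: Record<string, string> = {};
  for (const child of prevTree.children ?? []) {
    prev[child.name] = child.hash;
  }
  // Keep only files that are new or whose hash changed.
  return (newTree.children ?? [])
    .filter((c: any) => prev[c.name] !== c.hash)
    .map((c: any) => c.name);
}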
Thanks for this PR, in particular the namespace topics. I'm due to release a multiple recursive directory/file loader feature next week, using LangChain for the sake of simplicity and consistency with the current structure of the repo. If we can link that with the dynamic creation of namespaces you have proposed, that would be great.
Can you add some tests to your PR? Cheers
Should each directory have its own index?
Pinecone index? The files within directories can be assigned to different namespaces within an index.
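For illustration, reusing the fromDocuments call from the ingest script above with made-up topic names: both calls target the same index, only the namespace differs, so each topic's vectors stay isolated.

await PineconeStore.fromDocuments(index, legalChunk, embeddings, 'text', 'legal-contracts');
await PineconeStore.fromDocuments(index, researchChunk, embeddings, 'text', 'research-papers');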
oh, this is awesome! any chance for this to get merged?
Even better.
So sorry guys, I'm quite new to GitHub and made quite a lot of mistakes with it. I have not tested this and it's not approved by me. I will look into whether it's possible to reverse GitHub approvals.
Thanks for the PR, could you also add a Chroma option in this PR?
@mayooear can you please give an update on this PR and whether it is possible to merge it? Thank you.