DocsGPT
Feature: Use S3 bucket as the vector store
Feature description
The user should be able to add an S3 bucket for storing and accessing their documents.
Why is this feature needed?
Storing the documents in the cloud makes them easier to share and reduces the storage usage on your local hard drive.
How do you aim to achieve this?
To store documents in an S3 bucket, you would pass a variable `S3_STORE=my-bucket-name` via the `.env` file. However, if you are running the application on your local machine, you will need to provide AWS credentials. The good news is that you can choose how to provide these credentials: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
When running the ingestion scripts, the result should be uploaded to the given S3 bucket.
The vector store in the application should then access the documents from the S3 bucket.
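As a sketch of the credential side, boto3 can resolve credentials on its own, so the application would only need the bucket name. The helper name below is mine, not from the codebase; `S3_STORE` is the variable proposed above:

```python
import os
from typing import Optional


def get_s3_bucket_name() -> Optional[str]:
    """Read the proposed S3_STORE variable; None means keep local storage."""
    return os.getenv("S3_STORE") or None


if __name__ == "__main__":
    # boto3 (third-party) resolves credentials via its standard chain:
    # environment variables, ~/.aws/credentials, or an instance profile.
    import boto3

    bucket = get_s3_bucket_name()
    if bucket:
        s3 = boto3.Session().client("s3")
        # List the documents currently in the bucket
        for obj in s3.list_objects_v2(Bucket=bucket).get("Contents", []):
            print(obj["Key"])
```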
It could look like this (note that LangChain's loader is `S3DirectoryLoader`, and FAISS persists via `save_local`/`load_local`, which write to a local path rather than directly to S3):

```python
import os

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import S3DirectoryLoader

embeddings = OpenAIEmbeddings()

if os.getenv("S3_STORE"):
    loader = S3DirectoryLoader(os.getenv("S3_STORE"))
    documents = loader.load()
    faiss_index = FAISS.from_documents(documents, embeddings)

    # Save the index locally, then upload the files to the S3 bucket
    faiss_index.save_local("faiss-index")

    # Later: download the index files from S3 and load them again
    faiss_index = FAISS.load_local("faiss-index", embeddings)
```
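Since FAISS writes `index.faiss` and `index.pkl` into a local directory, getting them into the bucket would be a separate upload step. A sketch under that assumption; `index_keys` is a name I made up for illustration, not part of the codebase:

```python
import os
from typing import Dict


def index_keys(save_folder: str, local_dir: str = "faiss-index") -> Dict[str, str]:
    """Map the local FAISS index files to their S3 keys under an
    optional folder prefix (empty prefix = bucket root)."""
    prefix = f"{save_folder.rstrip('/')}/" if save_folder else ""
    return {
        os.path.join(local_dir, name): f"{prefix}{name}"
        for name in ("index.faiss", "index.pkl")
    }


if __name__ == "__main__":
    # boto3 (third-party) is only needed for the actual upload.
    import boto3

    s3 = boto3.client("s3")
    for local_path, key in index_keys("faiss-index").items():
        s3.upload_file(local_path, os.environ["S3_STORE"], key)
```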
Additional Information
No response
Have you spent some time to check if this feature request has been raised before?
- [X] I checked and didn't find a similar issue
Are you willing to submit a PR?
Yes, I am willing to submit a PR!
@jolo-dev I see you're assigned to this issue. Is that because you created it, or are you currently working on it? If you're not working on it, I'd love to take this issue.
Thanks! Happy coding!
@jaredbradley243 Thanks for your interest. I am currently working on it, but if you want, I can assign it to you.
That's very kind, but if you're already working on it, keep going!
@jaredbradley243 No no. Really. Let me be your reviewer then ;)
Hahah. It's gonna take me a bit of time to work on this. I'll need to re-familiarize myself with the codebase and I won't have time to get started for a week or two, but if everyone is alright with waiting, I'll happily take it off your hands!
@jaredbradley243 No worries! I'll check back in a few weeks on this.
@jaredbradley243 Any update on this ????
Hey @Rajesh983. Sorry for the delay, I just saw your comment!
I have updated the script to allow S3 to be used as document and vector storage. If AWS credentials are detected in the .env file, the script will download documents from a given S3 bucket/folder and parse the documents. Once the documents are parsed, the resulting index.faiss and index.pkl files are saved back into the S3 bucket in the folder of the user's choosing.
However, I still need to implement AWS role assumption.
I had to take a break as I have an exam coming up, but I should finish the script soon!
If you'd like to preview my code and test it out early, let me know.
Hey quick question, I need this feature and since it isn't out yet I am planning on building it out myself just for my own use-case.
Do I need to save the pickle file locally first before uploading to S3? Or is there a way to write the `langchain.vectorstores.faiss.FAISS` object (the vector store) straight into S3 as a pickle file? Apologies for my naïveté.
Hey @fundmatch-dev!
I finished this feature yesterday, I'm just writing a readme for it! Would you like to test it out for me?
Hi @jaredbradley243 I just finished implementing it manually for myself, it seems to work. But hey I don't mind helping if you can just explain what I need to do! Want to hop on a call or?
I'm happy to hop on a call with you tomorrow, if you're free! (It's 8PM here in Los Angeles).
In the meantime, you can replace your ingest.py and open_ai_func.py files with these updated versions:
And here are some instructions:
Script Functionality
- Local Mode (default): Processes documents from local directories specified by the user.
- S3 Mode (`--s3`):
  - Downloads documents from an S3 bucket to a temporary local storage (`s3_temp_storage`).
  - Processes these documents.
  - Uploads the processed documents back to the S3 bucket.
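As an illustration of the download step, a small filter like the one below could skip previously generated vector files, matching the behaviour described for `S3_DOCUMENTS_FOLDER` (the function name and signature are my assumption, not the actual implementation):

```python
from typing import List


def downloadable_keys(keys: List[str], folder: str = "") -> List[str]:
    """Keep only document keys inside the optional folder, skipping
    previously generated .faiss / .pkl vector files."""
    prefix = f"{folder.rstrip('/')}/" if folder else ""
    return [
        key for key in keys
        if key.startswith(prefix) and not key.endswith((".faiss", ".pkl"))
    ]
```

For example, `downloadable_keys(["docs/a.pdf", "docs/index.faiss"], "docs")` keeps only `docs/a.pdf`.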
Enabling S3 Storage
To enable S3 storage, use the --s3 flag when running the script.
- Environment Variables: Set these variables in your `.env` file:
  - `S3_BUCKET`: Name of your S3 bucket.
  - `S3_DOCUMENTS_FOLDER`: Folder within the S3 bucket where your documents are stored. If left blank, all files in the S3 bucket will be downloaded (except `.faiss` and `.pkl`).
  - `S3_SAVE_FOLDER`: Folder within the S3 bucket in which you would like to save the vector files. Leave blank to use the root of the bucket.
- Running the Script:

  ```
  python ingest.py ingest --s3
  ```
Enabling Role Assumption
If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the --s3-assume flag and proper AWS configuration.
- Environment Variable:
  - Add `AWS_ASSUME_ROLE_PROFILE` to your `.env` file with the name of the AWS profile for role assumption. Ex: `AWS_ASSUME_ROLE_PROFILE="dev"`
- AWS Configuration:
  - Credentials File (`~/.aws/credentials`):

    ```ini
    [default]
    aws_access_key_id = YOUR_DEFAULT_ACCESS_KEY
    aws_secret_access_key = YOUR_DEFAULT_SECRET_KEY

    [iamadmin]
    aws_access_key_id = EXAMPLEKEY123456
    aws_secret_access_key = EXAMPLESECRETKEY123456
    ```

  - Config File (`~/.aws/config`):

    ```ini
    [default]
    region = us-west-2
    output = json

    [profile dev]
    region = us-west-2
    role_arn = arn:aws:iam::123456789012:role/YourRoleName
    source_profile = iamadmin
    ```
- Running the Script with Role Assumption:

  ```
  python your_script.py --s3 --s3-assume
  ```
This configuration allows the script to assume YourRoleName using the credentials from the iamadmin profile.
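For reference, boto3 performs the `sts:AssumeRole` call automatically when a profile with `role_arn` and `source_profile` is selected, so the script only has to pick the right profile. A sketch under that assumption; `resolve_profile` is my own name, and `AWS_ASSUME_ROLE_PROFILE` is the variable from the instructions above:

```python
import os
from typing import Optional


def resolve_profile(assume: bool) -> Optional[str]:
    """Return the assume-role profile when --s3-assume was passed,
    otherwise None so boto3 falls back to the default profile."""
    return os.getenv("AWS_ASSUME_ROLE_PROFILE") if assume else None


if __name__ == "__main__":
    # boto3 (third-party) reads role_arn/source_profile from ~/.aws/config
    # and assumes YourRoleName with the iamadmin credentials automatically.
    import boto3

    session = boto3.Session(profile_name=resolve_profile(True))
    s3 = session.client("s3")
```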
Note
- Ensure that the IAM role (`YourRoleName`) has the necessary permissions for accessing the specified S3 bucket.
- The script will create a temporary local storage (`s3_temp_storage`) for processing S3 documents, which will be cleaned up after processing.
Let me know if you have any difficulty, or if you find the instructions difficult to follow!
This seems to be a sought-after feature; I'm glad I got the chance to work on it!
Hey @jaredbradley243, Great work. But why is this completed? The PR is still open :D
Thank you! I blame the over-excitement on the holiday season. Issue reopened.
Folks: what is the ETA for completing this feature? This would allow stand-alone conversion of S3 documents into vectors, right? Will we have a separate index/ID for each document after the conversion? Trying to wrap my head around it.
Hi, I am trying to store my FAISS vector store in Azure Blob Storage. Is there any functionality that can help me with that? Thanks.