
🚀 Feature: Use an S3 bucket as the vector store

Open jolo-dev opened this issue 2 years ago β€’ 17 comments

🔖 Feature description

The user should be able to add an S3 bucket for storing and accessing their documents.

🎤 Why is this feature needed?

Storing documents in the cloud makes them easier to share and reduces the storage usage on your hard drive.

✌️ How do you aim to achieve this?

To store documents in an S3 bucket, you would pass a variable S3_STORE=my-bucket-name via the .env file. However, if you are running the application on your local machine, you will need to provide AWS credentials. The good news is that boto3 supports several ways of providing them: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
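To make this concrete, here is a minimal sketch of how the application could pick up the variable and build a client. This assumes boto3 and an S3_STORE value of the form bucket-name/optional/prefix; the helper names are illustrative, not part of DocsGPT:

```python
import os


def split_store(value):
    """Split an S3_STORE value like 'my-bucket-name/docs' into (bucket, prefix)."""
    bucket, _, prefix = value.partition("/")
    return bucket, prefix


def make_s3_client(profile=None):
    """Create an S3 client; boto3 resolves credentials automatically from
    environment variables, ~/.aws/credentials, or an instance/role identity."""
    import boto3  # imported lazily so split_store() needs no AWS setup

    session = boto3.Session(profile_name=profile) if profile else boto3.Session()
    return session.client("s3")
```

Because boto3 handles the credential chain itself, the application never has to parse keys out of the .env file directly.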

When running the ingest scripts, the resulting index should be uploaded to the given S3 bucket.

The vector store in the application should then access the documents from the S3 bucket.

It could look like this (sketch; the exact loader and FAISS APIs depend on the langchain version):


import os

from langchain.document_loaders import S3DirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()

if os.getenv("S3_STORE"):
    # boto3 resolves the AWS credentials behind the scenes
    loader = S3DirectoryLoader(os.getenv("S3_STORE"))
    documents = loader.load()

faiss_index = FAISS.from_documents(documents, embeddings)

# FAISS saves to and loads from a local folder; syncing that folder
# with the S3 bucket would be a separate boto3 upload/download step
faiss_index.save_local("faiss-index")

faiss_index = FAISS.load_local("faiss-index", embeddings)
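Since langchain's FAISS store saves to the local filesystem (an index.faiss/index.pkl pair), pushing those files to the bucket would be a separate step. A sketch, assuming boto3; the function names and the "faiss-index" prefix are my own:

```python
import os


def index_keys(prefix="faiss-index"):
    """Object keys the upload below would create in the bucket."""
    return [f"{prefix}/index.faiss", f"{prefix}/index.pkl"]


def upload_index(local_dir, bucket, prefix="faiss-index"):
    """Upload the files written by FAISS.save_local(local_dir) to s3://bucket/prefix/."""
    import boto3  # lazy import: only needed for the actual upload

    s3 = boto3.client("s3")
    for key in index_keys(prefix):
        s3.upload_file(os.path.join(local_dir, os.path.basename(key)), bucket, key)
```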

πŸ”„οΈ Additional Information

No response

👀 Have you spent some time to check if this feature request has been raised before?

  • [X] I checked and didn't find a similar issue

Are you willing to submit PR?

Yes, I am willing to submit a PR!

jolo-dev avatar Oct 29 '23 11:10 jolo-dev

@jolo-dev I see you're assigned to this issue. Is that because you created it, or are you currently working on it? If you're not working on it, I'd love to take this issue.

Thanks! Happy coding! 😊

jaredbradley243 avatar Nov 02 '23 16:11 jaredbradley243

@jaredbradley243 Thanks for your interest. I am currently working on it, but if you want, you can assign it to yourself 😉

jolo-dev avatar Nov 02 '23 16:11 jolo-dev

That's very kind, but if you're already working on it, keep going! 😃

jaredbradley243 avatar Nov 02 '23 17:11 jaredbradley243

@jaredbradley243 No no. Really. Let me be your reviewer then ;)

jolo-dev avatar Nov 02 '23 17:11 jolo-dev

Hahah. It's gonna take me a bit of time to work on this. I'll need to re-familiarize myself with the codebase and I won't have time to get started for a week or two, but if everyone is alright with waiting, I'll happily take it off your hands!

jaredbradley243 avatar Nov 02 '23 18:11 jaredbradley243

@jaredbradley243 No worries! I'll check back in a few weeks on this.

dartpain avatar Nov 03 '23 12:11 dartpain

@jaredbradley243 Any update on this?

Rajesh983 avatar Dec 05 '23 10:12 Rajesh983

@jaredbradley243 Any update on this?

Hey @Rajesh983. Sorry for the delay, I just saw your comment!

I have updated the script to allow S3 to be used as document and vector storage. If AWS credentials are detected in the .env file, the script will download documents from a given S3 bucket/folder and parse the documents. Once the documents are parsed, the resulting index.faiss and index.pkl files are saved back into the S3 bucket in the folder of the user's choosing.
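The download step could look roughly like this; a sketch assuming boto3, with the vector-file filtering described above. The helper names are mine, not the actual script's:

```python
import os


def is_document_key(key):
    """Skip folder placeholders and previously saved vector files."""
    return not key.endswith(("/", ".faiss", ".pkl"))


def local_path_for(key, folder, dest):
    """Local path under dest for an S3 key, with the folder prefix stripped."""
    relative = key[len(folder):].lstrip("/") if folder and key.startswith(folder) else key
    return os.path.join(dest, relative)


def download_documents(bucket, folder, dest):
    """Download every document under s3://bucket/folder into dest."""
    import boto3  # lazy import: the helpers above need no AWS setup

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=folder):
        for obj in page.get("Contents", []):
            if not is_document_key(obj["Key"]):
                continue
            path = local_path_for(obj["Key"], folder, dest)
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            s3.download_file(bucket, obj["Key"], path)
```

The paginator matters here: list_objects_v2 returns at most 1000 keys per call, so iterating pages keeps the sketch correct for larger buckets.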

However, I still need to implement AWS role assumption.

I had to take a break as I have an exam coming up, but I should finish the script soon!

If you'd like to preview my code and test it out early, let me know.

jaredbradley243 avatar Dec 09 '23 05:12 jaredbradley243

Hey, quick question: I need this feature, and since it isn't out yet I am planning on building it myself just for my own use case.

Do I need to save the pickle file locally first before uploading to S3? Or is there a way to write the langchain.vectorstores.faiss.FAISS object (the vector store) straight into S3 as a pickle file? Apologies for my naïveté.

fundmatch-dev avatar Dec 23 '23 03:12 fundmatch-dev

Hey, quick question: I need this feature, and since it isn't out yet I am planning on building it myself just for my own use case.

Do I need to save the pickle file locally first before uploading to S3? Or is there a way to write the langchain.vectorstores.faiss.FAISS object (the vector store) straight into S3 as a pickle file? Apologies for my naïveté.

Hey @fundmatch-dev!

I finished this feature yesterday, I'm just writing a readme for it! Would you like to test it out for me?

jaredbradley243 avatar Dec 23 '23 03:12 jaredbradley243

Hi @jaredbradley243, I just finished implementing it manually for myself, and it seems to work. But hey, I don't mind helping if you can just explain what I need to do! Want to hop on a call?
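For anyone with the same question: one way to skip the local file entirely, assuming the installed langchain version exposes FAISS.serialize_to_bytes() and FAISS.deserialize_from_bytes() (API details vary across versions), is to stream the bytes straight to S3. A sketch, with the S3 client injected so the call is easy to stub:

```python
def save_store_to_s3(store, s3, bucket, key="faiss-index/index.pkl"):
    """Write a langchain FAISS store straight into S3, no local file.
    store: a FAISS vector store; s3: a boto3 S3 client (injected for testability)."""
    s3.put_object(Bucket=bucket, Key=key, Body=store.serialize_to_bytes())


def load_store_from_s3(embeddings, s3, bucket, key="faiss-index/index.pkl"):
    """Read the serialized store back from S3 and rebuild it."""
    from langchain.vectorstores import FAISS  # lazy: only needed when loading

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return FAISS.deserialize_from_bytes(body, embeddings)
```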

fundmatch-dev avatar Dec 23 '23 03:12 fundmatch-dev

Hi @jaredbradley243, I just finished implementing it manually for myself, and it seems to work. But hey, I don't mind helping if you can just explain what I need to do! Want to hop on a call?

I'm happy to hop on a call with you tomorrow, if you're free! (It's 8PM here in Los Angeles).

In the meantime, you can replace your ingest.py and open_ai_func.py files with these updated versions:

docsgpt.zip

And here are some instructions:

Script Functionality

  1. Local Mode (default): Processes documents from local directories specified by the user.
  2. S3 Mode (--s3):
    • Downloads documents from the S3 bucket into temporary local storage (s3_temp_storage).
    • Processes these documents.
    • Uploads the resulting vector files back to the S3 bucket.

Enabling S3 Storage

To enable S3 storage, use the --s3 flag when running the script.

  1. Environment Variables: Set these variables in your .env file:

    • S3_BUCKET: Name of your S3 bucket.
    • S3_DOCUMENTS_FOLDER: Folder within the S3 bucket where your documents are stored. If left blank, all files in the S3 bucket will be downloaded (except .faiss and .pkl files).
    • S3_SAVE_FOLDER: Folder within the S3 bucket in which you would like to save the vector files. Leave blank to use the root of the bucket.
  2. Running the Script:

    • python ingest.py ingest --s3

Enabling Role Assumption

If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the --s3-assume flag and proper AWS configuration.

  1. Environment Variable:
  • Add AWS_ASSUME_ROLE_PROFILE to your .env file with the name of the AWS profile for role assumption. Ex: AWS_ASSUME_ROLE_PROFILE="dev"
  2. AWS Configuration:
  • Credentials File (~/.aws/credentials):
    [default]
    aws_access_key_id = YOUR_DEFAULT_ACCESS_KEY
    aws_secret_access_key = YOUR_DEFAULT_SECRET_KEY
    
    [iamadmin]
    aws_access_key_id = EXAMPLEKEY123456
    aws_secret_access_key = EXAMPLESECRETKEY123456
    
  • Config File (~/.aws/config):
    [default]
    region = us-west-2
    output = json
    
    [profile dev]
    region = us-west-2
    role_arn = arn:aws:iam::123456789012:role/YourRoleName
    source_profile = iamadmin
    
  3. Running the Script with Role Assumption:

    • python ingest.py ingest --s3 --s3-assume

This configuration allows the script to assume YourRoleName using the credentials from the iamadmin profile.
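In code, the role assumption amounts to picking the right boto3 session: when a profile's config sets role_arn and source_profile, boto3 performs the STS AssumeRole call transparently. A sketch (the helper names are mine, not the script's):

```python
import os


def assume_role_profile():
    """Profile name from AWS_ASSUME_ROLE_PROFILE, or None when unset/blank."""
    return os.getenv("AWS_ASSUME_ROLE_PROFILE") or None


def s3_client(assume=False):
    """Build an S3 client, optionally through a role-assuming profile."""
    import boto3  # lazy import: assume_role_profile() needs no AWS setup

    profile = assume_role_profile() if assume else None
    # With profile_name set, boto3 reads ~/.aws/config and assumes the
    # configured role the first time the session is actually used
    session = boto3.Session(profile_name=profile) if profile else boto3.Session()
    return session.client("s3")
```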

Note

  • Ensure that the IAM role (YourRoleName) has the necessary permissions to access the specified S3 bucket.
  • The script will create a temporary local storage (s3_temp_storage) for processing S3 documents, which will be cleaned up after processing.
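The cleanup guarantee mentioned above is the kind of thing Python's tempfile module gives for free; a sketch of the pattern, with the processing step injected as a callback (not the script's actual structure):

```python
import os
import tempfile


def ingest_from_s3(process):
    """Run `process` over documents downloaded into a temporary directory.
    The directory is removed automatically when the block exits, even if
    processing raises an exception."""
    with tempfile.TemporaryDirectory(prefix="s3_temp_storage_") as tmp:
        # download_documents(bucket, folder, tmp) would run here
        return process(tmp)
```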

jaredbradley243 avatar Dec 23 '23 04:12 jaredbradley243

Let me know if you run into any difficulty, or if you find the instructions hard to follow! 😁

This seems to be a sought-after feature; I'm glad I got the chance to work on it!

jaredbradley243 avatar Dec 23 '23 04:12 jaredbradley243

Hey @jaredbradley243, Great work. But why is this completed? The PR is still open :D

jolo-dev avatar Dec 27 '23 16:12 jolo-dev

Hey @jaredbradley243, Great work. But why is this completed? The PR is still open :D

Thank you! I blame the over-excitement on the holiday season. 😂 Issue reopened.

jaredbradley243 avatar Dec 27 '23 23:12 jaredbradley243

Folks, what is the ETA for completing this feature? This would allow stand-alone conversion of S3 documents into vectors, right? Will we have a separate index/ID for each document after the conversion? Trying to wrap my head around it.

bazooka720 avatar Jan 15 '24 16:01 bazooka720

Hi, I am trying to store my FAISS vector store in Azure Blob Storage. Is there any functionality that can help me with that? Thanks!

pandey0039 avatar May 16 '24 10:05 pandey0039