
Update Scripts to integrate AWS S3

Open jaredbradley243 opened this issue 1 year ago • 10 comments

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...) This PR modifies the ingest.py and open_ai_func.py scripts to integrate AWS S3 into DocsGPT.

  • Why was this change needed? (You can also link to an open issue here) https://github.com/arc53/DocsGPT/issues/724

  • Other information:

jaredbradley243 avatar Dec 25 '23 23:12 jaredbradley243

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name             Status               Preview         Comments          Updated (UTC)
nextra-docsgpt   ✅ Ready (Inspect)   Visit Preview   💬 Add feedback   Dec 25, 2023 11:26pm

vercel[bot] avatar Dec 25 '23 23:12 vercel[bot]

@jaredbradley243 is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Dec 25 '23 23:12 vercel[bot]

Codecov Report

Attention: Patch coverage is 0% with 93 lines in your changes missing coverage. Please review.

Project coverage is 19.55%. Comparing base (7f79363) to head (30f2171). Report is 180 commits behind head on main.

Files                            Patch %   Lines
scripts/ingest.py                0.00%     59 Missing :warning:
scripts/parser/open_ai_func.py   0.00%     34 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   19.56%   19.55%   -0.01%     
==========================================
  Files          62       72      +10     
  Lines        2914     3340     +426     
==========================================
+ Hits          570      653      +83     
- Misses       2344     2687     +343     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Dec 25 '23 23:12 codecov[bot]

Script Functionality

  1. Local Mode (default): Processes documents from local directories specified by the user.
  2. S3 Mode (--s3):
    • Downloads documents from an S3 bucket to temporary local storage (s3_temp_storage).
    • Processes these documents.
    • Uploads the processed documents back to the S3 bucket (see the sketch after this list).
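
For illustration, a minimal sketch of the S3 flow described above, assuming boto3. The helper names and the s3_temp_storage layout here are illustrative and not necessarily the PR's exact implementation:

```python
import os

import boto3

TEMP_DIR = "s3_temp_storage"


def download_documents(bucket: str, prefix: str = "") -> None:
    """Copy documents from the S3 bucket into the local temp folder."""
    s3 = boto3.client("s3")
    # Note: list_objects_v2 returns at most 1000 keys; a paginator would be
    # needed for larger buckets.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        key = obj["Key"]
        # Skip folder placeholders and previously generated vector files.
        if key.endswith("/") or key.endswith((".faiss", ".pkl")):
            continue
        local_path = os.path.join(TEMP_DIR, key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)


def upload_vectors(bucket: str, save_folder: str = "") -> None:
    """Upload the generated vector store files back to the bucket."""
    s3 = boto3.client("s3")
    for name in ("index.faiss", "index.pkl"):
        key = f"{save_folder}/{name}" if save_folder else name
        s3.upload_file(os.path.join(TEMP_DIR, name), bucket, key)
```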

Enabling S3 Storage

To enable S3 storage, use the --s3 flag when running the script.

  1. Environment Variables: Set these variables in your .env file:

    • S3_BUCKET: Name of your S3 bucket.
    • S3_DOCUMENTS_FOLDER: Folder within the S3 bucket where your documents are stored. If left blank, all files in the S3 bucket will be downloaded (except .faiss and .pkl files).
    • S3_SAVE_FOLDER: Folder within the S3 bucket in which you would like to save the vector files. Leave blank to use the root of the bucket.
  2. Running the Script (see the sketch after this list):

    • python ingest.py ingest --s3
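
A rough sketch of how the flag and environment variables might be wired together, assuming a Typer-style CLI (the ingest subcommand suggests one) and python-dotenv for the .env file; the helper calls in the comments refer to the sketch above and are hypothetical:

```python
import os

import typer
from dotenv import load_dotenv

app = typer.Typer()


@app.command()
def ingest(s3: bool = typer.Option(False, "--s3", help="Read documents from and write vectors to S3")):
    load_dotenv()  # pull the S3_* variables from the .env file
    if s3:
        bucket = os.environ["S3_BUCKET"]                     # required
        docs_folder = os.getenv("S3_DOCUMENTS_FOLDER", "")   # blank -> whole bucket
        save_folder = os.getenv("S3_SAVE_FOLDER", "")        # blank -> bucket root
        # download_documents(bucket, docs_folder) ... parse ... upload_vectors(bucket, save_folder)
    else:
        # Default local mode: process documents from local directories.
        ...


if __name__ == "__main__":
    app()
```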

Enabling Role Assumption

If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the --s3-assume flag and proper AWS configuration.

  1. Environment Variable:

    • Add AWS_ASSUME_ROLE_PROFILE to your .env file with the name of the AWS profile for role assumption. Ex: AWS_ASSUME_ROLE_PROFILE="dev"
  2. AWS Configuration:
  • Credentials File (~/.aws/credentials):
    [default]
    aws_access_key_id = YOUR_DEFAULT_ACCESS_KEY
    aws_secret_access_key = YOUR_DEFAULT_SECRET_KEY
    
    [iamadmin]
    aws_access_key_id = EXAMPLEKEY123456
    aws_secret_access_key = EXAMPLESECRETKEY123456
    
  • Config File (~/.aws/config):
    [default]
    region = us-west-2
    output = json
    
    [profile dev]
    region = us-west-2
    role_arn = arn:aws:iam::123456789012:role/YourRoleName
    source_profile = iamadmin
    
  3. Running the Script with Role Assumption:

    • python ingest.py ingest --s3 --s3-assume

This configuration allows the script to assume YourRoleName using the credentials from the iamadmin profile.
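
For context on what the --s3-assume path does under the hood: boto3 resolves role_arn/source_profile profiles automatically, so the script essentially only needs to open a session with the configured profile. A sketch, assuming the variable and flag names above:

```python
import os

import boto3


def build_s3_client(assume_role: bool):
    """Return an S3 client, optionally via the assume-role profile."""
    if assume_role:
        # boto3 reads role_arn and source_profile for this profile from
        # ~/.aws/config and performs the sts:AssumeRole call itself.
        profile = os.environ["AWS_ASSUME_ROLE_PROFILE"]  # e.g. "dev"
        return boto3.Session(profile_name=profile).client("s3")
    # Otherwise use the default credential chain ([default] profile, env vars, ...).
    return boto3.client("s3")
```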

Note

  • Ensure that the IAM role (YourRoleName) has the necessary permissions to access the specified S3 bucket.
  • The script creates temporary local storage (s3_temp_storage) for processing S3 documents, which is cleaned up after processing (see the sketch below).
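
The temporary-storage lifecycle could be handled with a try/finally block along these lines (illustrative only; the function name is hypothetical):

```python
import shutil
import tempfile


def process_s3_documents(bucket: str) -> None:
    # The PR uses a fixed s3_temp_storage folder; a tempfile-based variant is
    # shown here only to illustrate the create/clean-up lifecycle.
    temp_dir = tempfile.mkdtemp(prefix="s3_temp_storage_")
    try:
        ...  # download from S3 into temp_dir, parse, build vectors, upload results
    finally:
        # Remove the temporary local copy whether or not processing succeeded.
        shutil.rmtree(temp_dir, ignore_errors=True)
```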

jaredbradley243 avatar Dec 25 '23 23:12 jaredbradley243

To me, the following parameters look more logical to pass as command-line parameters rather than as environment variables: S3_BUCKET, S3_SAVE_FOLDER, AWS_ASSUME_ROLE_PROFILE.

larinam avatar Jan 21 '24 19:01 larinam

Thanks for the change requests, @larinam. I'll get to these soon!

jaredbradley243 avatar Feb 09 '24 21:02 jaredbradley243

Working on these now.

jaredbradley243 avatar Mar 22 '24 19:03 jaredbradley243

> To me, the following parameters look more logical to pass as command-line parameters rather than as environment variables: S3_BUCKET, S3_SAVE_FOLDER, AWS_ASSUME_ROLE_PROFILE.

I added command line parameter support as well as env variables. 😁

jaredbradley243 avatar Mar 22 '24 22:03 jaredbradley243

In general, according to the ticket, support for S3 should be added to the application as well. In this PR the changes are only in the scripts package. Please refer to the initial ticket for more information.

larinam avatar Mar 30 '24 10:03 larinam

To avoid passing the S3 config through several calls, please consider using https://docs.pydantic.dev/latest/concepts/pydantic_settings/. The configuration is static and doesn't change through the lifecycle of the script, so please consider using a static object to manage these settings.
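
For illustration, a minimal sketch of what such a settings object could look like with pydantic-settings (the class and field names simply mirror the environment variables discussed above and are not from the PR):

```python
from functools import lru_cache

from pydantic_settings import BaseSettings, SettingsConfigDict


class S3Settings(BaseSettings):
    # Environment variables are matched case-insensitively, so S3_BUCKET in
    # the .env file populates s3_bucket here.
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    s3_bucket: str = ""
    s3_documents_folder: str = ""
    s3_save_folder: str = ""
    aws_assume_role_profile: str = ""


@lru_cache
def get_s3_settings() -> S3Settings:
    # A single cached instance for the whole run, instead of threading the
    # config through every call.
    return S3Settings()
```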

larinam avatar Mar 30 '24 10:03 larinam