
Update Scripts to integrate AWS S3

Open jaredbradley243 opened this issue 1 year ago • 10 comments

  • What kind of change does this PR introduce? (Bug fix, feature, docs update, ...) This PR modifies the ingest.py and open_ai_func.py scripts to integrate AWS S3 into DocsGPT.

  • Why was this change needed? (You can also link to an open issue here) https://github.com/arc53/DocsGPT/issues/724

  • Other information:

jaredbradley243 avatar Dec 25 '23 23:12 jaredbradley243

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name             Status               Preview         Comments          Updated (UTC)
nextra-docsgpt   ✅ Ready (Inspect)   Visit Preview   💬 Add feedback   Dec 25, 2023 11:26pm

vercel[bot] avatar Dec 25 '23 23:12 vercel[bot]

@jaredbradley243 is attempting to deploy a commit to the Arc53 Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Dec 25 '23 23:12 vercel[bot]

Codecov Report

Attention: Patch coverage is 0% with 93 lines in your changes missing coverage. Please review.

Project coverage is 19.55%. Comparing base (7f79363) to head (30f2171). Report is 180 commits behind head on main.

Files                            Patch %   Lines
scripts/ingest.py                0.00%     59 Missing :warning:
scripts/parser/open_ai_func.py   0.00%     34 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   19.56%   19.55%   -0.01%     
==========================================
  Files          62       72      +10     
  Lines        2914     3340     +426     
==========================================
+ Hits          570      653      +83     
- Misses       2344     2687     +343     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Dec 25 '23 23:12 codecov[bot]

Script Functionality

  1. Local Mode (default): Processes documents from local directories specified by the user.
  2. S3 Mode (--s3):
    • Downloads documents from an S3 bucket to temporary local storage (s3_temp_storage).
    • Processes these documents.
    • Uploads the processed documents back to the S3 bucket (see the sketch after this list).
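
For illustration, a minimal sketch of the S3 flow described above, assuming boto3. The helper names and the s3_temp_storage layout here are illustrative and not necessarily the PR's exact implementation:

```python
import os

import boto3

TEMP_DIR = "s3_temp_storage"


def download_documents(bucket: str, prefix: str = "") -> None:
    """Copy documents from the S3 bucket into the local temp folder."""
    s3 = boto3.client("s3")
    # Note: list_objects_v2 returns at most 1000 keys; a paginator would be
    # needed for larger buckets.
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        key = obj["Key"]
        # Skip folder placeholders and previously generated vector files.
        if key.endswith("/") or key.endswith((".faiss", ".pkl")):
            continue
        local_path = os.path.join(TEMP_DIR, key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(bucket, key, local_path)


def upload_vectors(bucket: str, save_folder: str = "") -> None:
    """Upload the generated vector store files back to the bucket."""
    s3 = boto3.client("s3")
    for name in ("index.faiss", "index.pkl"):
        key = f"{save_folder}/{name}" if save_folder else name
        s3.upload_file(os.path.join(TEMP_DIR, name), bucket, key)
```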

Enabling S3 Storage

To enable S3 storage, use the --s3 flag when running the script.

  1. Environment Variables: Set these variables in your .env file:

    • S3_BUCKET: Name of your S3 bucket.
    • S3_DOCUMENTS_FOLDER: Folder within the S3 bucket where your documents are stored. If left blank, all files in the S3 bucket will be downloaded (except .faiss and .pkl files).
    • S3_SAVE_FOLDER: Folder within the S3 bucket in which you would like to save the vector files. Leave blank to use the root of the bucket.
  2. Running the Script (see the sketch after this list):

    • python ingest.py ingest --s3
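
A rough sketch of how the flag and environment variables might be wired together, assuming a Typer-style CLI (the ingest subcommand suggests one) and python-dotenv for the .env file; the helper calls in the comments refer to the sketch above and are hypothetical:

```python
import os

import typer
from dotenv import load_dotenv

app = typer.Typer()


@app.command()
def ingest(s3: bool = typer.Option(False, "--s3", help="Read documents from and write vectors to S3")):
    load_dotenv()  # pull the S3_* variables from the .env file
    if s3:
        bucket = os.environ["S3_BUCKET"]                     # required
        docs_folder = os.getenv("S3_DOCUMENTS_FOLDER", "")   # blank -> whole bucket
        save_folder = os.getenv("S3_SAVE_FOLDER", "")        # blank -> bucket root
        # download_documents(bucket, docs_folder) ... parse ... upload_vectors(bucket, save_folder)
    else:
        # Default local mode: process documents from local directories.
        ...


if __name__ == "__main__":
    app()
```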

Enabling Role Assumption

If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the --s3-assume flag and proper AWS configuration.

  1. Environment Variable:

    • Add AWS_ASSUME_ROLE_PROFILE to your .env file with the name of the AWS profile for role assumption. Ex: AWS_ASSUME_ROLE_PROFILE="dev"
  2. AWS Configuration:
  • Credentials File (~/.aws/credentials):
    [default]
    aws_access_key_id = YOUR_DEFAULT_ACCESS_KEY
    aws_secret_access_key = YOUR_DEFAULT_SECRET_KEY
    
    [iamadmin]
    aws_access_key_id = EXAMPLEKEY123456
    aws_secret_access_key = EXAMPLESECRETKEY123456
    
  • Config File (~/.aws/config):
    [default]
    region = us-west-2
    output = json
    
    [profile dev]
    region = us-west-2
    role_arn = arn:aws:iam::123456789012:role/YourRoleName
    source_profile = iamadmin
    
  3. Running the Script with Role Assumption:

    • python ingest.py ingest --s3 --s3-assume

This configuration allows the script to assume YourRoleName using the credentials from the iamadmin profile.
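
For context on what the --s3-assume path does under the hood: boto3 resolves role_arn/source_profile profiles automatically, so the script essentially only needs to open a session with the configured profile. A sketch, assuming the variable and flag names above:

```python
import os

import boto3


def build_s3_client(assume_role: bool):
    """Return an S3 client, optionally via the assume-role profile."""
    if assume_role:
        # boto3 reads role_arn and source_profile for this profile from
        # ~/.aws/config and performs the sts:AssumeRole call itself.
        profile = os.environ["AWS_ASSUME_ROLE_PROFILE"]  # e.g. "dev"
        return boto3.Session(profile_name=profile).client("s3")
    # Otherwise use the default credential chain ([default] profile, env vars, ...).
    return boto3.client("s3")
```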

Note

  • Ensure that the IAM role (YourRoleName) has the necessary permissions to access the specified S3 bucket.
  • The script creates temporary local storage (s3_temp_storage) for processing S3 documents, which is cleaned up after processing (see the sketch below).
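
The temporary-storage lifecycle could be handled with a try/finally block along these lines (illustrative only; the function name is hypothetical):

```python
import shutil
import tempfile


def process_s3_documents(bucket: str) -> None:
    # The PR uses a fixed s3_temp_storage folder; a tempfile-based variant is
    # shown here only to illustrate the create/clean-up lifecycle.
    temp_dir = tempfile.mkdtemp(prefix="s3_temp_storage_")
    try:
        ...  # download from S3 into temp_dir, parse, build vectors, upload results
    finally:
        # Remove the temporary local copy whether or not processing succeeded.
        shutil.rmtree(temp_dir, ignore_errors=True)
```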

jaredbradley243 avatar Dec 25 '23 23:12 jaredbradley243

To me, the following parameters look more logical to pass as command-line parameters rather than as environment variables: S3_BUCKET, S3_SAVE_FOLDER, AWS_ASSUME_ROLE_PROFILE.

larinam avatar Jan 21 '24 19:01 larinam

Thanks for the change requests, @larinam. I'll get to these soon!

jaredbradley243 avatar Feb 09 '24 21:02 jaredbradley243

Working on these now.

jaredbradley243 avatar Mar 22 '24 19:03 jaredbradley243

> To me, the following parameters look more logical to pass as command-line parameters rather than as environment variables: S3_BUCKET, S3_SAVE_FOLDER, AWS_ASSUME_ROLE_PROFILE.

I added command line parameter support as well as env variables. 😁

jaredbradley243 avatar Mar 22 '24 22:03 jaredbradley243

In general, according to the ticket, support for S3 should be added to the application as well. In this PR the changes are only in the scripts package. Please refer to the initial ticket for more information.

larinam avatar Mar 30 '24 10:03 larinam

To avoid passing the S3 config through several calls, please consider using https://docs.pydantic.dev/latest/concepts/pydantic_settings/. The configuration is static and doesn't change through the lifecycle of the script, so please consider using a static object to manage these settings.
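
For illustration, a minimal sketch of what such a settings object could look like with pydantic-settings (the class and field names simply mirror the environment variables discussed above and are not from the PR):

```python
from functools import lru_cache

from pydantic_settings import BaseSettings, SettingsConfigDict


class S3Settings(BaseSettings):
    # Environment variables are matched case-insensitively, so S3_BUCKET in
    # the .env file populates s3_bucket here.
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    s3_bucket: str = ""
    s3_documents_folder: str = ""
    s3_save_folder: str = ""
    aws_assume_role_profile: str = ""


@lru_cache
def get_s3_settings() -> S3Settings:
    # A single cached instance for the whole run, instead of threading the
    # config through every call.
    return S3Settings()
```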

larinam avatar Mar 30 '24 10:03 larinam