Update Scripts to integrate AWS S3
- What kind of change does this PR introduce? (Bug fix, feature, docs update, ...)
  This PR modifies the ingest and open_ai_func files to integrate AWS S3 into DocsGPT.
- Why was this change needed? (You can also link to an open issue here)
  https://github.com/arc53/DocsGPT/issues/724
- Other information:
Codecov Report
Attention: Patch coverage is 0% with 93 lines in your changes missing coverage. Please review.
Project coverage is 19.55%. Comparing base (7f79363) to head (30f2171). Report is 180 commits behind head on main.
| Files | Patch % | Lines |
|---|---|---|
| scripts/ingest.py | 0.00% | 59 Missing :warning: |
| scripts/parser/open_ai_func.py | 0.00% | 34 Missing :warning: |
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   19.56%   19.55%   -0.01%
==========================================
  Files          62       72      +10
  Lines        2914     3340     +426
==========================================
+ Hits          570      653      +83
- Misses       2344     2687     +343
```
Script Functionality
- Local Mode (`default`): Processes documents from local directories specified by the user.
- S3 Mode (`--s3`), sketched below:
  - Downloads documents from an S3 bucket to temporary local storage (`s3_temp_storage`).
  - Processes these documents.
  - Uploads the processed documents back to the S3 bucket.
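For reference, here is a minimal sketch of what that S3 mode flow might look like with boto3. The `process_documents` helper, the output file names, and the single-page object listing are illustrative assumptions rather than the PR's actual code.

```python
import os
import shutil

import boto3  # assumed dependency for the S3 integration

TEMP_DIR = "s3_temp_storage"


def process_documents(folder: str) -> None:
    """Placeholder for the existing ingest/embedding step."""
    ...


def run_s3_mode(bucket_name: str) -> None:
    """Rough outline of the download -> process -> upload cycle."""
    s3 = boto3.client("s3")
    os.makedirs(TEMP_DIR, exist_ok=True)
    try:
        # 1. Download documents from the bucket into temporary local storage.
        for obj in s3.list_objects_v2(Bucket=bucket_name).get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip folder placeholder objects
                continue
            local_path = os.path.join(TEMP_DIR, key)
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(bucket_name, key, local_path)

        # 2. Process the downloaded documents (placeholder for the ingest step).
        process_documents(TEMP_DIR)

        # 3. Upload the generated vector files back to the bucket.
        for name in ("index.faiss", "index.pkl"):  # assumed output names
            out_path = os.path.join(TEMP_DIR, name)
            if os.path.exists(out_path):
                s3.upload_file(out_path, bucket_name, name)
    finally:
        # Temporary storage is cleaned up after processing.
        shutil.rmtree(TEMP_DIR, ignore_errors=True)
```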
Enabling S3 Storage
To enable S3 storage, use the `--s3` flag when running the script.
- Environment Variables: Set these variables in your `.env` file (a sketch of how they might be used follows this list):
  - `S3_BUCKET`: Name of your S3 bucket.
  - `S3_DOCUMENTS_FOLDER`: Folder within the S3 bucket where your documents are stored. If left blank, all files in the S3 bucket will be downloaded (except `.faiss` and `.pkl` files).
  - `S3_SAVE_FOLDER`: Folder within the S3 bucket in which you would like to save the vector files. Leave blank to use the root of the bucket.
- Running the Script:
  - `python ingest.py ingest --s3`
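As a rough illustration of how these variables might drive the download step (not the PR's exact logic), including the `.faiss`/`.pkl` filtering mentioned above:

```python
import os

import boto3

bucket = os.getenv("S3_BUCKET")
prefix = os.getenv("S3_DOCUMENTS_FOLDER", "")   # blank -> whole bucket
save_folder = os.getenv("S3_SAVE_FOLDER", "")   # blank -> bucket root (used when uploading vector files)

if not bucket:
    raise SystemExit("S3_BUCKET is not set")

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Existing vector store files are skipped when pulling documents down.
        if key.endswith((".faiss", ".pkl")) or key.endswith("/"):
            continue
        print(f"would download s3://{bucket}/{key}")
```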
Enabling Role Assumption
If accessing an S3 bucket requires assuming an IAM role (e.g., for cross-account access), the script supports this through the `--s3-assume` flag and proper AWS configuration.
- Environment Variable:
  - Add `AWS_ASSUME_ROLE_PROFILE` to your `.env` file with the name of the AWS profile to use for role assumption. Ex: `AWS_ASSUME_ROLE_PROFILE="dev"`
- AWS Configuration:
  - Credentials File (`~/.aws/credentials`):

    ```ini
    [default]
    aws_access_key_id = YOUR_DEFAULT_ACCESS_KEY
    aws_secret_access_key = YOUR_DEFAULT_SECRET_KEY

    [iamadmin]
    aws_access_key_id = EXAMPLEKEY123456
    aws_secret_access_key = EXAMPLESECRETKEY123456
    ```

  - Config File (`~/.aws/config`):

    ```ini
    [default]
    region = us-west-2
    output = json

    [profile dev]
    region = us-west-2
    role_arn = arn:aws:iam::123456789012:role/YourRoleName
    source_profile = iamadmin
    ```

- Running the Script with Role Assumption:
  - `python ingest.py ingest --s3 --s3-assume`

This configuration allows the script to assume `YourRoleName` using the credentials from the `iamadmin` profile (see the sketch below).
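For illustration, the `--s3-assume` path could pick up that profile roughly like this (a sketch assuming the script builds its S3 client from a `boto3.Session`); boto3 reads `role_arn` and `source_profile` from `~/.aws/config` and performs the AssumeRole call automatically:

```python
import os

import boto3

profile = os.getenv("AWS_ASSUME_ROLE_PROFILE")  # e.g. "dev"

if profile:
    # boto3 resolves role_arn/source_profile for this profile from
    # ~/.aws/config and assumes the role on our behalf.
    session = boto3.Session(profile_name=profile)
else:
    session = boto3.Session()  # default credential chain

s3 = session.client("s3")
```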
Note
- Ensure that the IAM role (`YourRoleName`) has the necessary permissions to access the specified S3 bucket.
- The script will create temporary local storage (`s3_temp_storage`) for processing S3 documents, which will be cleaned up after processing.
To me, the following parameters look more logical to pass as command-line parameters rather than as environment variables: `S3_BUCKET`, `S3_SAVE_FOLDER`, `AWS_ASSUME_ROLE_PROFILE`.
Thanks for the change requests, @larinam. I'll get to these soon!
Working on these now.
> To me, the following parameters look more logical to pass as command-line parameters rather than as environment variables: `S3_BUCKET`, `S3_SAVE_FOLDER`, `AWS_ASSUME_ROLE_PROFILE`.
I added command line parameter support as well as env variables. 😁
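A minimal sketch of how the CLI parameters might fall back to the `.env` values; the argument names here are assumptions, and the script's real CLI may be structured differently:

```python
import argparse
import os

parser = argparse.ArgumentParser(description="DocsGPT ingest (sketch)")
parser.add_argument("--s3", action="store_true", help="read documents from S3")
parser.add_argument("--s3-assume", action="store_true",
                    help="assume an IAM role before accessing S3")
parser.add_argument("--s3-bucket", default=os.getenv("S3_BUCKET"))
parser.add_argument("--s3-save-folder", default=os.getenv("S3_SAVE_FOLDER", ""))
parser.add_argument("--assume-role-profile",
                    default=os.getenv("AWS_ASSUME_ROLE_PROFILE"))
args = parser.parse_args()
# Explicit flags win; otherwise the environment (or .env) value is used.
```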
In general, according to the ticket, support for S3 should be added to the application as well. In this PR the changes are only in the scripts package. Please refer to the initial ticket for more information.
To avoid passing the S3 config through several calls, please consider using https://docs.pydantic.dev/latest/concepts/pydantic_settings/. The configuration is static and doesn't change through the lifecycle of the script, so please consider using a static object to manage these settings.
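A minimal sketch of that suggestion with pydantic-settings (field names are assumptions mirroring the env variables above); a single module-level settings object can then be imported wherever S3 access is needed instead of threading values through several calls:

```python
from typing import Optional

from pydantic_settings import BaseSettings, SettingsConfigDict


class S3Settings(BaseSettings):
    """Static S3 configuration, loaded once from the environment / .env."""
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    s3_bucket: Optional[str] = None
    s3_documents_folder: str = ""
    s3_save_folder: str = ""
    aws_assume_role_profile: Optional[str] = None


# Instantiated once; fields map to S3_BUCKET, S3_DOCUMENTS_FOLDER, etc.
s3_settings = S3Settings()
```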