feat(aws): add s3 support to input, storage, output, cache, etc.
Description
This PR adds S3 integration to GraphRAG. It supports both AWS S3 and S3-compatible services such as MinIO (via endpoint_url).
Related Issues
#1306
Proposed Changes
- Add S3 pipeline storage implementation with full PipelineStorage interface support (graphrag/storage/s3_pipeline_storage.py)
- Add S3 workflow callbacks for logging workflow events to S3 buckets (graphrag/callbacks/s3_workflow_callbacks.py)
- Add S3 prompt loading capability for retrieving prompts directly from S3 buckets (graphrag/config/prompt_getter.py)
- Add configuration support for S3 across all storage components (input, output, cache, reporting)
- Add comprehensive documentation covering configuration, authentication options, and troubleshooting (docs/config/s3.md)
- Add unit tests with mocked AWS services for all S3 components
Checklist
- [x] I have tested these changes locally.
- [x] I have reviewed the code changes.
- [x] I have updated the documentation (if necessary).
- [x] I have added appropriate unit tests (if applicable).
Additional Notes
- Supports multiple authentication methods: explicit credentials, environment variables, AWS credential chain, and IAM roles
- Compatible with S3-compatible storage services via configurable endpoint URLs
- Implements lazy loading of S3 clients for improved performance
- Includes proper error handling and logging for S3 operations
- Storage paths are configurable via environment variables or YAML configuration
- All S3 operations are thoroughly tested with mocked AWS services
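
To illustrate the lazy-loading and endpoint URL notes above, here is a minimal sketch of the pattern. It is not the code from this PR: `LazyS3Client`, `client_kwargs`, and the `client_factory` parameter are hypothetical names chosen for the example. The only real API it relies on is `boto3.client("s3", ...)` with its `endpoint_url`, `region_name`, `aws_access_key_id`, and `aws_secret_access_key` keyword arguments.

```python
from typing import Any, Callable, Dict, Optional


def _default_client_factory(**kwargs: Any) -> Any:
    # boto3 is imported here rather than at module import time, so creating
    # the wrapper object costs nothing until S3 is actually used.
    import boto3

    return boto3.client("s3", **kwargs)


class LazyS3Client:
    """Hypothetical helper: builds the S3 client on first access only."""

    def __init__(
        self,
        endpoint_url: Optional[str] = None,
        region_name: Optional[str] = None,
        aws_access_key_id: Optional[str] = None,
        aws_secret_access_key: Optional[str] = None,
        client_factory: Callable[..., Any] = _default_client_factory,
    ) -> None:
        self._options: Dict[str, Optional[str]] = {
            "endpoint_url": endpoint_url,
            "region_name": region_name,
            "aws_access_key_id": aws_access_key_id,
            "aws_secret_access_key": aws_secret_access_key,
        }
        self._client_factory = client_factory
        self._client: Any = None

    def client_kwargs(self) -> Dict[str, str]:
        # Unset options are dropped so that boto3 falls back to its default
        # credential chain (environment variables, shared config, IAM role).
        return {k: v for k, v in self._options.items() if v is not None}

    @property
    def client(self) -> Any:
        # Create the client once, on first use, and reuse it afterwards.
        if self._client is None:
            self._client = self._client_factory(**self.client_kwargs())
        return self._client
```

With this shape, pointing at an S3-compatible service would only require passing an endpoint, for example `LazyS3Client(endpoint_url="http://localhost:9000")` for a local MinIO instance, while leaving every argument unset defers entirely to the AWS credential chain. The injectable factory is one way to keep unit tests free of real AWS calls.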
@microsoft-github-policy-service agree
Can you add the option to enter the endpoint URL to the boto3 client as well so that storage to other platforms such as minIO is also possible through the S3 API?
Done: https://github.com/microsoft/graphrag/pull/1830/commits/f1fd55daf176cbe853127c625a034b8cdbe2061a
Please review @natoverse
@natoverse @AlonsoGuevara review please?
What is the status of this PR? We run our infra on AWS, so it would be cool to have this functionality.
Unless you review this PR soon, I'm going to close without merging. I am now getting conflicts too numerous and too complex to resolve cleanly. @natoverse @AlonsoGuevara
Resolved conflicts and rebased: https://github.com/microsoft/graphrag/pull/1830/commits/40b0affffbe3d4f5049f95868470fac3fd8f07bf
Moved s3 configs to StorageConfig class: https://github.com/microsoft/graphrag/pull/1830/commits/e8936365976171aa165e2425d1ead526f0176608
Update documentation: https://github.com/microsoft/graphrag/pull/1830/commits/980371e286890667fd9832eebc12e1037637cd41
@natoverse @AlonsoGuevara
Closing due to inactivity.