feat(aws): add s3 support to input, storage, output, cache, etc.

Open knguyen1 opened this issue 9 months ago • 6 comments

Description

This PR adds S3 integration to GraphRAG. It supports both AWS S3 and S3-compatible services such as MinIO (via endpoint_url).

Related Issues

#1306

Proposed Changes

  • Add S3 pipeline storage implementation with full PipelineStorage interface support (graphrag/storage/s3_pipeline_storage.py)
  • Add S3 workflow callbacks for logging workflow events to S3 buckets (graphrag/callbacks/s3_workflow_callbacks.py)
  • Add S3 prompt loading capability for retrieving prompts directly from S3 buckets (graphrag/config/prompt_getter.py)
  • Add configuration support for S3 across all storage components (input, output, cache, reporting)
  • Add comprehensive documentation covering configuration, authentication options, and troubleshooting (docs/config/s3.md)
  • Add unit tests with mocked AWS services for all S3 components
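As a rough illustration of the first and last bullets, the sketch below shows how a storage class might map get/set/has onto S3 object calls under a key prefix, and how it could be exercised without real AWS access. This is not the code in this PR: `S3StorageSketch`, its method signatures, and the `FakeS3Client` stand-in are hypothetical, and GraphRAG's actual PipelineStorage interface may differ (for example, it may be async). The client methods used (`put_object`, `get_object`, `list_objects_v2`) are standard boto3 S3 client calls.

```python
import io
from typing import Any, Optional


class S3StorageSketch:
    """Hypothetical, synchronous sketch of an S3-backed key/value storage."""

    def __init__(self, client: Any, bucket: str, prefix: str = "") -> None:
        self._client = client  # a boto3 S3 client, or any object with the same methods
        self._bucket = bucket
        self._prefix = prefix.strip("/")

    def _key(self, name: str) -> str:
        # Place every object under the configured prefix, if one is set.
        return f"{self._prefix}/{name}" if self._prefix else name

    def set(self, name: str, value: bytes) -> None:
        self._client.put_object(Bucket=self._bucket, Key=self._key(name), Body=value)

    def has(self, name: str) -> bool:
        # The exact key sorts first among all keys that share it as a prefix,
        # so a one-item listing is enough to check for existence.
        key = self._key(name)
        response = self._client.list_objects_v2(
            Bucket=self._bucket, Prefix=key, MaxKeys=1
        )
        return any(obj["Key"] == key for obj in response.get("Contents", []))

    def get(self, name: str) -> Optional[bytes]:
        if not self.has(name):
            return None
        response = self._client.get_object(Bucket=self._bucket, Key=self._key(name))
        return response["Body"].read()


class FakeS3Client:
    """In-memory stand-in for a boto3 S3 client (illustration and tests only)."""

    def __init__(self) -> None:
        self._objects = {}  # maps (bucket, key) to the stored bytes

    def put_object(self, Bucket: str, Key: str, Body: bytes) -> dict:
        self._objects[(Bucket, Key)] = Body
        return {}

    def get_object(self, Bucket: str, Key: str) -> dict:
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

    def list_objects_v2(self, Bucket: str, Prefix: str = "", MaxKeys: int = 1000) -> dict:
        keys = sorted(
            key for (bucket, key) in self._objects
            if bucket == Bucket and key.startswith(Prefix)
        )
        contents = [{"Key": key} for key in keys[:MaxKeys]]
        if not contents:
            return {"KeyCount": 0}
        return {"KeyCount": len(contents), "Contents": contents}


storage = S3StorageSketch(FakeS3Client(), bucket="example-bucket", prefix="output")
storage.set("entities.parquet", b"data")
```

Injecting the client keeps the storage logic testable with a fake; the PR's own tests may rely on a different mocking approach.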

Checklist

  • [x] I have tested these changes locally.
  • [x] I have reviewed the code changes.
  • [x] I have updated the documentation (if necessary).
  • [x] I have added appropriate unit tests (if applicable).

Additional Notes

  • Supports multiple authentication methods: explicit credentials, environment variables, AWS credential chain, and IAM roles
  • Compatible with S3-compatible storage services via configurable endpoint URLs
  • Implements lazy loading of S3 clients for improved performance
  • Includes proper error handling and logging for S3 operations
  • Storage paths are configurable via environment variables or YAML configuration
  • All S3 operations are thoroughly tested with mocked AWS services
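The authentication and lazy-loading notes above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `build_client_kwargs`, `LazyS3Client`, and `fake_factory` are hypothetical names. The only real library call is `boto3.client("s3", ...)` with its `aws_access_key_id`, `aws_secret_access_key`, and `region_name` keyword arguments; when explicit credentials are omitted, boto3 falls back to its default credential chain (environment variables, shared config files, IAM roles).

```python
from functools import cached_property
from typing import Any, Callable, Optional


def build_client_kwargs(
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
    region_name: Optional[str] = None,
) -> dict:
    """Collect only the options that were explicitly set (hypothetical helper)."""
    kwargs: dict = {}
    # Pass explicit credentials only when both halves are present; otherwise
    # leave them out so boto3 resolves credentials through its default chain.
    if access_key_id and secret_access_key:
        kwargs["aws_access_key_id"] = access_key_id
        kwargs["aws_secret_access_key"] = secret_access_key
    if region_name:
        kwargs["region_name"] = region_name
    return kwargs


def _boto3_factory(**kwargs: Any) -> Any:
    import boto3  # imported here so nothing is loaded until a client is needed

    return boto3.client("s3", **kwargs)


class LazyS3Client:
    """Defers client construction until the first access (hypothetical class)."""

    def __init__(self, factory: Callable[..., Any] = _boto3_factory, **options: Any) -> None:
        self._factory = factory
        self._kwargs = build_client_kwargs(**options)

    @cached_property
    def client(self) -> Any:
        # Runs once; later accesses reuse the cached client object.
        return self._factory(**self._kwargs)


# Usage with a fake factory, so the example runs without AWS access.
calls = []


def fake_factory(**kwargs: Any) -> Any:
    calls.append(kwargs)
    return object()


lazy = LazyS3Client(factory=fake_factory, region_name="us-east-1")
first = lazy.client
second = lazy.client
```

Deferring construction means that importing or configuring the storage does not trigger credential resolution or network setup until an S3 operation is actually requested.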

knguyen1 • Mar 20 '25 14:03

@microsoft-github-policy-service agree

knguyen1 • Mar 20 '25 14:03

Can you add the option to enter the endpoint URL to the boto3 client as well so that storage to other platforms such as minIO is also possible through the S3 API?

Sirorororo • Apr 08 '25 03:04

> Can you add the option to enter the endpoint URL to the boto3 client as well so that storage to other platforms such as minIO is also possible through the S3 API?

Done: https://github.com/microsoft/graphrag/pull/1830/commits/f1fd55daf176cbe853127c625a034b8cdbe2061a
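For readers following along, pointing a boto3 client at an S3-compatible service generally looks like the sketch below. It is not taken from the linked commit: `normalize_endpoint_url` and `make_s3_client` are hypothetical helpers, and `http://localhost:9000` is only an example address for a local MinIO instance. The `endpoint_url` and `config` keyword arguments of `boto3.client` and the `botocore.config.Config(s3={"addressing_style": "path"})` option are real; whether path-style addressing is needed depends on the service.

```python
from typing import Any, Optional
from urllib.parse import urlparse


def normalize_endpoint_url(endpoint_url: Optional[str]) -> Optional[str]:
    """Validate a custom endpoint before handing it to boto3 (hypothetical helper)."""
    if not endpoint_url:
        return None  # None lets boto3 use the regular AWS endpoint
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"endpoint_url must start with http:// or https://: {endpoint_url!r}"
        )
    return endpoint_url.rstrip("/")


def make_s3_client(endpoint_url: Optional[str] = None, path_style: bool = False) -> Any:
    """Create an S3 client for AWS or an S3-compatible service (hypothetical helper)."""
    import boto3
    from botocore.config import Config

    # Some S3-compatible services expect the bucket in the URL path
    # rather than as a subdomain of the endpoint host.
    config = Config(s3={"addressing_style": "path"}) if path_style else None
    return boto3.client(
        "s3",
        endpoint_url=normalize_endpoint_url(endpoint_url),
        config=config,
    )


# Example address for a local MinIO instance (an assumption for illustration).
minio_endpoint = normalize_endpoint_url("http://localhost:9000/")
```

Calling `make_s3_client(minio_endpoint, path_style=True)` would then return a client that talks to that endpoint instead of AWS, with credentials still resolved the usual way.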

knguyen1 • Apr 09 '25 20:04

Please review @natoverse

knguyen1 • Apr 09 '25 20:04

@natoverse @AlonsoGuevara review please?

knguyen1 • Apr 24 '25 13:04

What is the status of this PR? We run our infra on AWS, so it would be cool to have this functionality.

qcloop • May 21 '25 21:05

Unless you review this PR soon, I'm going to close without merging. I am now getting conflicts too numerous and too complex to resolve cleanly. @natoverse @AlonsoGuevara

knguyen1 • Jun 13 '25 09:06

  • Resolved conflicts and rebased: https://github.com/microsoft/graphrag/pull/1830/commits/40b0affffbe3d4f5049f95868470fac3fd8f07bf
  • Moved s3 configs to StorageConfig class: https://github.com/microsoft/graphrag/pull/1830/commits/e8936365976171aa165e2425d1ead526f0176608
  • Updated documentation: https://github.com/microsoft/graphrag/pull/1830/commits/980371e286890667fd9832eebc12e1037637cd41

@natoverse @AlonsoGuevara

knguyen1 • Jun 13 '25 14:06

Closing due to inactivity.

knguyen1 • Jun 26 '25 18:06