
S3 Scan Source partition supplier creates partitions in memory and a failure causes no partitions to be created

Open graytaylor0 opened this issue 8 months ago • 3 comments

Is your feature request related to a problem? Please describe.
As a user of S3 scan, I have a bucket with 100 million objects. The current S3 scan source cannot handle this many objects because the partition supplier returns every object as a single in-memory list of partitions, which can lead to out-of-memory errors. Additionally, if the S3 scan supplier fails at any point, no partitions are created at all, because the supplier must return the full list before any partition is created in the coordination store.

Describe the solution you'd like
I would like the PartitionSupplier functions to be able to pass partitions back to the source coordinator for creation as they are discovered. Instead of holding every object in memory during a scan, the call to create a partition would be made as soon as the object is found.
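To make the request concrete, here is a minimal sketch of the two shapes. It is not the Data Prepper API: `StreamingPartitionSketch`, `InMemoryCoordinator`, `listAllPartitions`, `scanAndCreate`, and `failingListing` are hypothetical names invented for illustration, and a plain `Iterator<String>` stands in for the S3 object listing.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class StreamingPartitionSketch {

    // Hypothetical stand-in for the coordination store (not a real Data Prepper class).
    static final class InMemoryCoordinator {
        final List<String> created = new ArrayList<>();

        void createPartition(String partitionKey) {
            created.add(partitionKey);
        }
    }

    // Shape described in the problem statement: collect everything, then return.
    // A failure while iterating means the caller never receives any partition.
    static List<String> listAllPartitions(Iterator<String> objectKeys) {
        List<String> partitions = new ArrayList<>();
        while (objectKeys.hasNext()) {
            partitions.add(objectKeys.next());
        }
        return partitions;
    }

    // Requested shape: hand each partition to the coordinator as soon as it is found,
    // so nothing accumulates in the supplier.
    static int scanAndCreate(Iterator<String> objectKeys, Consumer<String> partitionCreator) {
        int count = 0;
        while (objectKeys.hasNext()) {
            partitionCreator.accept(objectKeys.next());
            count++;
        }
        return count;
    }

    // Simulated object listing that throws after yielding `failAfter` keys.
    static Iterator<String> failingListing(final int failAfter) {
        return new Iterator<String>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return true;
            }

            @Override
            public String next() {
                if (index >= failAfter) {
                    throw new IllegalStateException("simulated listing failure");
                }
                return "bucket|object-" + (index++);
            }
        };
    }

    public static void main(String[] args) {
        InMemoryCoordinator coordinator = new InMemoryCoordinator();
        try {
            scanAndCreate(failingListing(3), coordinator::createPartition);
        } catch (IllegalStateException e) {
            // The scan failed, but partitions found before the failure were already created.
        }
        System.out.println(coordinator.created.size()); // prints 3
    }
}
```

With the list-returning shape, the same simulated failure propagates out of `listAllPartitions` and the caller gets nothing to create; with the callback shape, the three partitions found before the failure are kept. How a real implementation would resume a scan after such a failure is outside this sketch.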



graytaylor0 • Jun 06 '24 14:06