suzieq
Evaluate/add option to store parquet files to S3
Problem
Scaling up Suzieq would be easier if the parquet files could be stored in S3. For example, the pollers could be geographically distributed across different sites while the data is kept in one central location.
Proposed feature
Evaluate/add an option to store parquet files in S3. There is also the option of using 'S3 Select' to filter content efficiently: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
From what I can see, you can read from and write to S3 directly by prefixing the path with s3://. So, for example:
import pyarrow.dataset as ds
dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"])
works. I suspect writing should be fine too. Could you please try this and let me know if it works as is?
If it does, we can figure out the content filtering next.