Evaluate/add option to store parquet files to S3

Open rasanentimo opened this issue 3 years ago • 1 comments

Problem

Scaling up Suzieq would be easier if the parquet files could be stored in S3. For example, the pollers could be geographically distributed across different sites while the data is kept in one central location.

Proposed feature

Evaluate/add an option to store parquet files in S3. There is also the option of using 'S3 Select' to filter content efficiently: https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html

rasanentimo avatar Apr 30 '21 08:04 rasanentimo

From what I can see, you can read/write to s3 directly by prefacing the filename with s3://. So, for example:

import pyarrow.dataset as ds
dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"])

works. I suspect writing should be fine too. Could you please try this and let me know if it works as is?
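For anyone who wants to try the write side, here is a minimal sketch, assuming pyarrow is installed with S3 support and AWS credentials are available in the environment. The helper names `dataset_uri` and `write_parquet`, and the bucket and prefix in the usage note, are hypothetical and not part of Suzieq; this has not been verified against a real bucket.

```python
def dataset_uri(bucket: str, prefix: str = "") -> str:
    """Build an s3:// URI from a (hypothetical) bucket name and key prefix."""
    prefix = prefix.strip("/")
    return f"s3://{bucket}/{prefix}" if prefix else f"s3://{bucket}"


def write_parquet(table, uri: str, partition_cols=None) -> None:
    """Write a pyarrow Table as a Parquet dataset under the given URI."""
    # Imported lazily so the URI helper stays usable without pyarrow.
    import pyarrow.dataset as ds
    from pyarrow import fs

    # from_uri picks the filesystem from the URI scheme (s3://, file://, ...)
    # and returns the path to use on that filesystem.
    filesystem, path = fs.FileSystem.from_uri(uri)
    ds.write_dataset(
        table,
        path,
        format="parquet",
        partitioning=partition_cols,  # e.g. a list of column names, or None
        filesystem=filesystem,
    )
```

Usage would look something like `write_parquet(table, dataset_uri("my-bucket", "suzieq/parquet"), ["namespace"])`. Passing a `file://` URI instead should exercise the same code path against a local directory, which may be a convenient way to check it before pointing it at S3.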

If it does, we can figure out the content filtering next.
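As a possible starting point for the filtering part, pyarrow datasets accept a filter expression when reading, so only matching rows are materialized. This is a sketch of pyarrow's own dataset filtering, not S3 Select; the function name `read_filtered` and the column name in the usage note are assumptions for illustration.

```python
def read_filtered(uri: str, column: str, value):
    """Read a Parquet dataset and keep only rows where column == value."""
    import pyarrow.dataset as ds
    from pyarrow import fs

    # Resolve the filesystem (S3, local, ...) from the URI scheme.
    filesystem, path = fs.FileSystem.from_uri(uri)
    dataset = ds.dataset(path, format="parquet", filesystem=filesystem)

    # The filter is applied while scanning rather than after loading
    # the whole dataset into memory.
    return dataset.to_table(filter=ds.field(column) == value)
```

For example, `read_filtered("s3://my-bucket/suzieq/parquet", "namespace", "dc1")` would return a table containing only the rows for that namespace, assuming such a column exists in the stored files.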

ddutt avatar May 31 '21 05:05 ddutt