
Transfer data from S3 (or GCS/ADLS)

Open 99snowleopards opened this issue 3 years ago • 4 comments

Describe your feature request

I'd like to read data (Parquet files) directly from S3.

Thank you for releasing this super helpful project. The introductory blog mentions a plan to support transferring data from S3. Is there any update on that?

Thanks,

99snowleopards avatar Sep 01 '22 19:09 99snowleopards

Hi @99snowleopards, thank you for bringing this up. Currently we are focusing on loading data from relational databases (see this discussion), but we do think S3 will be an important data source. To help us decide its priority, it would be useful if you could share, here or in that discussion, the tool you currently use to load data from S3 and the issues you run into with it.

wangxiaoying avatar Sep 05 '22 21:09 wangxiaoying

Thank you for replying. I either use pandas to read the data directly into a DataFrame, or use the AWS CLI to `cp` the data locally and then read it into pandas. The issue is that it's very slow.

99snowleopards avatar Sep 06 '22 21:09 99snowleopards

Have you tried using Arrow? It is the fastest way I know of to fetch a DataFrame from S3. You can convert the Arrow table to pandas afterwards.

wangxiaoying avatar Sep 07 '22 04:09 wangxiaoying

I use the pyarrow engine. I'll try using Arrow separately and converting to pandas, as you suggested. Thanks again for replying.

99snowleopards avatar Sep 08 '22 14:09 99snowleopards