aws-sdk-pandas Partitioning using wrangler slowed down the querying process

Open pparsineja-chwy opened this issue 3 years ago • 0 comments

I had a data set uploaded to S3 that was not partitioned (and was not written with dataset=True, so it had no metadata). I could successfully crawl the data with a Glue Crawler to create a table, and query it with Athena.

I then tried to partition the data using

    res = wr.s3.to_parquet(
        df=df1_sub,
        path="xxx",
        dataset=True,
        mode="append",
        partition_cols=["product_part_number"],
        use_threads=True,
        concurrent_partitioning=True,
    )

and then ran the same Glue Crawler on the result.

So I ended up with two tables: Table A, with the unpartitioned data, and Table B, with the dataset created above. The two tables have the same number of rows and identical data, except that Table B has a partition column.

Querying Table A is much faster than querying Table B. Even a basic select count(1) from Table B takes forever.
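One thing that may be worth checking (an assumption, since nothing above confirms it) is how many distinct values product_part_number has. With dataset=True and partition_cols, each distinct value gets its own S3 prefix of the form product_part_number=<value>/, and every append adds more files under those prefixes, so a high-cardinality column can leave Athena listing and opening a very large number of tiny files. The sketch below uses only the standard library; partition_stats and the sample values are hypothetical names introduced for illustration, not part of awswrangler.

```python
from collections import Counter
from statistics import median


def partition_stats(partition_values):
    """Summarise how rows would spread across Hive-style partitions.

    partition_values: any iterable of the values in the partition column
    (hypothetical helper, not an awswrangler function).
    """
    counts = Counter(partition_values)
    if not counts:
        return {"partitions": 0, "min_rows": 0, "median_rows": 0, "max_rows": 0}
    sizes = list(counts.values())
    return {
        "partitions": len(sizes),       # one S3 prefix per distinct value
        "min_rows": min(sizes),         # smallest partition
        "median_rows": median(sizes),   # typical partition
        "max_rows": max(sizes),         # largest partition
    }


# Made-up sample standing in for df1_sub["product_part_number"].
sample = ["A1", "A1", "B2", "C3", "C3", "C3"]
stats = partition_stats(sample)
print(stats)
```

On the real data, passing df1_sub["product_part_number"] should work the same way, since a pandas Series is iterable. If the partition count is in the thousands and the median rows per partition is small, the small-file overhead is a plausible explanation for the slow count.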

How can I improve this?

Aug 08 '22 21:08 pparsineja-chwy
