aws-sdk-pandas
Partitioning using wrangler slowed down the querying process
I had a data set uploaded to S3 that was not partitioned (and was not written with dataset=True, so it had no metadata). I could successfully crawl the data with a Glue Crawler to create a table and query it using Athena.
I then tried to partition the data using

    res = wr.s3.to_parquet(
        df=df1_sub,
        path="xxx",
        dataset=True,
        mode="append",
        partition_cols=["product_part_number"],
        use_threads=True,
        concurrent_partitioning=True,
    )

and then ran the same Glue Crawler on the result.
So I ended up with two tables: Table A, with the unpartitioned data, and Table B, with the dataset created above. The two tables have the same number of rows and the data in them is identical, except that Table B has a partition column.
Querying Table A is much faster than querying Table B. Even a simple
select count(1) from Table B
takes FOREVER.
How can I improve this?
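
For context on the write above: with partition_cols set, to_parquet writes one Hive-style prefix (product_part_number=<value>/) per distinct value, each holding at least one Parquet file. A rough way to gauge how many partitions Table B has, and how small they are, is sketched below. The helper name partition_stats and the sample values are hypothetical, made up for illustration; on the real data the input would be the partition column, e.g. df1_sub["product_part_number"].

```python
from collections import Counter


def partition_stats(values):
    """Summarise how a partition column would split a dataset.

    `values` is any iterable of partition-column values, for example
    a pandas Series such as df1_sub["product_part_number"].
    """
    counts = Counter(values)          # rows per distinct partition value
    n_partitions = len(counts)        # one S3 prefix per distinct value
    total_rows = sum(counts.values())
    return {
        "partitions": n_partitions,
        "rows": total_rows,
        "avg_rows_per_partition": total_rows / n_partitions if n_partitions else 0.0,
        "smallest_partition": min(counts.values()) if counts else 0,
    }


# Hypothetical sample standing in for the real partition column.
sample = ["A-100", "A-100", "B-200", "C-300", "C-300", "C-300"]
stats = partition_stats(sample)
print(stats)
```

If the number of partitions is large relative to the number of rows, Table B consists of many small files, which may be related to the slowdown I am seeing.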