aws-sdk-pandas
Redundant rows when mode = append
I wrote a DataFrame (columns: product_part_number, date) to S3 using:
import awswrangler as wr

res = wr.s3.to_parquet(
    df=df1_sub,
    path="xxx",
    dataset=True,
    mode="append",
    partition_cols=["product_part_number"],
    use_threads=True,
    concurrent_partitioning=True,
)
Later, I wrote a second DataFrame with the same columns to the same S3 path using the call above. I then noticed that this DataFrame included some rows that were already in the dataset, so those redundant rows were appended as well (expected, but not desired). In another attempt I chose mode="overwrite", which removed all existing data under the S3 path and wrote only the new rows.
Is there any way for wrangler to detect the "redundant" rows, skip them, and append only the new (different) rows to the S3 path?
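
To make the desired behaviour concrete, here is a minimal sketch in plain pandas of the filtering I would otherwise have to do by hand before each append. It is only an illustration under assumptions: the helper name drop_already_written is hypothetical (it is not part of awswrangler), a row counts as "redundant" when its (product_part_number, date) pair already exists, and the sample data is made up.

```python
from typing import List

import pandas as pd


def drop_already_written(
    new_df: pd.DataFrame, existing_df: pd.DataFrame, keys: List[str]
) -> pd.DataFrame:
    """Hypothetical helper: keep rows of new_df whose key combination is absent from existing_df."""
    # Remove duplicates inside the incoming batch itself.
    new_unique = new_df.drop_duplicates(subset=keys)
    # Only the key columns of the existing data are needed for the comparison.
    existing_keys = existing_df[keys].drop_duplicates()
    # Left anti-join: rows marked "left_only" exist only in the incoming batch.
    merged = new_unique.merge(existing_keys, on=keys, how="left", indicator=True)
    only_new = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    return only_new.reset_index(drop=True)


# Made-up sample data standing in for what is already on S3 and the next batch.
existing = pd.DataFrame(
    {
        "product_part_number": ["A", "A", "B"],
        "date": ["2021-01-01", "2021-01-02", "2021-01-01"],
    }
)
incoming = pd.DataFrame(
    {
        "product_part_number": ["A", "B", "C"],
        "date": ["2021-01-02", "2021-01-03", "2021-01-01"],
    }
)

to_append = drop_already_written(
    incoming, existing, keys=["product_part_number", "date"]
)
print(to_append)
```

In practice, existing would presumably come from wr.s3.read_parquet(path="xxx", dataset=True) and to_append would be passed to the same wr.s3.to_parquet call with mode="append". That means reading the existing data back on every write, and column dtypes may need aligning before the merge, which is why I am asking whether wrangler can handle this itself.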