aws-sdk-pandas Redundant rows when mode = append

Open pparsineja-chwy opened this issue 3 years ago • 0 comments

I wrote a DataFrame (columns: product_part_number, date) to S3 using:

import awswrangler as wr

# Append df1_sub to the dataset, one partition per product_part_number
res = wr.s3.to_parquet(
    df=df1_sub,
    path="xxx",
    dataset=True,
    mode="append",
    partition_cols=["product_part_number"],
    use_threads=True,
    concurrent_partitioning=True,
)

Later, I wrote another DataFrame with the same columns to the same S3 path using the same call. I then noticed that this second DataFrame contained some rows that were already in the dataset, so those rows were appended again (expected, but not what I wanted). In another attempt I used mode="overwrite", which deleted all existing data under the S3 path and wrote only the new rows.

Is there a way for wrangler to detect the "redundant" rows, skip them, and append only the new (different) rows to the S3 path?
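To make the desired behavior concrete, here is a minimal sketch of a client-side workaround, not a built-in wrangler feature: read the key columns already in the dataset, anti-join the new DataFrame against them, and append only what is left. The helper names drop_existing_rows and append_new_rows are hypothetical, and the sketch assumes a row counts as "redundant" when its (product_part_number, date) pair already exists. It also assumes the key columns have compatible dtypes on both sides; partition columns read back from S3 may need a cast first.

```python
import pandas as pd


def drop_existing_rows(new_df, existing_df, key_cols):
    """Return rows of new_df whose key_cols values are not present in existing_df.

    Hypothetical helper: plain pandas anti-join, no AWS access needed.
    """
    # Remove duplicates inside the incoming batch itself.
    candidates = new_df.drop_duplicates(subset=key_cols)
    existing_keys = existing_df[key_cols].drop_duplicates()
    # The indicator column marks rows found only in the new batch as "left_only".
    merged = candidates.merge(existing_keys, on=key_cols, how="left", indicator=True)
    fresh = merged[merged["_merge"] == "left_only"]
    return fresh.drop(columns="_merge").reset_index(drop=True)


def append_new_rows(new_df, path, key_cols):
    """Hypothetical wrapper: append only rows not already stored under path."""
    import awswrangler as wr  # imported lazily so the helper above runs without it

    # Read only the key columns of the existing dataset.
    # Assumption: the dataset already exists and its key dtypes match new_df.
    existing = wr.s3.read_parquet(path=path, dataset=True, columns=key_cols)
    fresh = drop_existing_rows(new_df, existing, key_cols)
    if not fresh.empty:
        wr.s3.to_parquet(
            df=fresh,
            path=path,
            dataset=True,
            mode="append",
            partition_cols=["product_part_number"],
        )
    return fresh
```

This costs a read of the key columns on every write, so it fits small or moderately sized datasets best. If replacing whole partitions is acceptable, mode="overwrite_partitions" may be an alternative: it rewrites only the partitions present in the new DataFrame and leaves the others untouched.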

Aug 08 '22 20:08 pparsineja-chwy
