aws-sdk-pandas icon indicating copy to clipboard operation
aws-sdk-pandas copied to clipboard

Duplicates values in identity column when mixing redshift.copy and redshift.to_sql

Open laderjk opened this issue 1 year ago • 2 comments

Hello everyone! We are having an issue with duplicated values in an identity column when executing redshift.to_sql() after executing a redshift.copy(). When trying to insert more than 1000 records we use the COPY method and when trying to insert less than 1000 records we use the redshift.to_sql() method as is recommended in the docs.

This is the structure of the table:

create table if not exists dim_testing
(
    id BIGINT IDENTITY NOT NULL PRIMARY KEY,
    name varchar
);

This is the code in which we are able to reproduce the issue:

import random
import string

import awswrangler as wr
import pandas as pd

def generate_df(length):
    return pd.DataFrame([''.join(random.choices(string.ascii_letters + string.digits, k=30)) for _ in range(length)], columns=['name'])

df = generate_df(1500)
wr.redshift.copy(
    df=df,
    con=redshift.conn,
    path=f's3://s3_bucket_path/dim_testing/',
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

df = generate_df(500)
wr.redshift.to_sql(
    df,
    redshift.conn,
    schema='public',
    table='dim_testing',
    use_column_names=True,
    mode='append'
)

After executing that code if we look for duplicates by the identity column id, we get duplicated values:

select id, count(*)
from dim_testing
group by id
having count(*) > 1;

If we only use to_sql() or copy(), the issue does not happens. It happens when executing to_sql() after a copy(). Is this normal? Are we doing something wrong? Should we only chose one write method and not use the other?

Thanks!

laderjk avatar Oct 15 '24 16:10 laderjk

Hi @laderjk thanks for opening this - looks like a race condition. Can you try passing lock=True to both calls? This will lock the table line up all the updates/inserts.

kukushking avatar Nov 15 '24 11:11 kukushking

Hi, @kukushking I tried passing lock=True to both calls and I get the same result. =(

laderjk avatar Dec 02 '24 22:12 laderjk

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] avatar Nov 02 '25 18:11 github-actions[bot]