When using to_sql(), continue if duplicate primary keys are detected?

Open rosstripi opened this issue 7 years ago • 33 comments

Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
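
Something like this, for example (hypothetical, of course):

df.to_sql('TableNameHere', engine, if_exists='append_skipdupes', chunksize=900, index=False)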

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

rosstripi avatar Apr 12 '17 23:04 rosstripi

This should support an "on duplicate update" mode as well.

rockg avatar Apr 13 '17 02:04 rockg

@rosstripi I think the idea to have this would certainly be accepted, but AFAIK the main bottleneck is an implementation for this using sql/sqlalchemy in a flavor-agnostic way. Some exploration of how this could be done is certainly welcome!
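
For reference, the flavor-specific part is roughly this: SQLAlchemy only exposes conflict handling through its dialect-specific insert constructs, so a generic implementation would need per-backend dispatch along the lines of the sketch below (a sketch only, with a made-up helper name; assumes SQLAlchemy 1.4+ for the sqlite variant):

from sqlalchemy.dialects.mysql import insert as mysql_insert
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.dialects.sqlite import insert as sqlite_insert

def insert_ignore_stmt(table, rows, dialect_name):
    # each backend spells "skip rows with duplicate keys" differently
    if dialect_name == "postgresql":
        return pg_insert(table).values(rows).on_conflict_do_nothing()
    if dialect_name == "sqlite":
        return sqlite_insert(table).values(rows).on_conflict_do_nothing()
    if dialect_name == "mysql":
        # INSERT IGNORE INTO ...
        return mysql_insert(table).values(rows).prefix_with("IGNORE")
    raise NotImplementedError(f"no duplicate-skipping insert for {dialect_name}")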

jorisvandenbossche avatar Apr 13 '17 09:04 jorisvandenbossche

Hi, did you figure out any workaround for this? Please let me know.

muniswamy89 avatar Jun 06 '18 16:06 muniswamy89

Any update on this implementation?

I am now facing this problem with PostgreSQL and SQLAlchemy and would love to have that "on duplicate update".

Thanks for the work

AlvaroPica avatar Dec 10 '18 11:12 AlvaroPica

A workaround would be to remove the unique index in the database:

sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"

Afterwards, df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False) can be executed.

Then just let your MySQL server add the index again and drop the duplicates:

sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"

Depending on your specific application, this can be helpful. Anyway, an if_exists option like append_skipdupes would be much better.
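
A minimal sketch of that sequence with SQLAlchemy (database, table, index, and column names are placeholders; note that ALTER IGNORE is no longer available on recent MySQL versions):

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/DATABASE")

# 1. drop the unique index so the append cannot fail on duplicate keys
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"))

# 2. append everything, duplicates included
df.to_sql('TABLE', engine, if_exists='append', chunksize=900, index=False)

# 3. re-create the index; IGNORE makes MySQL silently drop the duplicate rows
with engine.begin() as conn:
    conn.execute(text(
        "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` "
        "(`column_name1` ASC, `column_name2` ASC)"
    ))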

valewyss avatar Apr 16 '19 13:04 valewyss

append_skipdupes would be the perfect way to handle this.

cgi1 avatar May 14 '19 21:05 cgi1

yes, append_skipdupes +1

macdet avatar Jun 28 '19 19:06 macdet

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
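
A minimal end-to-end sketch of that approach (table and column names are just examples; assumes SQLAlchemy 1.4+):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///my.db")
with engine.begin() as conn:
    # ON CONFLICT IGNORE on the primary key turns duplicate inserts into silent no-ops
    conn.exec_driver_sql("""
        CREATE TABLE IF NOT EXISTS my_table_name (
            some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
            some_value TEXT
        )
    """)

df = pd.DataFrame({"some_kind_of_id": [1, 2], "some_value": ["a", "b"]})
df.to_sql("my_table_name", engine, if_exists="append", index=False)
# running the append again simply skips the rows that are already there
df.to_sql("my_table_name", engine, if_exists="append", index=False)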

jtkiley avatar Aug 06 '19 14:08 jtkiley

Another workaround with MariaDB and MySQL: df.to_csv("test.csv"), then use LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.

LOAD DATA is much faster than INSERT.

complete code:

from pathlib import Path

# application_path, db, table_name, df and the database connection are defined elsewhere
csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
        """.format(db=db,
                   file_path=csv_path,
                   db_table=table_name,
                   col=",".join(df.columns.tolist()))
# rq then has to be executed on a connection opened with LOCAL INFILE enabled,
# e.g. cursor.execute(rq)

netchose avatar Oct 24 '19 08:10 netchose

I believe this is being addressed in #29636 with the upsert_ignore argument, which addresses #14553.

kjford avatar Dec 09 '19 05:12 kjford

append_skipdupes +1

iveteran avatar Jun 20 '20 14:06 iveteran

+1 for append_skipdupes

grantog avatar Aug 23 '20 16:08 grantog

Agree 'append_skipdupes' should be added.

Arham-Aalam avatar Aug 26 '20 08:08 Arham-Aalam

Yes, please. 'append_skipdupes' should be added, and not only for the primary key column. If there are duplicates among other unique columns, it should also skip appending those new duplicate rows.

rahullak avatar Sep 21 '20 03:09 rahullak

+1 for append_skipdupes

devashishnyati avatar Oct 30 '20 22:10 devashishnyati

append_skipdupes +1

rishabh-vij avatar Nov 12 '20 16:11 rishabh-vij

append_skipdupes +1

mc55boy avatar Nov 21 '20 20:11 mc55boy

+1 for append_skipdupes

IsraaMa avatar Nov 29 '20 01:11 IsraaMa

In the meantime you can use this: https://pypi.org/project/pangres/

BuSHari avatar Nov 29 '20 09:11 BuSHari

+1 for append_skipdupes

kxbin avatar Jan 19 '21 09:01 kxbin

+1 for append_skipdupes

frostless avatar Mar 25 '21 23:03 frostless

+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps append_updatedupes.

singhal2 avatar Apr 07 '21 16:04 singhal2

+1

tombohub avatar Apr 14 '21 04:04 tombohub

I have made a small script for my use to allow INSERT IGNORE in MySQL:
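
(Note: these are methods of a database helper class; they assume import random, import string, import pandas as pd, and that self.conn is an open SQLAlchemy connection or engine.)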

    def save_dataframe(self, df: pd.DataFrame, table: str):
        '''
        Save dataframe to the database.
        The index is saved if it has a name; if its name is None it is not saved.
        It implements INSERT IGNORE when inserting rows into the MySQL table.
        The table needs to exist beforehand.

        Arguments:
            df {pd.DataFrame} -- dataframe to save
            table {str} -- name of the db table
        '''
        if df.index.name is None:
            save_index = False
        else:
            save_index = True

        self._insert_conflict_ignore(df=df, table=table, index=save_index)

   
    def _insert_conflict_ignore(self, df: pd.DataFrame, table: str, index: bool):
        """
        Saves dataframe to the MySQL database with 'INSERT IGNORE' query.
        
        First it uses pandas.to_sql to save to a temporary table.
        After that it uses SQL to transfer the data to the destination table, matching the columns.
        The destination table needs to exist already.
        The final step is deleting the temporary table.

        Parameters
        ----------
        df : pd.DataFrame
            dataframe to save
        table : str
            destination table name
        """
        # generate random table name for concurrent writing
        temp_table = ''.join(random.choice(string.ascii_letters) for i in range(10))
        try:
            df.to_sql(temp_table, self.conn, index=index)
            columns = self._table_column_names(table=temp_table)
            insert_query = f'INSERT IGNORE INTO {table}({columns}) SELECT {columns} FROM `{temp_table}`'
            self.conn.execute(insert_query)
        except Exception as e:
            print(e)        

        # drop temp table
        drop_query = f'DROP TABLE IF EXISTS `{temp_table}`'
        self.conn.execute(drop_query)


    def _table_column_names(self, table: str) -> str:
        """
        Get column names from database table

        Parameters
        ----------
        table : str
            name of the table

        Returns
        -------
        str
            names of columns as a string so we can interpolate into the SQL queries
        """
        query = f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'"
        rows = self.conn.execute(query)
        dirty_names = [i[0] for i in rows]
        clean_names = '`' + '`, `'.join(map(str, dirty_names)) + '`'
        return clean_names

https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea

tombohub avatar Apr 20 '21 21:04 tombohub

+1 for append_skipdupes

tinglinliu avatar Jan 12 '22 05:01 tinglinliu

Rather than upvoting this issue, which already has a lot of votes, someone could help with this PR: https://github.com/pandas-dev/pandas/pull/29636

hyamanieu avatar Jan 12 '22 06:01 hyamanieu

Instead of only skipping duplicates, an option to choose between raise, ignore and replace would be even better. That way you can choose to have an exception raised, skip the duplicate, or have the duplicate row removed and the new data inserted.
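
For illustration, the call might then look something like this (a purely hypothetical parameter, nothing that exists in pandas today):

# hypothetical API sketch; the on_duplicate keyword is made up
df.to_sql('my_table', engine, if_exists='append', on_duplicate='ignore')  # or 'raise' / 'replace'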

Mausy5043 avatar Apr 30 '22 13:04 Mausy5043

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.

Is there a postgresql equivalent for this?

benboughton1 avatar May 08 '22 13:05 benboughton1

Agreed that it would be good to be able to deal with this with options in df.to_sql(). Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.

Is there a postgresql equivalent for this?

Unfortunately, I couldn't find an equivalent for this on PostgreSQL when creating the table. (You can use ON CONFLICT in insert or update commands, but that's not the case here.)
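
That said, the conflict can be handled at insert time through to_sql's method= argument (available since pandas 0.24). A minimal sketch (my_table_name is a placeholder):

from sqlalchemy.dialects.postgresql import insert

def insert_do_nothing(table, conn, keys, data_iter):
    # `table` is pandas' SQLTable wrapper; table.table is the underlying SQLAlchemy Table
    rows = [dict(zip(keys, row)) for row in data_iter]
    if not rows:
        return
    stmt = insert(table.table).values(rows).on_conflict_do_nothing()
    conn.execute(stmt)

df.to_sql('my_table_name', engine, if_exists='append', index=False,
          method=insert_do_nothing)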

keivanipchihagh avatar May 20 '22 07:05 keivanipchihagh

A problem not solved from 2017 to 2022?
Since if_exists operates at the table level, an extra keyword argument like 'skip_duplicates' would also be acceptable.

redreamality avatar Jun 23 '22 10:06 redreamality