When using to_sql(), continue if duplicate primary keys are detected?

Open rosstripi opened this issue 7 years ago • 33 comments

Code Sample, a copy-pastable example if possible

df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)

Problem description

I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql() completely stops executing if even one duplicate is detected.

It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and just continue to add the new rows, not completely stop executing. For large datasets, you will likely have duplicates but want to ignore them.

Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
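
Something like this, for example (hypothetical, of course):

df.to_sql('TableNameHere', engine, if_exists='append_skipdupes', chunksize=900, index=False)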

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: English_United States.1252

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None

rosstripi avatar Apr 12 '17 23:04 rosstripi

This should support an "on duplicate update" mode as well.

rockg avatar Apr 13 '17 02:04 rockg

@rosstripi I think the idea to have this would certainly be accepted, but AFAIK the main bottleneck is an implementation for this using sql/sqlalchemy in a flavor-agnostic way. Some exploration of how this could be done is certainly welcome!
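
For reference, the flavor-specific part is roughly this: SQLAlchemy only exposes conflict handling through its dialect-specific insert constructs, so a generic implementation would need per-backend dispatch along the lines of the sketch below (a sketch only, with a made-up helper name; assumes SQLAlchemy 1.4+ for the sqlite variant):

from sqlalchemy.dialects.mysql import insert as mysql_insert
from sqlalchemy.dialects.postgresql import insert as pg_insert
from sqlalchemy.dialects.sqlite import insert as sqlite_insert

def insert_ignore_stmt(table, rows, dialect_name):
    # each backend spells "skip rows with duplicate keys" differently
    if dialect_name == "postgresql":
        return pg_insert(table).values(rows).on_conflict_do_nothing()
    if dialect_name == "sqlite":
        return sqlite_insert(table).values(rows).on_conflict_do_nothing()
    if dialect_name == "mysql":
        # INSERT IGNORE INTO ...
        return mysql_insert(table).values(rows).prefix_with("IGNORE")
    raise NotImplementedError(f"no duplicate-skipping insert for {dialect_name}")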

jorisvandenbossche avatar Apr 13 '17 09:04 jorisvandenbossche

Hi, did you figure out any workaround for this? Please let me know.

muniswamy89 avatar Jun 06 '18 16:06 muniswamy89

Any update on this implementation?

I am now facing this problem with PostgreSQL and SQLAlchemy and would love to have that "on duplicate update".

Thanks for the work

AlvaroPica avatar Dec 10 '18 11:12 AlvaroPica

A workaround would be to remove the unique index in the database:

sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"

Afterwards, df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False) can be executed.

Then just let your MySQL server add the index again and drop the duplicates:

sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"

Depending on your specific application, this can be helpful. Anyway, an if_exists option like append_skipdupes would be much better.
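
A minimal sketch of that sequence with SQLAlchemy (database, table, index, and column names are placeholders; note that ALTER IGNORE is no longer available on recent MySQL versions):

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/DATABASE")

# 1. drop the unique index so the append cannot fail on duplicate keys
with engine.begin() as conn:
    conn.execute(text("ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"))

# 2. append everything, duplicates included
df.to_sql('TABLE', engine, if_exists='append', chunksize=900, index=False)

# 3. re-create the index; IGNORE makes MySQL silently drop the duplicate rows
with engine.begin() as conn:
    conn.execute(text(
        "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` "
        "(`column_name1` ASC, `column_name2` ASC)"
    ))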

valewyss avatar Apr 16 '19 13:04 valewyss

append_skipdupes would be the perfect way to handle this.

cgi1 avatar May 14 '19 21:05 cgi1

yes, append_skipdupes +1

macdet avatar Jun 28 '19 19:06 macdet

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
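
A minimal end-to-end sketch of that approach (table and column names are just examples; assumes SQLAlchemy 1.4+):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///my.db")
with engine.begin() as conn:
    # ON CONFLICT IGNORE on the primary key turns duplicate inserts into silent no-ops
    conn.exec_driver_sql("""
        CREATE TABLE IF NOT EXISTS my_table_name (
            some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
            some_value TEXT
        )
    """)

df = pd.DataFrame({"some_kind_of_id": [1, 2], "some_value": ["a", "b"]})
df.to_sql("my_table_name", engine, if_exists="append", index=False)
# running the append again simply skips the rows that are already there
df.to_sql("my_table_name", engine, if_exists="append", index=False)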

jtkiley avatar Aug 06 '19 14:08 jtkiley

Another workaround with MariaDB and MySQL: df.to_csv("test.csv"), then use LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable or LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.

LOAD DATA is much faster than INSERT.

complete code:

from pathlib import Path

# application_path, db, table_name, df and the database connection are defined elsewhere
csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
        """.format(db=db,
                   file_path=csv_path,
                   db_table=table_name,
                   col=",".join(df.columns.tolist()))
# rq then has to be executed on a connection opened with LOCAL INFILE enabled,
# e.g. cursor.execute(rq)

netchose avatar Oct 24 '19 08:10 netchose

I believe this is being addressed in #29636 with the upsert_ignore argument, which addresses #14553.

kjford avatar Dec 09 '19 05:12 kjford

append_skipdupes +1

iveteran avatar Jun 20 '20 14:06 iveteran

+1 for append_skipdupes

grantog avatar Aug 23 '20 16:08 grantog

Agree 'append_skipdupes' should be added.

Arham-Aalam avatar Aug 26 '20 08:08 Arham-Aalam

Yes, please. 'append_skipdupes' should be added, and not only for the primary key column. If there are duplicates among other unique columns, it should also skip appending those new duplicate rows.

rahullak avatar Sep 21 '20 03:09 rahullak

+1 for append_skipdupes

devashishnyati avatar Oct 30 '20 22:10 devashishnyati

append_skipdupes +1

rishabh-vij avatar Nov 12 '20 16:11 rishabh-vij

append_skipdupes +1

mc55boy avatar Nov 21 '20 20:11 mc55boy

+1 for append_skipdupes

IsraaMa avatar Nov 29 '20 01:11 IsraaMa

In the meantime you can use this: https://pypi.org/project/pangres/

BuSHari avatar Nov 29 '20 09:11 BuSHari

+1 for append_skipdupes

kxbin avatar Jan 19 '21 09:01 kxbin

+1 for append_skipdupes

frostless avatar Mar 25 '21 23:03 frostless

+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps append_updatedupes.

singhal2 avatar Apr 07 '21 16:04 singhal2

+1

tombohub avatar Apr 14 '21 04:04 tombohub

I have made a small script for my use to allow INSERT IGNORE in MySQL:
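
(Note: these are methods of a database helper class; they assume import random, import string, import pandas as pd, and that self.conn is an open SQLAlchemy connection or engine.)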

    def save_dataframe(self, df: pd.DataFrame, table: str):
        '''
        Save dataframe to the database.
        The index is saved if it has a name; if its name is None it is not saved.
        It implements INSERT IGNORE when inserting rows into the MySQL table.
        The table needs to exist beforehand.

        Arguments:
            df {pd.DataFrame} -- dataframe to save
            table {str} -- name of the db table
        '''
        if df.index.name is None:
            save_index = False
        else:
            save_index = True

        self._insert_conflict_ignore(df=df, table=table, index=save_index)

   
    def _insert_conflict_ignore(self, df: pd.DataFrame, table: str, index: bool):
        """
        Saves dataframe to the MySQL database with 'INSERT IGNORE' query.
        
        First it uses pandas.to_sql to save to a temporary table.
        After that it uses SQL to transfer the data to the destination table, matching the columns.
        The destination table needs to exist already.
        The final step is deleting the temporary table.

        Parameters
        ----------
        df : pd.DataFrame
            dataframe to save
        table : str
            destination table name
        """
        # generate random table name for concurrent writing
        temp_table = ''.join(random.choice(string.ascii_letters) for i in range(10))
        try:
            df.to_sql(temp_table, self.conn, index=index)
            columns = self._table_column_names(table=temp_table)
            insert_query = f'INSERT IGNORE INTO {table}({columns}) SELECT {columns} FROM `{temp_table}`'
            self.conn.execute(insert_query)
        except Exception as e:
            print(e)        

        # drop temp table
        drop_query = f'DROP TABLE IF EXISTS `{temp_table}`'
        self.conn.execute(drop_query)


    def _table_column_names(self, table: str) -> str:
        """
        Get column names from database table

        Parameters
        ----------
        table : str
            name of the table

        Returns
        -------
        str
            names of columns as a string so we can interpolate into the SQL queries
        """
        query = f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'"
        rows = self.conn.execute(query)
        dirty_names = [i[0] for i in rows]
        clean_names = '`' + '`, `'.join(map(str, dirty_names)) + '`'
        return clean_names

https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea

tombohub avatar Apr 20 '21 21:04 tombohub

+1 for append_skipdupes

tinglinliu avatar Jan 12 '22 05:01 tinglinliu

Rather than upvoting this issue, which already has a lot of votes, someone could help with this PR: https://github.com/pandas-dev/pandas/pull/29636

hyamanieu avatar Jan 12 '22 06:01 hyamanieu

Instead of only skipping duplicates, an option to choose between raise, ignore and replace would be even better. That way you can choose to have an exception raised, skip the duplicate, or have the duplicate row removed and the new data inserted.
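
For illustration, the call might then look something like this (a purely hypothetical parameter, nothing that exists in pandas today):

# hypothetical API sketch; the on_duplicate keyword is made up
df.to_sql('my_table', engine, if_exists='append', on_duplicate='ignore')  # or 'raise' / 'replace'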

Mausy5043 avatar Apr 30 '22 13:04 Mausy5043

Agreed that it would be good to be able to deal with this with options in df.to_sql().

Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.

Is there a postgresql equivalent for this?

benboughton1 avatar May 08 '22 13:05 benboughton1

Agreed that it would be good to be able to deal with this with options in df.to_sql(). Here's the workaround I use in sqlite:

CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    ...

Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.

Is there a postgresql equivalent for this?

Unfortunately, I couldn't find an equivalent for this on PostgreSQL when creating the table. (You can use ON CONFLICT in insert or update commands, but that's not the case here.)
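
That said, the conflict can be handled at insert time through to_sql's method= argument (available since pandas 0.24). A minimal sketch (my_table_name is a placeholder):

from sqlalchemy.dialects.postgresql import insert

def insert_do_nothing(table, conn, keys, data_iter):
    # `table` is pandas' SQLTable wrapper; table.table is the underlying SQLAlchemy Table
    rows = [dict(zip(keys, row)) for row in data_iter]
    if not rows:
        return
    stmt = insert(table.table).values(rows).on_conflict_do_nothing()
    conn.execute(stmt)

df.to_sql('my_table_name', engine, if_exists='append', index=False,
          method=insert_do_nothing)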

keivanipchihagh avatar May 20 '22 07:05 keivanipchihagh

A problem not solved from 2017 to 2022?
Since if_exists operates at the table level, an extra keyword argument like 'skip_duplicates' would also be acceptable.

redreamality avatar Jun 23 '22 10:06 redreamality