pandas
When using to_sql(), continue if duplicate primary keys are detected?
Code Sample, a copy-pastable example if possible
df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)
Problem description
I am trying to append a large DataFrame to a SQL table. Some of the rows in the DataFrame are duplicates of those in the SQL table, some are not. But to_sql()
completely stops executing if even one duplicate is detected.
It would make sense for to_sql(if_exists='append') to merely warn the user which rows had duplicate keys and continue adding the new rows, rather than stop executing entirely. For large datasets you will likely have duplicates but want to ignore them.
Maybe add an argument to ignore duplicates and keep executing? Perhaps an additional if_exists option like 'append_skipdupes'?
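In the meantime, the proposed behaviour can be approximated client-side: read back the keys that already exist and drop them from the DataFrame before appending. A minimal sketch against an in-memory SQLite database (table and column names are illustrative; note there is a race window if another writer inserts rows between the read and the append):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'old')")
conn.commit()

df = pd.DataFrame({"id": [1, 2, 3], "val": ["dup", "new", "newer"]})

# Keep only rows whose primary key is not already in the table,
# then append the remainder -- to_sql() never sees a duplicate.
existing = pd.read_sql("SELECT id FROM t", conn)["id"]
df[~df["id"].isin(existing)].to_sql("t", conn, if_exists="append", index=False)
```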
Output of pd.show_versions()
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: None
This should also support an "on duplicate update" mode.
@rosstripi I think the idea to have this would certainly be accepted, but AFAIK the main bottleneck is an implementation for this using sql/sqlalchemy in a flavor-agnostic way. Some exploration of how this could be done is certainly welcome!
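One per-flavor escape hatch that exists today (pandas >= 0.24) is the method argument of to_sql(), which accepts a callable and hands it the rows to insert. A sketch for SQLite, where skipping duplicates is spelled INSERT OR IGNORE (a MySQL version would use INSERT IGNORE, PostgreSQL INSERT ... ON CONFLICT DO NOTHING); table and column names here are illustrative:

```python
import sqlite3
import pandas as pd

def insert_or_ignore(table, conn, keys, data_iter):
    # 'conn' is a DBAPI cursor when to_sql() runs in its sqlite3
    # fallback mode; 'keys' are the column names being inserted.
    cols = ", ".join(keys)
    marks = ", ".join("?" for _ in keys)
    sql = f"INSERT OR IGNORE INTO {table.name} ({cols}) VALUES ({marks})"
    conn.executemany(sql, list(data_iter))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO t VALUES (1, 'old')")
conn.commit()

df = pd.DataFrame({"id": [1, 2], "val": ["dup", "new"]})
# The duplicate key 1 is skipped silently; key 2 is appended.
df.to_sql("t", conn, if_exists="append", index=False, method=insert_or_ignore)
```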
Hi, did you figure out any workaround for this? Please let me know.
Any update on this implementation?
I am now facing this problem with PostgreSQL and SQLAlchemy and would love to have that "on duplicate update".
Thanks for the work
A workaround would be to remove the unique index in the database:
sqlquery = "ALTER TABLE `DATABASE`.`TABLE` DROP INDEX `idx_name`"
Afterwards
df.to_sql('TableNameHere', engine, if_exists='append', chunksize=900, index=False)
can be executed.
Then let your MySQL server add the index again and drop the duplicates:
sqlquery = "ALTER IGNORE TABLE `DATABASE`.`TABLE` ADD UNIQUE INDEX `idx_name` (`column_name1` ASC, `column_name2` ASC, `column_name3` [ASC | DESC])"
(Note: MySQL identifiers are quoted with backticks, not single quotes, and ALTER IGNORE TABLE was removed in MySQL 5.7, so on newer servers the duplicates have to be deleted before the index is re-added.)
Depending on your specific application, this can be helpful.
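On engines without ALTER IGNORE the same drop-append-dedupe-reindex sequence still works, with the deduplication done by an explicit DELETE. Here it is end to end in stdlib SQLite (table, index, and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_id ON t (id)")
conn.execute("INSERT INTO t VALUES (1, 'a')")

# 1. Drop the unique index so appends containing duplicates succeed.
conn.execute("DROP INDEX idx_id")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 'dup'), (2, 'b')])

# 2. Remove duplicate keys (keeping the first physical row),
#    then recreate the unique index.
conn.execute(
    "DELETE FROM t WHERE rowid NOT IN (SELECT MIN(rowid) FROM t GROUP BY id)"
)
conn.execute("CREATE UNIQUE INDEX idx_id ON t (id)")
conn.commit()
```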
Anyway, an if_exists option like append_skipdupes would be much better.
append_skipdupes would be the perfect way to handle this.
yes, append_skipdupes +1
Agreed that it would be good to be able to deal with this with options in df.to_sql().
Here's the workaround I use in sqlite:
CREATE TABLE IF NOT EXISTS my_table_name (
some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
...
Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
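The SQLite workaround above also plays well with to_sql(): because the conflict handling lives in the table definition, a plain append no longer raises on duplicates. A self-contained demonstration (column names are illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
# The ON CONFLICT IGNORE clause on the primary key makes duplicate
# inserts a silent no-op at the table level.
conn.execute("""CREATE TABLE IF NOT EXISTS my_table_name (
    some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE,
    payload TEXT
)""")
conn.execute("INSERT INTO my_table_name VALUES (1, 'original')")
conn.commit()

df = pd.DataFrame({"some_kind_of_id": [1, 2], "payload": ["dup", "new"]})
# id 1 is silently ignored; only id 2 is appended -- no exception raised.
df.to_sql("my_table_name", conn, if_exists="append", index=False)
```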
Another workaround with MariaDB and MySQL:
df.to_csv("test.csv")
then use :
LOAD DATA INFILE 'test.csv' IGNORE INTO TABLE mytable
or
LOAD DATA INFILE 'test.csv' REPLACE INTO TABLE mytable.
LOAD DATA is much faster than INSERT.
complete code:
csv_path = str(Path(application_path) / "tmp" / "tmp.csv").replace("\\", "\\\\")
df.to_csv(csv_path, index=False, sep='\t', quotechar="'", na_rep=r'\N')
rq = """LOAD DATA LOCAL INFILE '{file_path}' REPLACE INTO TABLE {db}.{db_table}
        LINES TERMINATED BY '\\r\\n'
        IGNORE 1 LINES
        ({col});
""".format(db=db,
           file_path=csv_path,
           db_table=table_name,
           col=",".join(df.columns.tolist()))
I believe this is being addressed in #29636 with the upsert_ignore argument, which addresses #14553.
append_skipdupes +1
+1 for append_skipdupes
Agree 'append_skipdupes' should be added.
Yes, please. 'append_skipdupes' should be added and not only for the Primary Key column. If there are duplicates among other Unique columns also it should skip appending those new duplicate rows.
+1 for append_skipdupes
append_skipdupes +1
append_skipdupes +1
+1 for append_skipdupes
In the meantime you can use pangres: https://pypi.org/project/pangres/
+1 for append_skipdupes
+1 for append_skipdupes
+1 for append_skipdupes. IMO, an option to update the duplicates would also be nice. Perhaps append_updatedupes.
+1
I have made a small script for my own use that allows INSERT IGNORE in MySQL:
def save_dataframe(self, df: pd.DataFrame, table: str):
    '''
    Save dataframe to the database.
    The index is saved if it has a name; if it is None it will not be saved.
    Implements INSERT IGNORE when inserting rows into the MySQL table.
    The table needs to exist beforehand.

    Arguments:
        df {pd.DataFrame} -- dataframe to save
        table {str} -- name of the db table
    '''
    save_index = df.index.name is not None
    self._insert_conflict_ignore(df=df, table=table, index=save_index)

def _insert_conflict_ignore(self, df: pd.DataFrame, table: str, index: bool):
    """
    Saves dataframe to the MySQL database with an 'INSERT IGNORE' query.

    First it uses pandas.to_sql to save to a temporary table.
    After that it uses SQL to transfer the data to the destination table,
    matching the columns. The destination table needs to exist already.
    The final step is deleting the temporary table.

    Parameters
    ----------
    df : pd.DataFrame
        dataframe to save
    table : str
        destination table name
    """
    # generate a random table name for concurrent writing
    temp_table = ''.join(random.choice(string.ascii_letters) for i in range(10))
    try:
        df.to_sql(temp_table, self.conn, index=index)
        columns = self._table_column_names(table=temp_table)
        insert_query = f'INSERT IGNORE INTO {table}({columns}) SELECT {columns} FROM `{temp_table}`'
        self.conn.execute(insert_query)
    except Exception as e:
        print(e)

    # drop temp table
    drop_query = f'DROP TABLE IF EXISTS `{temp_table}`'
    self.conn.execute(drop_query)

def _table_column_names(self, table: str) -> str:
    """
    Get column names from a database table.

    Parameters
    ----------
    table : str
        name of the table

    Returns
    -------
    str
        names of columns as a string so we can interpolate into the SQL queries
    """
    query = f"SELECT column_name FROM information_schema.columns WHERE table_name = '{table}'"
    rows = self.conn.execute(query)
    dirty_names = [i[0] for i in rows]
    clean_names = '`' + '`, `'.join(map(str, dirty_names)) + '`'
    return clean_names
https://gist.github.com/tombohub/0c666583c48c1686c736ae2eb76cb2ea
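The same temp-table pattern can be shown end to end in stdlib SQLite, where MySQL's INSERT IGNORE is spelled INSERT OR IGNORE: write the frame to a randomly named temporary table with to_sql, copy it across while skipping duplicate keys, then drop the temporary table (table and column names are illustrative):

```python
import random
import string
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO dest VALUES (1, 'kept')")
conn.commit()

df = pd.DataFrame({"id": [1, 2], "val": ["dup", "new"]})

# Random temp-table name so concurrent writers don't collide.
temp_table = "".join(random.choice(string.ascii_letters) for _ in range(10))
try:
    df.to_sql(temp_table, conn, index=False)
    # SQLite spells MySQL's INSERT IGNORE as INSERT OR IGNORE.
    conn.execute(f'INSERT OR IGNORE INTO dest SELECT id, val FROM "{temp_table}"')
finally:
    conn.execute(f'DROP TABLE IF EXISTS "{temp_table}"')
conn.commit()
```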
+1 for append_skipdupes
Rather than upvoting this issue, which already has a lot of votes, someone could help with this PR: https://github.com/pandas-dev/pandas/pull/29636
Instead of skipping duplicates, an option to choose between raise, ignore and replace would be even better. This way you can choose to have an exception raised, skip the duplicate, or have the duplicate row removed and the new data inserted.
Agreed that it would be good to be able to deal with this with options in df.to_sql(). Here's the workaround I use in sqlite:
CREATE TABLE IF NOT EXISTS my_table_name ( some_kind_of_id INT PRIMARY KEY ON CONFLICT IGNORE, ...
Then, when I insert duplicates, they get silently ignored, and the non-duplicates are processed correctly. In my case, the data are (i.e. should be) static, so I don't need to update. It's just that the form of the data feed is such that I'll get duplicates that are ignorable.
Is there a postgresql equivalent for this?
Unfortunately, I couldn't find an equivalent for this on PostgreSQL when creating the table. (You can use this in insert or update commands but that's not the case here)
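Right: PostgreSQL has no table-level conflict clause, but since 9.5 the per-statement form INSERT ... ON CONFLICT DO NOTHING gives the same effect, and it could be wired into to_sql() through its method argument. A minimal sketch that only assembles such a statement (table and column names are illustrative; %s placeholders as used by psycopg2):

```python
def on_conflict_do_nothing_sql(table, columns):
    # Build a psycopg2-style INSERT that silently skips rows whose
    # key already exists, instead of raising an IntegrityError.
    cols = ", ".join(columns)
    marks = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({cols}) VALUES ({marks}) ON CONFLICT DO NOTHING"

sql = on_conflict_do_nothing_sql("my_table_name", ["some_kind_of_id", "payload"])
print(sql)
```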
A problem not solved from 2017 to 2022?
if_exists operates at the table level; an extra keyword argument like 'skip_duplicates' would also be acceptable.