bug: `read_parquet` and similar methods silently overwrite tables
What happened?
Here's an example derived from a user issue:
>>> import ibis
>>> t = ibis.examples.penguins.fetch()
>>> t.to_parquet("p1.parquet")
>>> t.rename("2_{name}").to_parquet("p2.parquet")
>>> con = ibis.duckdb.connect()
>>> p1 = con.read_parquet("p1.parquet", table_name="p")
>>> con.tables
Tables
------
- p
>>> p2 = con.read_parquet("p2.parquet", table_name="p")
>>> con.tables # This won't look good...
Tables
------
- p
>>> p1.join(p2, p1["species"] == p2["2_species"]).execute() # Ibis compiled the expression, but it's not valid DuckDB, since the table definition got overridden
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/joins.py", line 190, in wrapper
return method(self._finish(), *args, **kwargs)
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/core.py", line 393, in execute
return self._find_backend(use_default=True).execute(
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1349, in execute
table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1283, in _to_duckdb_relation
return self.con.sql(sql)
duckdb.duckdb.BinderException: Binder Error: Values list "t2" does not have a column named "species"
>>> p1.join(p2, p1["2_species"] == p2["2_species"]).execute() # Ibis won't be happy, because it still thinks p1 is the table that isn't renamed
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 775, in __getitem__
values = self.bind(args)
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 227, in bind
values.extend(bind(self, arg))
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 98, in bind
yield ops.Field(table, value).to_expr()
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/common/bases.py", line 72, in __call__
return cls.__create__(*args, **kwargs)
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/common/grounds.py", line 120, in __create__
return super().__create__(**kwargs)
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/operations/relations.py", line 86, in __init__
raise IbisTypeError(
ibis.common.exceptions.IbisTypeError: Column '2_species' is not found in table. Existing columns: 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'.
>>> p1["species"].execute() # Even simple stuff will fail
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/core.py", line 393, in execute
return self._find_backend(use_default=True).execute(
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1349, in execute
table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1283, in _to_duckdb_relation
return self.con.sql(sql)
duckdb.duckdb.BinderException: Binder Error: Values list "t0" does not have a column named "species"
Shouldn't read_parquet, etc. just have an overwrite option? If you don't specify that, it should not let you clobber an existing table.
What version of ibis are you using?
9.0.0 (initially reported on 8.0.0)
What backend(s) are you using, if any?
DuckDB, but can be replicated on other backends
Relevant log output
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
This does seem like a potential footgun.
I think with DuckDB, at least, read parquet wouldn't overwrite an existing table, but read_parquet is creating views and we're likely doing a create or replace.
I think an overwrite kwarg is a good idea
Just to spell this out concretely:
There should be an overwrite kwarg to read_parquet (and read_csv, read_json, etc...) that defaults to False.
In terms of creating tables / views, we should default to CREATE VIEW name AS ... and only throw in a CREATE OR REPLACE if overwrite=True
@gforsyth Hi I can help some of them
But have a question read parquet in duckdb used by register also, in this case should we also update register or should we set default param for overwrite to True
HI @alperentahta -- the default for overwrite should be False -- also, register is now deprecated. I think whether a user calls register or read_parquet, we should not clobber existing tables.
@gforsyth thanks for the response, when I keep default register False, unit tests for register is failing, (they are getting the exception we throw during the cases overwrite=False) in this case what should I do?
How about we just switch to CREATE VIEW and not add the complexity of a keyword argument?
If people want to drop a table or view first we have a method for that.
@cpcloud Are you talking about doing this inside read parquet, or completely removing table generation capability from read_parquet?
This would be inside read_parquet. Table name generation would remain unchanged when table_name is unspecified by the user.