ibis icon indicating copy to clipboard operation
ibis copied to clipboard

bug: `read_parquet` and similar methods silently overwrite tables

Open deepyaman opened this issue 1 year ago • 8 comments

What happened?

Here's an example derived from a user issue:

>>> import ibis
>>> t = ibis.examples.penguins.fetch()
>>> t.to_parquet("p1.parquet")
>>> t.rename("2_{name}").to_parquet("p2.parquet")
>>> con = ibis.duckdb.connect()
>>> p1 = con.read_parquet("p1.parquet", table_name="p")
>>> con.tables
Tables
------
- p
>>> p2 = con.read_parquet("p2.parquet", table_name="p")
>>> con.tables  # This won't look good...
Tables
------
- p
>>> p1.join(p2, p1["species"] == p2["2_species"]).execute()  # Ibis compiled the expression, but it's not valid DuckDB, since the table definition got overridden
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/joins.py", line 190, in wrapper
    return method(self._finish(), *args, **kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/core.py", line 393, in execute
    return self._find_backend(use_default=True).execute(
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1349, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1283, in _to_duckdb_relation
    return self.con.sql(sql)
duckdb.duckdb.BinderException: Binder Error: Values list "t2" does not have a column named "species"
>>> p1.join(p2, p1["2_species"] == p2["2_species"]).execute()  # Ibis won't be happy, because it still thinks p1 is the table that isn't renamed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 775, in __getitem__
    values = self.bind(args)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 227, in bind
    values.extend(bind(self, arg))
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/relations.py", line 98, in bind
    yield ops.Field(table, value).to_expr()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/common/bases.py", line 72, in __call__
    return cls.__create__(*args, **kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/common/grounds.py", line 120, in __create__
    return super().__create__(**kwargs)
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/operations/relations.py", line 86, in __init__
    raise IbisTypeError(
ibis.common.exceptions.IbisTypeError: Column '2_species' is not found in table. Existing columns: 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'.
>>> p1["species"].execute()  # Even simple stuff will fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/expr/types/core.py", line 393, in execute
    return self._find_backend(use_default=True).execute(
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1349, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
  File "/opt/miniconda3/envs/ibisml/lib/python3.10/site-packages/ibis/backends/duckdb/__init__.py", line 1283, in _to_duckdb_relation
    return self.con.sql(sql)
duckdb.duckdb.BinderException: Binder Error: Values list "t0" does not have a column named "species"

Shouldn't read_parquet, etc. just have an overwrite option? If you don't specify that, it should not let you clobber an existing table.

What version of ibis are you using?

9.0.0 (initially reported on 8.0.0)

What backend(s) are you using, if any?

DuckDB, but can be replicated on other backends

Relevant log output

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

deepyaman avatar May 15 '24 22:05 deepyaman

This does seem like a potential footgun.

I think with DuckDB, at least, read parquet wouldn't overwrite an existing table, but read_parquet is creating views and we're likely doing a create or replace.

I think an overwrite kwarg is a good idea

gforsyth avatar May 16 '24 12:05 gforsyth

Just to spell this out concretely:

There should be an overwrite kwarg to read_parquet (and read_csv, read_json, etc...) that defaults to False.

In terms of creating tables / views, we should default to CREATE VIEW name AS ... and only throw in a CREATE OR REPLACE if overwrite=True

gforsyth avatar May 22 '24 18:05 gforsyth

@gforsyth Hi I can help some of them

But have a question read parquet in duckdb used by register also, in this case should we also update register or should we set default param for overwrite to True

alperentahta avatar Jun 08 '24 17:06 alperentahta

HI @alperentahta -- the default for overwrite should be False -- also, register is now deprecated. I think whether a user calls register or read_parquet, we should not clobber existing tables.

gforsyth avatar Jun 10 '24 13:06 gforsyth

@gforsyth thanks for the response, when I keep default register False, unit tests for register is failing, (they are getting the exception we throw during the cases overwrite=False) in this case what should I do?

alperentahta avatar Jun 10 '24 13:06 alperentahta

How about we just switch to CREATE VIEW and not add the complexity of a keyword argument?

If people want to drop a table or view first we have a method for that.

cpcloud avatar Jun 10 '24 13:06 cpcloud

@cpcloud Are you talking about doing this inside read parquet, or completely removing table generation capability from read_parquet?

alperentahta avatar Jun 10 '24 13:06 alperentahta

This would be inside read_parquet. Table name generation would remain unchanged when table_name is unspecified by the user.

cpcloud avatar Jun 10 '24 13:06 cpcloud