feat: Better error when creating table with conflicting names due to capitalization
Is your feature request related to a problem?
Ibis treats columns named "a" and "A" differently, and you can have both in a relation. I think this makes, some backends support this.
However, there are some sharp edges when you try to use this with a backend that DOESN'T distinguish between "A" and "a", such as duckdb:
import ibis
t = ibis.memtable({"a": [1, 2, 3]})
t = t.mutate(A=t.a)
t.cache()
Details
---------------------------------------------------------------------------
CatalogException Traceback (most recent call last)
Cell In[2], line 7
5 t = ibis.memtable({"a": [1, 2, 3]})
6 t = t.mutate(A=t.a)
----> 7 t.cache()
File ~/code/ibis/ibis/expr/types/relations.py:3531, in Table.cache(self)
3458 """Cache the provided expression.
3459
3460 All subsequent operations on the returned expression will be performed
(...)
3528 └─────────┴───────────┴────────────────┴───────────────┴───────────────────┴───┘
3529 """
3530 current_backend = self._find_backend(use_default=True)
-> 3531 return current_backend._cached_table(self)
File ~/code/ibis/ibis/backends/__init__.py:772, in CacheHandler._cached_table(self, table)
770 entry = self._cache_op_to_entry.get(table.op())
771 if entry is None or (cached_op := entry.cached_op_ref()) is None:
--> 772 cached_op = self._create_cached_table(util.gen_name("cached"), table).op()
773 entry = CacheEntry(
774 table.op(),
775 weakref.ref(cached_op),
(...)
778 ),
779 )
780 self._cache_op_to_entry[table.op()] = entry
File ~/code/ibis/ibis/backends/__init__.py:805, in CacheHandler._create_cached_table(self, name, expr)
804 def _create_cached_table(self, name: str, expr: ir.Table) -> ir.Table:
--> 805 return self.create_table(name, expr, schema=expr.schema(), temp=True)
File ~/code/ibis/ibis/backends/duckdb/__init__.py:197, in Backend.create_table(self, name, obj, schema, database, temp, overwrite)
193 # This is the same table as initial_table unless overwrite == True
194 final_table = sg.table(
195 name, catalog=catalog, db=database, quoted=self.compiler.quoted
196 )
--> 197 with self._safe_raw_sql(create_stmt) as cur:
198 if query is not None:
199 insert_stmt = sge.insert(query, into=initial_table).sql(self.name)
File ~/.pyenv/versions/3.11.6/lib/python3.11/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
135 del self.args, self.kwds, self.func
136 try:
--> 137 return next(self.gen)
138 except StopIteration:
139 raise RuntimeError("generator didn't yield") from None
File ~/code/ibis/ibis/backends/duckdb/__init__.py:319, in Backend._safe_raw_sql(self, *args, **kwargs)
317 @contextlib.contextmanager
318 def _safe_raw_sql(self, *args, **kwargs):
--> 319 yield self.raw_sql(*args, **kwargs)
File ~/code/ibis/ibis/backends/duckdb/__init__.py:97, in Backend.raw_sql(self, query, **kwargs)
95 with contextlib.suppress(AttributeError):
96 query = query.sql(dialect=self.name)
---> 97 return self.con.execute(query, **kwargs)
CatalogException: Catalog Error: Column with name A already exists!
related issues, though I don't think exactly the same:
- https://github.com/ibis-project/ibis/issues/9853
- https://github.com/ibis-project/ibis/issues/6772
IDK, this may also just be a bug with the duckdb backend, not a more general feature request.
What is the motivation behind your request?
I had a bug in my code where I accidentally created both columns "A" and "a". Then, I got this cryptic error later.
Describe the solution you'd like
Not sure exactly what to do here, but a few ideas:
- If a table is bound to a backend that doesn't distinguish between upper/lower (eg duckdb), could we fail earlier? Probably not a good idea, I would think we want all expressions to be portable between backends.
- catch this error whenever we actually interact with the backend, and give a nicer error there. I think this would be the solution I prefer.
What version of ibis are you running?
main
What backend(s) are you using, if any?
duckdb, not sure the story on other backends.
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
To do this, we'd have to encode the logic of every backend's identifier casing rules inside of Ibis.
DuckDB for some reason has decided on this--in my opinion--extremely confusing hybrid that makes it really difficult to predict when things will 1) work, 2) not fail but generate a deduplicated name with a suffix, or 3) utterly fail with an inscrutable error like that one you're showing.
While I sympathize with the ask here and I wish humanity would really just stop already with case-insensitive things in computers, I don't think the work and continued maintenance effort is justified.
Ok, if it's a big burden then I think it's fine to just punt, it's probably not THAT common. Maybe I can fix up this issue title to be SEO so others can find it.
Would it make sense to make a feature request for this in sqlglot? I haven't really looked at their architecture, but maybe since they already have so many per-backend rules, eg quoting, then this wouldn't be as hard for them as it would be for us?
I don't think it would make sense in sqlglot, as they generally aren't trying to unify database semantics where it's not easily possible.