databricks-sql-python
How to deal with struct types in sqlalchemy
Hi! First of all, thank you for the awesome work so far. This really improves the development experience for Databricks and finally allows tools such as Alembic to work with Delta tables :)
However, we have some tables that contain nested columns (with struct types) and I don't see support for this right now. Is this correct or am I missing something?
Thanks in advance!
I don't see support for this right now.
Correct, the SQLAlchemy integration currently doesn't expose STRUCT(), MAP(), or ARRAY() types. However, we're actively investigating how to introduce this. I'm going to leave this issue open on the repository so that other searchers can find it.
Would you please reach out to us at [email protected] so we can gather more information about your use-case?
Thanks for the quick response. I sent an email to the above-mentioned address!
@susodapop @kravets-levko
I have the same issue. I have a table with fields of Array and Struct types. When making migrations, and especially when running:
Base.metadata.create_all(db_engine)
it fails with something like "Array type not supported" and the table unfortunately cannot be created.
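For context, a minimal sketch of the kind of model that triggers this for me (table and column names are made up, and the exact error wording may differ):

from sqlalchemy import ARRAY, Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class EventModel(Base):
    __tablename__ = "events"  # made-up table name

    id = Column(Integer, primary_key=True)
    tags = Column(ARRAY(String))  # an ARRAY (or STRUCT-like) column is what breaks create_all

# Fails because the dialect has no way to render the ARRAY column type:
# Base.metadata.create_all(db_engine)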
Please, can this feature be added? I think it would be really helpful for setting up tables and running migrations.
Would using composites be a potential approach? https://docs.sqlalchemy.org/en/14/orm/composites.html#sqlalchemy.orm.composite
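For reference, composite() maps a single Python object over several ordinary flat columns, so it would flatten the struct into plain columns on the Databricks side rather than store a real STRUCT. A minimal sketch along the lines of the 1.4 docs (class and table names are made up):

from sqlalchemy import Column, Integer
from sqlalchemy.orm import composite, declarative_base

Base = declarative_base()

class Point:
    # Plain Python object rebuilt from two flat columns
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __composite_values__(self):
        # Tells SQLAlchemy how to split the object back into its columns
        return self.x, self.y

class Vertex(Base):
    __tablename__ = "vertices"

    id = Column(Integer, primary_key=True)
    start_x = Column(Integer)
    start_y = Column(Integer)
    start = composite(Point, start_x, start_y)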
If we want to keep the struct type as some dict or JSON type only on the SQLAlchemy side, I think the approach is more trivial, as the only thing that needs to be correctly specified is the column spec. An implementation of this would look something like this:
from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Define inner nested columns
inner_nested_columns = [
    Column("inner_nested_column1", Integer),
    Column("inner_nested_column2", Integer),
    # Add more inner nested columns here if needed
]

# Define outer nested columns with another level of nesting
outer_nested_columns = [
    Column("outer_nested_column", Integer),
    Column("inner_struct_field", StructField(inner_nested_columns)),
    # Add more outer nested columns here if needed
]

# Use StructField (the proposed new type, not something that exists today) in your model
class ExampleModel(Base):
    __tablename__ = "example_table"

    id = Column(Integer, primary_key=True)
    data = Column(StructField(outer_nested_columns))
@DaanRademaker
I think the approach is more trivial, as the only thing that needs to be correctly specified is the column spec
It's a bit more complicated than that, unfortunately. You're correct that implementing a STRUCT (and also ARRAY) type in SQLAlchemy is fairly trivial. But the limiting factor is deeper in the stack. Namely, Databricks compute doesn't support binding complex types as parameters when running queries. You can see this in https://github.com/databricks/databricks-sql-python/blob/main/src/databricks/sql/parameters/native.py where there is a native parameter binding object for all of the DatabricksSupportedType enum members except for STRUCT, MAP, and ARRAY (the complex types).
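To make that concrete, here's roughly what parameter binding looks like from the connector side (connection details elided; the named-parameter syntax assumes connector v3+). Scalars bind natively, but a dict or list has no binding object to map to:

from databricks import sql

with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        # Scalar parameters (int, str, float, datetime, ...) bind natively
        cursor.execute("SELECT * FROM some_table WHERE id = :id", {"id": 42})

        # Something like this has no native binding object today, so it cannot work:
        # cursor.execute(
        #     "INSERT INTO some_table (id, data) VALUES (:id, :data)",
        #     {"id": 42, "data": {"inner": 1}},  # dict -> STRUCT is not supported
        # )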
Without support for native complex type parameter binding in databricks-sql-connector, properly supporting these types in the SQLAlchemy dialect is a no-go. We could implement the TypeDecorator code in a couple of hours. But the moment you run it, the SQL warehouse will throw a ServerOperationError due to the unrecognised parameter type.
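To illustrate, the TypeDecorator really is the easy part. A purely hypothetical struct decorator could look something like the sketch below (String as the impl is just a placeholder to keep the sketch valid), but as soon as a dict reaches the connector as a bound parameter, native mode has nothing to bind it with:

from sqlalchemy.types import String, TypeDecorator

class DatabricksStruct(TypeDecorator):
    # Hypothetical sketch only; this is not part of the dialect
    impl = String
    cache_ok = True

    def process_bind_param(self, value, dialect):
        # In native parameter mode there is nothing sensible to return here:
        # the warehouse has no STRUCT parameter type to bind a dict to
        return value

    def process_result_value(self, value, dialect):
        return value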
There is a possible workaround, however: disable native parameters when using the SQLAlchemy dialect. As you may know, databricks-sql-connector actually supports two methods of parameter binding: native and inline (read about these here). Native parameter binding is used by default. With inline parameters, the values are serialized into the query text. For example, datetime(1991, 8, 3) becomes '1991-08-03'.
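If memory serves, the switch lives on the connector's connect() call, so disabling native parameters looks roughly like this (whether the SQLAlchemy dialect forwards the flag cleanly is a separate question):

from databricks import sql

# use_inline_params=True opts back into inline parameter binding;
# "silent" does the same without emitting the accompanying warning
connection = sql.connect(
    server_hostname="...",
    http_path="...",
    access_token="...",
    use_inline_params=True,
)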
In inline mode, complex type binding would be possible for the same reason that you can write plain-text SQL that fetches and manipulates complex types (the syntax for which is described here). To support this, we'd need to teach databricks-sql-connector how to write that syntax when serialising a dict as a STRUCT in a SQL query. Which is complicated, but doable. The SQL syntax is almost JSON, but not quite 😆
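To show what I mean: Databricks SQL builds these literals with constructor functions rather than JSON, so a hypothetical inline serialiser would have to emit something like named_struct(...) instead of {...}. A rough, purely illustrative sketch (flat dicts of strings and numbers only, no escaping):

def serialize_struct_inline(value: dict) -> str:
    # {"a": 1, "b": "x"} -> "named_struct('a', 1, 'b', 'x')"
    parts = []
    for key, item in value.items():
        parts.append(f"'{key}'")
        parts.append(f"'{item}'" if isinstance(item, str) else str(item))
    return f"named_struct({', '.join(parts)})"

# serialize_struct_inline({"a": 1, "b": "x"})  ->  named_struct('a', 1, 'b', 'x')
# whereas the JSON form would be:                  {"a": 1, "b": "x"}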
That may be the direction Databricks chooses to go here. Although it's an uncomfortable one, since inline parameters are a security compromise. The connector does its best to avoid SQL injection by escaping parameterised inputs client-side. But native parameters are truly the safest option, and also tend to be faster to execute than their inline counterparts.
The above discussion is what primarily blocked this effort when I was working on this connector day-to-day. It wasn't a resourcing issue but rather one of server support. Last I heard, complex type parameter binding was being designed so hope was on the horizon.
Also, just to note: even though you can't bind complex types, the connector already supports SELECTing complex types. If you select a MAP field you'll get a numpy.ndarray; STRUCT and ARRAY come back as Arrow types. For now, that mapping only works in one direction (Databricks → client).
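So something along these lines already works today (the exact Python types you get back may vary by connector version):

from databricks import sql

with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        # Reading complex types works; only binding them as parameters is blocked
        cursor.execute(
            "SELECT named_struct('a', 1) AS s, array(1, 2, 3) AS arr, map('k', 'v') AS m"
        )
        row = cursor.fetchone()
        print(type(row[0]), type(row[1]), type(row[2]))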
It's a bit more complicated than that, unfortunately.
I had a feeling that might be the case xD. Thank you for the extensive explanation. I learned a lot more about the internal workings of this package.
There is a possible workaround, however: disable native parameters when using the SQLAlchemy dialect.
Although I think the documentation says this will be removed in some future release? This would mean building on a soon-to-be-deprecated feature from the start.
I guess the best order would perhaps be:
- Wait for the complex type parameter bindings to be designed? Is this on the roadmap somewhere?
- When these are available, add support for complex types in the SQLAlchemy dialect?
To speak more holistically, there are three things that need to be updated to support complex types:
SQLAlchemy dialect → databricks-sql-connector → Databricks compute
the documentation says this will be removed in some future release?
Yep. The last I was aware (six months ago), this was considered a viable short-term solution, despite the forthcoming deprecation, due to the complexity of implementing native complex type binding.
Is this on the roadmap somewhere?
Your guess is as good as mine. I'd speak to someone in Databricks support so you can register your interest in this feature. I'm not sure how much this issue tracker is monitored at present.