How to deal with struct types in sqlalchemy

Open DaanRademaker opened this issue 1 year ago • 8 comments

Hi! First of all, thank you for the awesome work so far. This really improves the development experience for Databricks and finally allows tools such as alembic to work with Delta tables :)

However, we have some tables that contain nested columns (with struct types) and I don't see support for this right now. Is this correct or am I missing something?

Thanks in advance!

DaanRademaker avatar Jan 19 '24 15:01 DaanRademaker

I don't see support for this right now.

Correct, the SQLAlchemy integration currently doesn't expose STRUCT(), MAP(), or ARRAY() types. However, we're actively investigating how to introduce them. I'm going to leave this issue open on the repository so that other searchers can find it.

Would you please reach out to us at [email protected] so we can gather more information about your use-case?

susodapop avatar Jan 19 '24 20:01 susodapop

Thanks for the quick response. I sent an email to the above-mentioned address!

DaanRademaker avatar Jan 22 '24 07:01 DaanRademaker

@susodapop @kravets-levko I have the same issue. I have a table with fields of Array type and Struct type. When making migrations, and especially when using Base.metadata.create_all(db_engine), it fails with something like "Array type not supported" and the table cannot be created, unfortunately. Please can this feature be added? I think it would be really helpful in setting up tables and making migrations.
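For illustration, a minimal sketch of the kind of setup that fails (the model, column names, and connection URL here are assumptions, not from the original report):

from sqlalchemy import ARRAY, Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Event(Base):
    __tablename__ = "events"
    id = Column(Integer, primary_key=True)
    # The dialect has no DDL rendering for ARRAY (or STRUCT/MAP) columns
    tags = Column(ARRAY(String))

# Hypothetical connection details
engine = create_engine("databricks://token:<token>@<host>?http_path=<http-path>")
Base.metadata.create_all(engine)  # fails: Array type not supported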

Henadek avatar May 15 '24 07:05 Henadek

Would using composites be a potential approach? https://docs.sqlalchemy.org/en/14/orm/composites.html#sqlalchemy.orm.composite
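For reference, a minimal sketch of what the composite approach from those docs looks like (illustrative names; note that a composite maps a Python object onto several ordinary flat columns rather than onto a real STRUCT column):

from sqlalchemy import Column, Integer
from sqlalchemy.orm import composite, declarative_base

Base = declarative_base()

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    # Tells SQLAlchemy how to split the object back into column values
    def __composite_values__(self):
        return self.x, self.y

    def __eq__(self, other):
        return isinstance(other, Point) and (other.x, other.y) == (self.x, self.y)

class Vertex(Base):
    __tablename__ = "vertices"
    id = Column(Integer, primary_key=True)
    # Two plain columns back the composite; no STRUCT type is involved
    start_x = Column(Integer)
    start_y = Column(Integer)
    start = composite(Point, start_x, start_y)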

DaanRademaker avatar May 24 '24 09:05 DaanRademaker

If we want to keep the struct type as some dict or JSON type only on the SQLAlchemy side, I think the approach is more trivial, as the only thing that needs to be correctly specified is the column spec. An implementation of this would look something like this:

from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Define inner nested columns
inner_nested_columns = [
    Column("inner_nested_column1", Integer),
    Column("inner_nested_column2", Integer),
    # Add more inner nested columns here if needed
]

# Define outer nested columns with another level of nesting
outer_nested_columns = [
    Column("outer_nested_column", Integer),
    Column("inner_struct_field", StructField(inner_nested_columns)),
    # Add more outer nested columns here if needed
]

# Use StructField in your model (StructField is the type proposed here;
# it does not exist in the dialect yet)
class ExampleModel(Base):
    __tablename__ = 'example_table'
    id = Column(Integer, primary_key=True)
    data = Column(StructField(outer_nested_columns))

DaanRademaker avatar May 24 '24 09:05 DaanRademaker

@DaanRademaker

I think the approach is more trivial, as the only thing that needs to be correctly specified is the column spec

It's a bit more complicated than that, unfortunately. You're correct that implementing a STRUCT (and also ARRAY) type in SQLAlchemy is fairly trivial. But the limiting factor is deeper in the stack. Namely, Databricks compute doesn't support binding complex types as parameters when running queries. You can see this in https://github.com/databricks/databricks-sql-python/blob/main/src/databricks/sql/parameters/native.py where there is a native parameter binding object for all of the DatabricksSupportedType enum members except for STRUCT, MAP, and ARRAY (the complex types).
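A hedged sketch of what that asymmetry looks like in practice (assuming an open databricks-sql-connector cursor; the class names come from that native.py module, and the exact signatures may differ):

from databricks.sql.parameters.native import IntegerParameter, StringParameter

# Every scalar DatabricksSupportedType has a parameter class like these...
params = [
    IntegerParameter(42, name="n"),
    StringParameter("hello", name="s"),
    # ...but there is no StructParameter / ArrayParameter / MapParameter,
    # so a dict or list has no native binding at all.
]
cursor.execute("SELECT :n, :s", parameters=params)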

Without support for native complex type parameter binding in databricks-sql-connector, properly supporting these types in the SQLAlchemy dialect is a no-go. We could implement the TypeDecorator code in a couple of hours, but the moment you run it, the SQL warehouse will throw a ServerOperationError due to the unrecognised parameter type.

There is a possible workaround, however: disable native parameters when using the SQLAlchemy dialect. As you may know, databricks-sql-connector actually supports two methods of parameter binding: native and inline (read about these here). Native parameter binding is used by default. With inline parameters, the values are serialised into the query text. For example, datetime(1991, 8, 3) becomes '1991-08-03'.
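For illustration, a sketch of the two modes side by side (connection details are placeholders; use_inline_params is the connect flag that switches the connector to inline mode):

from datetime import datetime
from databricks import sql

# Native mode (the default): values travel as typed parameters, :name markers
with sql.connect(server_hostname="...", http_path="...", access_token="...") as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT :d", {"d": datetime(1991, 8, 3)})

# Inline mode: values are escaped and serialised into the query text,
# using pyformat (%(name)s) markers instead
with sql.connect(
    server_hostname="...", http_path="...", access_token="...",
    use_inline_params=True,
) as conn:
    with conn.cursor() as cursor:
        cursor.execute("SELECT %(d)s", {"d": datetime(1991, 8, 3)})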

In inline mode, complex type binding would be possible for the same reason that you can write plain-text SQL that fetches and manipulates complex types (the syntax for which is described here). To support this, we'd need to teach databricks-sql-connector how to write that syntax when serialising a dict as a STRUCT in a SQL query, which is complicated but doable. The SQL syntax is almost JSON, but not quite 😆
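To give a sense of the target, a sketch of what serialising a dict would have to produce, using Databricks SQL's complex-type constructors (a hypothetical example, not actual connector output):

# The dict {"a": 1, "tags": ["x", "y"]} looks almost like valid SQL, but
# inline serialisation would have to emit constructor calls instead:
query = "SELECT named_struct('a', 1, 'tags', array('x', 'y')) AS data"
cursor.execute(query)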

That may be the direction Databricks chooses to go here, although it's an uncomfortable one since inline parameters are a security compromise. The connector does its best to avoid SQL injection by escaping parameterised inputs client-side. But native parameters are the safest option, and they also tend to be faster to execute than their inline counterparts.

The above discussion is what primarily blocked this effort when I was working on this connector day-to-day. It wasn't a resourcing issue but rather one of server support. Last I heard, complex type parameter binding was being designed, so hope was on the horizon.

Also just to note, even though you can't bind complex types, the connector already supports SELECTing complex types. If you select a MAP field you'll get a numpy.ndarray; STRUCT and ARRAY come back as arrow types. For now, that mapping only works in one direction (Databricks → client).
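A quick sketch of that read path (assuming an open cursor; the client-side types are as described above):

cursor.execute(
    "SELECT named_struct('a', 1) AS s, array(1, 2, 3) AS arr, map('k', 'v') AS m"
)
row = cursor.fetchone()
# row[0], row[1], row[2] hold the STRUCT, ARRAY, and MAP values,
# materialised client-side per the mapping described above
print(row[0], row[1], row[2])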

susodapop avatar May 24 '24 13:05 susodapop

It's a bit more complicated than that, unfortunately.

I had a feeling that might be the case xD. Thank you for the extensive explanation. I learned a lot more about the internal workings of this package.

There is a possible workaround, however: disable native parameters when using the SQLAlchemy dialect.

Although I think the documentation says this will be removed in some future release? That would mean building on a somewhat deprecated feature from the start.

I guess the best order would perhaps be:

  1. Wait for the complex type parameter bindings to be designed? Is this on the roadmap somewhere?
  2. When these are available, add support for complex types in the SQLAlchemy dialect?

DaanRademaker avatar May 27 '24 06:05 DaanRademaker

To speak more holistically, there are three things that need to be updated to support complex types:

SQLAlchemy dialect → databricks-sql-connector → Databricks compute

the documentation says this will be removed in some future release?

Yep. The last I was aware (six months ago) this was considered a viable short-term solution, despite the forthcoming deprecation, due to the complexity of implementing native complex type binding.

Is this on the roadmap somewhere?

Your guess is as good as mine. I'd speak to someone in Databricks support so you can register your interest in this feature. I'm not sure how much this issue tracker is monitored, at present.

susodapop avatar May 28 '24 19:05 susodapop