Modify SQLCatalog initialization so that tables are not always created, and update how these tables are created to be more open to other DBs
Feature Request / Improvement
I think there are 2 issues with the SQLCatalog constructor:
https://github.com/apache/iceberg-python/blob/d587e6724685744918ecf192724437182ad01abf/pyiceberg/catalog/sql.py#L117
- The automatic creation of non-existent SQLCatalog tables is great for testing, but not something I would want in code that was being deployed and run even in a QA environment. Applying DDL should be part of a separate process. The constructor should at least take a parameter that allows you to not create the tables automatically.
- The current method used for creating the tables in _ensure_tables_exist uses exceptions, which means that the library is compatible with SQLite and Postgres simply because the except clause handles the errors these DBs throw when selecting from a non-existent table. The logic here should just check for the existence of the tables through a standard SQLAlchemy method, which would also make the SQLCatalog work with most DBs.
I have made these changes on my local branch and can submit a PR but wanted to open up conversation to other opinions.
Hi @isc-patrick - thank you for raising this. These are great points.
I'm wondering if we can keep the current approach of running _ensure_tables_exist within the instantiation of the SqlCatalog, but adopt an improved way of running the table creation DDL as you suggested in (2). The reason is that, as long as the function is idempotent (similar to create_table_if_not_exists or CREATE TABLE IF NOT EXISTS), it's simple enough to keep in the initialization of the SqlCatalog, which requires these tables in order to execute any of its member functions.
I'm of the impression that we could instead remove the additional try/except handling and just invoke SqlCatalogBaseTable.metadata.create_all(self.engine). According to the SQLAlchemy docs, this uses the default flag checkfirst=True, which avoids creating the tables if they already exist.
https://docs.sqlalchemy.org/en/20/core/metadata.html#sqlalchemy.schema.MetaData.create_all
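For concreteness, a minimal sketch of what that simplification could look like (not a committed implementation), assuming the existing SqlCatalogBaseTable declarative base and self.engine from pyiceberg/catalog/sql.py:

```python
def _ensure_tables_exist(self) -> None:
    # create_all defaults to checkfirst=True, so the catalog tables are only
    # created when missing and no per-dialect exception handling is needed.
    SqlCatalogBaseTable.metadata.create_all(self.engine)
```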
What do you think about this suggestion?
My thought was to pass an additional argument into init that allowed you to turn off creation of tables. My concern is that I don't think you will want this auto-create to happen on any system in production, as the catalog should already exist, and I really want it to fail fast if it does not exist instead of creating new tables and failing later, like when the system is live. In general, I would never have DDL applied to a DB as part of a dependent application's runtime.
So that would mean passing a new boolean into the constructor, say create_if_not_exists: bool = True, before the **properties, and then replacing the _ensure_tables_exist method with the create_all call conditioned on the new bool.
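Roughly what I have in mind (the flag name and placement are just this proposal, not a settled API, and the body is only illustrative of the current structure of sql.py; MetastoreCatalog and SqlCatalogBaseTable are the existing classes in that module):

```python
from sqlalchemy import create_engine

class SqlCatalog(MetastoreCatalog):
    def __init__(self, name: str, create_if_not_exists: bool = True, **properties: str):
        super().__init__(name, **properties)
        self.engine = create_engine(self.properties["uri"])
        if create_if_not_exists:
            # replaces the unconditional _ensure_tables_exist() call
            SqlCatalogBaseTable.metadata.create_all(self.engine)
```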
Given that the catalog is instantiated through the load_catalog entrypoint function, I think we'd actually need to introduce a catalog property as well, if we decide to introduce a new boolean flag to control this behavior. That being said, I'm still not convinced that this is necessary...
I think keeping the approach of creating the tables on catalog initialization should result in the following:
- if the tables exist, the DDL shouldn't be executed
- if the tables don't exist, they are created
- if the tables don't exist and the DDL fails to create the tables (e.g. because the DB doesn't allow creation of tables through standard DDLs), the DB should throw an error that should result in the program failing fast.
I really want it to fail fast if it does not exist instead of creating new tables and failing later, like when the system is live
I'm not sure I follow this point @isc-patrick. If we disable the creation of the tables through the DDL, we will be failing at the same point, or even later: instead of failing at the catalog initialization (for example, by failing to create the tables), we will now be failing at a later part of the code that assumes the tables exist and tries to execute one of the SQL queries for scans or commits.
Sorry to probe - could you elaborate on how you expect the proposed approach of suppressing table creation would result in the application failing fast?
I had not yet seen the change introducing load_catalog instead of using the SQLCatalog constructor, so I just updated to v0.7.0 (I was away for a couple of weeks).
My point about the creation of the tables comes from work with clients, mostly in financial services, where there is a formal separation between code that applies DDL and code that applies DML. The entire process of database change control is separate from any application lifecycle. The idea that an application using a DB for reading and writing data would be allowed to change the schema would just never be considered as a possibility. Let's just forget that and chalk it up to different approaches to the SDLC. This can easily be handled by adding a wrapper to load_catalog if necessary.
I can make the change to simplify _ensure_tables_exist to simply call create_all.
I work in financial services as well, and actually have the same separation between DDL and DML. But I agree with your point that it would be best not to generalize the practices in our industry to other potential development practices in different industries, or at different scales of company (start-ups versus big firms).
I can make the change to simplify _ensure_tables_exist to simply call create_all.
That would be amazing @isc-patrick . Maybe we can start there, and see if there's still a need to make the other proposed change. Thank you again for raising this issue and leading this discussion!
I'm not sure how much the Python API is trying to conform to the Java API, but this same feature for the same reason was requested and implemented for the JDBC catalog, https://github.com/apache/iceberg/pull/10124.
Hi @isc-patrick - thank you for putting up the PR and sharing the related Java link.
I thought about this a lot more in light of @Fokko's review comment on https://github.com/apache/iceberg-python/pull/1155 and I think introducing the flag to disable the table creation DDL on initialization (rather than removing _ensure_tables_exist as I had originally suggested) would be best to resolve this issue for now.
If, as you noted, the permission to run create_all with checkfirst=True varies between dialects, this change may result in a backward-incompatible change for some SQL dialects. If any initialization relies on CREATE TABLE permissions to run create_all, it will begin failing with this change.
The disable flag that you suggested will keep the existing behavior for all applications. The different backends that are currently not supported, because the current list of exceptions doesn't catch the specific exception thrown by the DB backend, can enable the flag to successfully initialize the catalog.
I think you were correct about removing _ensure_tables_exist. The question comes down to the best method to use for checking the existence of a table. create_all() relies on the default method in SQLAlchemy, https://docs.sqlalchemy.org/en/20/core/internals.html#sqlalchemy.engine.Dialect.has_table, which uses the DB metadata. This is also what the Java implementation is doing: https://github.com/mrcnc/iceberg/blob/928e52ae888f601e1d8d81ae2405d677db10f470/core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java#L155
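For reference, the same metadata-based check is also available directly through SQLAlchemy's inspection API; a minimal sketch, where the connection URI and the two catalog table names are just placeholders:

```python
from sqlalchemy import create_engine, inspect

# Dialect-agnostic existence check via DB metadata, the same path that
# create_all(checkfirst=True) relies on. URI and table names are placeholders.
engine = create_engine("postgresql+psycopg2://user:pass@localhost/iceberg")
inspector = inspect(engine)
tables_exist = all(
    inspector.has_table(name)
    for name in ("iceberg_tables", "iceberg_namespace_properties")
)
```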
Also, the current _ensure_tables_exist method has the following issues:
- It doesn't really check if a table exists, only whether a SELECT statement can be executed on that table. In most RBAC-based permissioning, a table can exist and the user could see that it exists because they have permissions on the DB, but they cannot SELECT from that table because SELECT is a different permission, which would result in a false negative.
- It also could fail for other reasons, i.e. any other programming or database exception that the dialect raises, and result in a false positive. This also restricts the supported databases to just the ones that throw the errors included in the except clause. This was actually the starting point for my PR, because my company's database throws a different error.
So I will add the flag init_catalog_tables and can add back the _ensure_tables_exist method if that is desired. If I add that back, the only question is which DB exceptions should be handled if the SELECT fails?
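For illustration, this is roughly how I would expect the flag to be passed as a catalog property (the property name follows the comment above; passing it as a string and the exact URI are assumptions about how the load_catalog properties get parsed):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "type": "sql",
        "uri": "postgresql+psycopg2://user:pass@localhost/iceberg",
        # skip the DDL on init; assumes the catalog tables were created out of band
        "init_catalog_tables": "false",
    },
)
```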
create_all() relies on the default method in SQLAlchemy, https://docs.sqlalchemy.org/en/20/core/internals.html#sqlalchemy.engine.Dialect.has_table, which uses the DB metadata. This is also what the Java implementation is doing: https://github.com/mrcnc/iceberg/blob/928e52ae888f601e1d8d81ae2405d677db10f470/core/src/main/java/org/apache/iceberg/jdbc/JdbcCatalog.java#L155
I see. My concern arising from the review comments was not about the correctness of this approach, but about the possibility of introducing a disruptive behavior change for existing users. But given that SQLAlchemy relies on the has_table function for checkfirst, I think the proposed change to remove _ensure_tables_exist will make the check more permissive, rather than more restrictive.
On that note, I'm convinced based on your argument that we should remove _ensure_tables_exist! 🙂
Thinking forward to how we could introduce this change - I think it would be important to add tests to validate that the change would be safe for users with read-only table permissions in PostgreSQL and SQLite (which are the two formally supported DB backends for the SQLCatalog). Unfortunately, I'm realizing that we don't actually have an integration test suite for a PostgreSQL-backed SQLCatalog.
I've raised this issue to track this, and I think it would be easier to introduce a change of this nature once we have a test suite for each of our officially supported DB backends.
So I'm thinking we could work on your suggestions in the following order:
- introduce a flag that skips table creation when enabled (backward compatible, and safe to introduce)
- add a test suite for PostgreSQL #1178
- remove _ensure_tables_exist
What do you think?
The problem I see with moving in the direction of creating integration tests for specific DBs (outside of SQLite) is that you are opening up to an enormous number of possible tests. Postgres has had dozens of releases over the past few years, https://www.postgresql.org/docs/release/, and SQLAlchemy supports 6 different Postgres drivers, https://docs.sqlalchemy.org/en/13/dialects/postgresql.html. Which combination of the hundreds of possible ones are the integration tests going to cover? How many Docker containers will need to be downloaded to run the integration tests? And all of this to prove that the most ubiquitous open-source SQL DB and Python ORM work with the 2-table SQLCatalog schema?
I would skip the integration tests with Postgres and make some statements about assumptions for SQL databases working with pyiceberg via SQLAlchemy. SQLAlchemy is a well established abstraction layer and the SQLCatalog is extremely simple.
Having followed this repo and the Slack for the last couple of weeks, I know how hard the main committers are working on this project and how much work there still is to bring this library up to speed with the other implementations. All of these tests add some amount of drag to a library: maintaining them, running them before commits, and, most of all, setting a precedent for the next set of developments, like adding support for other SQL backends.
I'm happy to help execute either way.
Hi @isc-patrick - thank you very much for taking the time to bring these points up for discussion. I think these are important questions for us to answer and take a stance on as the project grows, and it's great to have all your fantastic considerations documented here for future reference.
I would skip the integration tests with Postgres and make some statements about assumptions for SQL databases working with pyiceberg via SQLAlchemy. SQLAlchemy is a well established abstraction layer and the SQLCatalog is extremely simple.
I think this is an interesting point. Rather than expressing support for a specific DB and driver as 'supported' backends for the SqlCatalog, I think listing out the requirements of the dialect is a much more scalable way of supporting the SqlCatalog. This would make sense in the long run, especially since the Iceberg community is thinking of the REST Catalog as the strategic catalog implementation. Given that context, maybe it won't make sense for us to expend so much effort in maintaining the SqlCatalog's first-class support against multiple DB backends. However, given that our documentation already expresses formal support for postgresql+psycopg2, I think we should raise a discussion on the mailing list if we want to remove this lingo of formal support and move in this direction.
Postgres has had dozens of releases over the past few years, https://www.postgresql.org/docs/release/, and SQLAlchemy supports 6 different Postgres drivers, https://docs.sqlalchemy.org/en/13/dialects/postgresql.html. Which combination of the hundreds of possible ones are the integration tests going to cover?
This is a valid point. I think my stance on this is that we don't have to cover all combinations of the drivers in our tests, just the ones we choose to more 'formally' support. For now, postgresql+psycopg2 and sqlite are already two types of connections we formally support according to our documentation and dependency management. I think it is fair to restrict our integration tests to these backends, and only test against the latest released versions for simplicity.
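As a rough illustration of the scope I have in mind, something like a test fixture parametrized over just those two backends; the PostgreSQL URI here assumes a locally running container and is purely hypothetical:

```python
import pytest
from pyiceberg.catalog.sql import SqlCatalog

# Hypothetical shape of an integration-test fixture limited to the two
# formally supported backends; the PostgreSQL URI assumes a local container.
@pytest.fixture(params=[
    "sqlite:///:memory:",
    "postgresql+psycopg2://postgres:postgres@localhost:5432/iceberg",
])
def catalog(request, tmp_path):
    return SqlCatalog("test_sql_catalog", uri=request.param, warehouse=f"file://{tmp_path}")
```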
Sounds good. I added _ensure_tables_exist back to the code along with the new flag and the tests in the branch with the PR.
Fixed by #1155