Introduce notion of dialects in DbApiHook
This PR introduces the notion of an SQL Dialect class in the DbApiHook.
Let me first elaborate on why I took this approach. This PR is a proposition and can of course be changed where needed, but it gives an idea of how it could be implemented. The reason I wanted to introduce this is that I found the _insert_statement_format and _replace_statement_format string formatting properties in the DbApiHook to be a bit of a naïve approach. Yes, this approach is generic and doesn't tie the code to a specific database, and that genericity is exactly what I want to keep with some kind of dialect, but it is also limited: the parameters passed to the string format are hard-coded and aren't always sufficient for certain databases.
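To make the limitation concrete, here is roughly what the string-format based generation looks like (a simplified sketch based on the property names mentioned above, not the exact DbApiHook code):

# Simplified sketch of the current string-format based generation (property names
# taken from the description above; the exact defaults in DbApiHook may differ).
insert_statement_format = "INSERT INTO {} {} VALUES ({})"
replace_statement_format = "REPLACE INTO {} {} VALUES ({})"

table = "my_table"
target_fields = ["id", "name"]
placeholders = ["%s"] * len(target_fields)

sql = replace_statement_format.format(
    table, "({})".format(", ".join(target_fields)), ", ".join(placeholders)
)
# -> REPLACE INTO my_table (id, name) VALUES (%s, %s)
# Fine for MySQL-style REPLACE INTO, but e.g. MSSQL needs a MERGE statement,
# which a fixed format string with these three parameters cannot express.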
We, for example, are using MsSQL and SAP Hana a lot in our DAGs, and since we use the insert_rows method as much as possible for inserting/updating data, we liked the approach of the replace parameter, which allows you to specify whether you want to insert or replace data. That way all our insert/replace tasks in our DAGs use the same generic code independently of which database we want to write to. We also use the GenericTransfer operator across multiple databases that way, which is great. Besides that, we also implemented a custom SQLInsertRowsOperator which does the same as the GenericTransfer operator but gets the data from an XCom instead of another database; I'll also propose a PR for that once we have a good solution for the problem we are trying to solve in this PR.
The issue with the current approach for automatically generating the replace statement shows up when we want to use MsSQL. For SAP Hana it was easy: we just had to override the replace_statement_format parameter in the connection extra field and we were done. For MsSQL that was not possible, which is why I created another PR (already merged in Airflow) for the MsSqlHook, in which I override the _generate_insert_sql method of the DbApiHook to allow a more complex replace statement generation that supports the replace option.
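For reference, the SAP Hana override boiled down to something like the following (a hedged example based on the description above; the exact extra key name should be double-checked against the DbApiHook documentation):

# Hypothetical connection "extra" for a SAP Hana connection, overriding the replace
# statement format as described above (SAP HANA expresses upserts as
# UPSERT ... WITH PRIMARY KEY).
connection_extra = {
    "replace_statement_format": "UPSERT {} {} VALUES ({}) WITH PRIMARY KEY",
}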
The problem with this approach is that the functionality only works when using the MsSQL connection type (which underneath uses the pymssql library). Unfortunately, we experienced a lot of connection issues and even deadlocks (the latter is maybe unrelated) when writing data to MsSQL using the MsSqlHook. Our DBA suggested using the OdbcHook, but the OdbcHook doesn't have the enhanced _generate_insert_sql for MsSQL, as the OdbcHook is agnostic of which database you connect to. So, in order to check whether the OdbcHook would solve our connection issues with MsSQL, I temporarily patched its _generate_insert_sql method and tested it.
We did a load test, and indeed we did not experience any connection issues or deadlocks, and the performance was also a lot better compared to the MsSqlHook. With the MsSqlHook we had to specify the max_active_tis_per_dag=1 parameter to prevent concurrent tasks and avoid deadlocks; with the OdbcHook we didn't have to, but once again this could be pure coincidence given the concurrent nature of the problem.
That's why I started thinking about a more generic approach to this problem, one that works independently of which connection type you use, and about moving all the logic to the DbApiHook; hence the notion of a Dialect, which can then be specialized where needed per database. Of course we would need to think about how we want it to be implemented; here I just did a bit of a naïve implementation to show the idea. We are already using this in our Airflow environment through patched code and it works great; now we have to find a way to implement this idea (if accepted) in a good, clean and future-proof way.
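To make the idea more tangible, here is a minimal sketch of what such a dialect could look like (illustrative names only, not the final API; a real implementation would resolve primary keys from the table metadata instead of passing them in):

# Naive sketch of the dialect idea: a generic default dialect plus an MSSQL
# specialization that generates a MERGE-based upsert instead of REPLACE INTO.
from __future__ import annotations


class Dialect:
    """Default dialect: keeps the current generic INSERT/REPLACE behaviour."""

    def generate_insert_sql(self, table: str, target_fields: list[str], **kwargs) -> str:
        columns = ", ".join(target_fields)
        placeholders = ", ".join(["%s"] * len(target_fields))
        return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"

    def generate_replace_sql(self, table: str, target_fields: list[str], **kwargs) -> str:
        columns = ", ".join(target_fields)
        placeholders = ", ".join(["%s"] * len(target_fields))
        return f"REPLACE INTO {table} ({columns}) VALUES ({placeholders})"


class MsSqlDialect(Dialect):
    """MSSQL has no REPLACE INTO, so an upsert must be expressed as a MERGE."""

    def generate_replace_sql(self, table: str, target_fields: list[str], **kwargs) -> str:
        # Passed in here for illustration only; the real code would look them up.
        primary_keys: list[str] = kwargs["primary_keys"]
        columns = ", ".join(target_fields)
        placeholders = ", ".join(["%s"] * len(target_fields))
        on = " AND ".join(f"target.{pk} = source.{pk}" for pk in primary_keys)
        update_set = ", ".join(
            f"target.{c} = source.{c}" for c in target_fields if c not in primary_keys
        )
        insert_values = ", ".join(f"source.{c}" for c in target_fields)
        return (
            f"MERGE INTO {table} AS target "
            f"USING (VALUES ({placeholders})) AS source ({columns}) ON {on} "
            f"WHEN MATCHED THEN UPDATE SET {update_set} "
            f"WHEN NOT MATCHED THEN INSERT ({columns}) VALUES ({insert_values});"
        )

A DbApiHook could then pick the dialect based on the database it talks to rather than on the hook class, which is exactly what would make the behaviour identical for MsSqlHook, OdbcHook and JdbcHook connections pointing at the same database.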
@eladkal and @potiuk what do you think about this implementation? I would ideally want to use entry points to register the dialects, so that additional dialects can be loaded through different providers, but I couldn't find such a mechanism in the Airflow code base; apparently, according to ChatGPT, it is possible to do it either via a setup.py or via pyproject.toml. So if you have any suggestions on that, it would be helpful, unless you have other ideas or suggestions or don't like the implementation at all, of course. At the moment the dialects (for now there is only MSSQL) are registered in a hard-coded way in the common-sql provider to make it work; it would be nice if they were dynamically registered from the dedicated providers.
Below is the answer from ChatGPT:
When dealing with a monolithic repository like Apache Airflow, where the setup for each provider isn't handled through separate setup.py or pyproject.toml files, but rather as part of a larger project, you'll need to use a different approach to define and register entry points.
In the context of Apache Airflow, which uses a monolithic repository structure with multiple providers, you typically interact with entry points through the main project’s setup configuration. Here's how you can handle entry points within such a setup:
Steps to Define Entry Points in a Monolithic Repo like Airflow
Find the Main setup.py or pyproject.toml:
In a monolithic project like Airflow, there will be a main setup.py or pyproject.toml at the root of the repository. This is the file that defines how the entire project is built and installed.
Locate Entry Points Definition:
Look for a section in the main setup.py or pyproject.toml dedicated to entry points. This section typically aggregates all the entry points from different parts of the project. For Airflow, the entry points for plugins, executors, or other components are defined here.
Add Your Entry Point:
Modify the entry points section to include your specific entry point. This might involve editing a Python dictionary in setup.py or adding a TOML table in pyproject.toml.
Here’s how you might define an entry point in setup.py for an Airflow provider:
# In setup.py of the main Airflow project
setup(
    # other setup arguments...
    entry_points={
        'sqlalchemy.dialects': [
            'mydb = myprovider.mydialect:MyDialect',  # Your custom dialect entry point
        ],
        'airflow.plugins': [
            'myplugin = myprovider.myplugin:MyAirflowPlugin',  # Example of Airflow plugin entry point
        ],
        # other entry points...
    },
    # other setup arguments...
)
If you're using pyproject.toml:
[project.entry-points."sqlalchemy.dialects"]
mydb = "myprovider.mydialect:MyDialect"
[project.entry-points."airflow.plugins"]
myplugin = "myprovider.myplugin:MyAirflowPlugin"
We actually already use entrypoints - the provider.yaml "subset" is already exposed in providers, and various provider capabilities are available this way. They are even automatically extracted from the provider.yaml's and exported to the documentation https://airflow.apache.org/docs/apache-airflow-providers/core-extensions/index.html
You can also see a description of it in https://airflow.apache.org/docs/apache-airflow-providers/howto/create-custom-providers.html#custom-provider-packages
Generally speaking you will have to do a few things:
- Update the provider.yaml schema https://github.com/apache/airflow/blob/main/airflow/provider.yaml.schema.json to add dialects
- Update the provider_info schema https://github.com/apache/airflow/blob/main/airflow/provider_info.schema.json - this is the subset of provider.yaml that gets exposed via the provider's entrypoint as a dictionary
- Update the ProvidersManager https://github.com/apache/airflow/blob/main/airflow/providers_manager.py to discover dialects automatically and expose them via a Python API - but with care about efficiency: there is some caching and lazy loading implemented there, so you should follow what other components are doing there (a rough discovery sketch follows below)
- Add documentation generation to include dialect information in the documentation: https://github.com/apache/airflow/blob/main/docs/exts/providers_packages_ref.py + https://github.com/apache/airflow/blob/main/docs/exts/providers_extensions.py
- Update the airflow providers CLI to expose that information as well https://github.com/apache/airflow/blob/main/airflow/cli/commands/provider_command.py
- Add provider.yaml entries for all the providers that need to expose them
Once you do it all, the breeze release-management prepare-provider-packages command will automatically build and expose the dictionaries retrieved from provider.yaml via entrypoints. This happens dynamically - we are building the pyproject.toml for providers from JINJA templates while preparing the packages.
Another benefit of doing it this way is that the ProvidersManager will see if a provider is available directly in the Airflow sources: when you run breeze, where providers are just available in the airflow/providers sources and not installed as separate packages, the ProvidersManager will read the dialect information directly from provider.yaml rather than from the entrypoint, so inside breeze it will work as if the provider was installed as a package.
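A rough sketch of what the discovery step could look like, under the assumption that the new provider.yaml entry is called "dialects" and is exposed in the provider-info dictionary (the apache_airflow_provider entrypoint group is the one existing providers already use; everything else here is illustrative):

# Hedged sketch of dialect discovery through the existing provider entrypoints.
# The "dialects" key and its fields are assumptions about the new schema; the real
# ProvidersManager implementation would also add caching and lazy loading.
from importlib.metadata import entry_points


def discover_dialects() -> dict[str, str]:
    dialects: dict[str, str] = {}
    for ep in entry_points(group="apache_airflow_provider"):  # Python 3.10+ selection API
        provider_info = ep.load()()  # the entrypoint resolves to a callable returning the provider-info dict
        for dialect in provider_info.get("dialects", []):
            # e.g. {"dialect-type": "mssql", "dialect-class-name": "...MsSqlDialect"}
            dialects[dialect["dialect-type"]] = dialect["dialect-class-name"]
    return dialects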
Thank you @potiuk for the explanation, I will have a look at it and see how to implement it for the common sql provider. Another question: at the moment the MsSqlDialect class is also located within the common sql provider, just so that it works, but where would you put it?
- in another new (mssql) dialect provider (bit overkill for just a dialect I would say, unless we would create a dialects provider but still seems odd to me)?
- or in the mssql provider (but that one is actually only needed for pymssql and in fact not necessary for odbc, but seems most logical)?
I think in the mssql provider; we do have a few "loosely related" things in providers already, and cross-provider dependencies (both explicit and implicit), and sometimes when things are "in between" we make arbitrary decisions.
When we introduced providers we implemented a somewhat "soft and not 100% compilable" rule on where to put such things, and the rule is ....
"Where it would be likely maintained best by the major stakeholder of the provider - assuming there is one"
For example, the S3 -> GCS transfer operator we put in the Google provider and the GCS -> S3 one we put in the Amazon provider. The very "soft" and "political" reason for that is that Amazon will be keen on bringing data from GCS and migrating people over from Google, and Google will be keen on bringing the data from S3 to GCS.
Not very scientific, I know, and sometimes you can argue the decision, but it's the best "rule of thumb" we can have on that.
So, assuming that Microsoft (which is not happening BTW) is the key stakeholder in the mssql provider - would they want to maintain the ODBC dialect there? I certainly think so. Is there anyone who owns "ODBC" who would be interested in maintaining an MSSQL dialect? While ODBC came from Microsoft, it's a long-time "de-facto" standard and there are independent bodies that standardized the specification https://en.wikipedia.org/wiki/Open_Database_Connectivity - so I think those standards bodies would prefer that each "specific" dialect is maintained by whoever is interested in that particular dialect.
So mssql seems like a good place to put it.
Thank you @jarek for your fast and elaborate answer, what you are saying makes sense and is indeed logical. Will try to implement those changes as quickly as possible. Thanks again :)
@eladkal @potiuk
I'm getting the following error; I know what it means, but I don't understand what causes it:
Found 5 errors in providers
Error: The `airflow.providers.apache.hive.transfers.mssql_to_hive` object in transfers list in airflow/providers/apache/hive/provider.yaml does not exist or is not a module: name 'TYPE_CHECKING' is not defined
Error: The `airflow.providers.google.cloud.transfers.mssql_to_gcs` object in transfers list in airflow/providers/google/provider.yaml does not exist or is not a module: name 'TYPE_CHECKING' is not defined
Error: The `airflow.providers.google.cloud.transfers.bigquery_to_mssql` object in transfers list in airflow/providers/google/provider.yaml does not exist or is not a module: name 'TYPE_CHECKING' is not defined
Error: The `airflow.providers.microsoft.mssql.hooks.mssql.MsSqlHook` object in connection-types list in airflow/providers/microsoft/mssql/provider.yaml does not exist or is not a class: name 'TYPE_CHECKING' is not defined
Error: The `airflow.providers.microsoft.mssql.hooks.mssql` object in hooks list in airflow/providers/microsoft/mssql/provider.yaml does not exist or is not a module: name 'TYPE_CHECKING' is not defined
Error 1 returned
Changed ownership of 3 files back to 1001:127.
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/tempfile.py:830: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmperqkniqg'>
_warnings.warn(warn_message, ResourceWarning)
@potiuk As I needed to add the dialects notion to the ProvidersManager, and the common sql provider now needs the dialects from the ProvidersManager, I also had to change the minimum required Airflow version to the current one in the common sql provider.yaml. Currently the version of the common sql provider is 1.17.1; I suppose this now has to change to 1.18.0? If so, how do I do this for one provider? Is there a breeze command for that? This change would also mean that all SQL-dependent providers would require Airflow 3.0.
Currently the version of the common sql provider is 1.17.1, I suppose this now will have to change to 1.18.0? If so how do I do this for one provider?
This can be done by adding >= in provider.yaml in dependencies (apache-airflow-providers-common-sql>=1.18.0) - that's all it needs.
This change would also mean that all sql dependant providers would required Airflow 3.0
That's probably out of the question for now. We don't really want that. This should be turned into an optional feature of the provider and should also work for Airflow 2 in a backwards-compatible way (until the min-airflow-version for the providers is set to 3.0 - which is no sooner than, say, 12 months after 3.0 is out most likely - we have not decided yet for how long we will be supporting providers for Airflow 2).
This is all possible. Provider-info is extendable - so if a provider exposes new entries in provider-info, they will be ignored by Airflow 2's providers manager. So basically we need to make sure that the sql providers (and the common.sql provider) installed on Airflow 2 will continue to work. This should actually even be automatically tested by our "Provider compatibility checks" - which work in the way that they build provider packages, install them for Airflow 2.8, 2.9, 2.10 (and 2.11 in the future) and run the test suite from main to check that they continue to work. The test suite will need to be written in a way that handles those tests - so for example all "dialect discovery" tests should be skipped when Airflow 2 is used - this is all described in https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#compatibility-provider-unit-tests-against-older-airflow-releases
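For instance, a dialect-discovery test could be guarded like this (a sketch only; AIRFLOW_V_3_0_PLUS is assumed to exist alongside the AIRFLOW_V_2_* constants in tests_common.test_utils.compat that are already used elsewhere in this thread):

# Hedged sketch of skipping dialect-discovery tests on Airflow 2, following the
# compatibility-test pattern linked above.
import pytest

from tests_common.test_utils.compat import AIRFLOW_V_3_0_PLUS  # assumed constant


@pytest.mark.skipif(not AIRFLOW_V_3_0_PLUS, reason="Dialect discovery requires Airflow 3")
def test_dialects_are_exposed_by_providers_manager():
    ...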
Hello @potiuk, ok, I've updated the code so it has a fallback mechanism if Airflow isn't 3.0.0 yet. I also added a test which will start to fail once min airflow version for common sql provider is 3.0.0 so we don't forget to remove the code for backward compatibility.
Currently the version of the common sql provider is 1.17.1, I suppose this now will have to change to 1.18.0? If so how do I do this for one provider?
This can be done by adding >= in provider.yaml in dependencies (apache-airflow-providers-common-sql>=1.18.0) - that's all it needs.
What I meant here was how do I update the apache-airflow-providers-common-sql version to 1.18.0?
I have something weird going on with the MySQL integration tests; this PR hasn't changed anything (directly at least) related to MySQL:
Warnings saved into /files/warnings-operators-mysql.txt file.
=========================== short test summary info ============================
FAILED tests/operators/test_generic_transfer.py::TestMySql::test_mysql_to_mysql[mysqlclient] - MySQLdb.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'schema, login, password, port, is_encrypted, is_extra_encrypted, extra) VALUES (' at line 1")
FAILED tests/operators/test_generic_transfer.py::TestMySql::test_mysql_to_mysql[mysql-connector-python] - MySQLdb.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'schema, login, password, port, is_encrypted, is_extra_encrypted, extra) VALUES (' at line 1")
I've enabled logging of generated insert SQL statement used by the insert_rows method in GenericTransfer:
INFO [airflow.task.hooks.airflow.providers.mysql.hooks.mysql.MySqlHook] Generated sql: INSERT INTO test_mysql_to_mysql (id, conn_id, conn_type, description, host, schema, login, password, port, is_encrypted, is_extra_encrypted, extra) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
ERROR [airflow.task] Task failed with exception
Traceback (most recent call last):
File "/opt/airflow/airflow/models/taskinstance.py", line 761, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
File "/opt/airflow/airflow/models/taskinstance.py", line 727, in _execute_callable
return ExecutionCallableRunner(
File "/opt/airflow/airflow/utils/operator_helpers.py", line 258, in run
return self.func(*args, **kwargs)
File "/opt/airflow/airflow/models/baseoperator.py", line 407, in wrapper
return func(self, *args, **kwargs)
File "/opt/airflow/airflow/operators/generic_transfer.py", line 108, in execute
insert_rows(table=self.destination_table, rows=results, **self.insert_args)
File "/opt/airflow/airflow/providers/common/sql/hooks/sql.py", line 696, in insert_rows
cur.execute(sql, values)
File "/usr/local/lib/python3.9/site-packages/MySQLdb/cursors.py", line 179, in execute
res = self._query(mogrified_query)
File "/usr/local/lib/python3.9/site-packages/MySQLdb/cursors.py", line 330, in _query
db.query(q)
File "/usr/local/lib/python3.9/site-packages/MySQLdb/connections.py", line 261, in query
_mysql.connection.query(self, query)
MySQLdb.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'schema, login, password, port, is_encrypted, is_extra_encrypted, extra) VALUES (' at line 1")
INFO [airflow.models.taskinstance] Marking task as FAILED. dag_id=unit_test_dag, task_id=test_m2m, run_id=manual__2015-01-01T00:00:00+00:00, execution_date=20150101T000000, start_date=20241009T190454, end_date=20241009T190454
The error makes sense, as indeed "schema" is a reserved word and should be escaped, but when looking in the MySqlHook I didn't find any code which escapes column names in case they are reserved words. So how come this suddenly starts failing?
Found the issue.
@eladkal @potiuk I think the dialects are ready, what do you guys think?
A new feature within this PR is that the developer doesn't need to specify the target_fields anymore; those can be resolved automatically through the inspector of the SQLAlchemy engine, or through the specific dialect if it exists. This feature is disabled by default so that we don't impact the current behaviour in Airflow; it is controlled through the "core.dbapihook_resolve_target_fields" configuration parameter I added. This feature is handy, as you need to specify the target_fields when you want to use the replace (e.g. upsert) functionality in MSSQL and Postgres. To avoid having to pass all target_fields manually, which can be a pain when you have a table with lots of columns, the DbApiHook can do it for you: it will try to resolve all target_fields which are insertable (not an identity field, autoincrement field or primary key).
Also, when generating the INSERT/UPSERT SQL, the dialect will check whether a column name is a reserved word and escape it for the corresponding dialect; the escape format can be changed through the "_escape_column_name_format" property of the DbApiHook.
We have tested those functionalities on our patched Airflow infra for MySql, MsSql, Postgres and SAP Hana.
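A rough illustration of both features using plain SQLAlchemy APIs (the actual hook code differs; the table name and connection URL below are placeholders, and the identity/autoincrement filtering is simplified):

# Hedged sketch: resolve insertable target_fields via the SQLAlchemy inspector and
# escape reserved column names for the engine's dialect.
from sqlalchemy import create_engine, inspect

engine = create_engine("mssql+pyodbc://user:pass@my_dsn")  # placeholder URL
inspector = inspect(engine)

pk_columns = set(inspector.get_pk_constraint("my_table")["constrained_columns"])
target_fields = [
    col["name"]
    for col in inspector.get_columns("my_table")
    if col["name"] not in pk_columns
    and not col.get("autoincrement", False)
    and not col.get("identity")
]

# SQLAlchemy's identifier preparer knows the reserved words of the dialect and only
# quotes names that need it - conceptually what _escape_column_name_format is for.
preparer = engine.dialect.identifier_preparer
escaped_fields = [preparer.quote(name) for name in target_fields]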
That's a big one. There are two comments here:
- We are rethinking packaging/ providers/plugins APIs and very likely we are going to have several different "types" of Airflow extensions - separate secret backends, separate "operators/hooks providers", separate log handlers, separate UI plugins. And while this one looks like classic "operator/hooks type provider" I think we should decide what to do before we merge that one (we are going to discuss it today at the dev call and discussion will start shortly in devlist I think)
cc: @kaxil @ashb @vikramkoka
- I think that one is good as a "base" PR, but once this one is green and we more or less know that the common part is stable, you should split that one into first adding the functionality to the providers manager and other common parts (and maybe even split that one), followed by a separate change for every provider. That would make it WAY easier to review and is relatively easy to do - this is my favourite way to do such changes - because once you get the common part merged, you rebase the "big" PR and it gets smaller - and it gets smaller and smaller with every provider added later until it disappears.
This helps with fast and effective reviews of such PRs.
Hello Jarek, how do you see the second point? Shall I start a new PR which only contains main changes to support dialects in common part?
Hello Jarek, how do you see the second point? Shall I start a new PR which only contains main changes to support dialects in common part?
Yep
Also - for reference see this PR from @jscheffl :
Original PR: https://github.com/apache/airflow/pull/41729
It has been closed when all the "split" PRs have been merged.
Resulting PRs:
- AIP-69: Airflow Core adjustments for introduction of Edge Executor: https://github.com/apache/airflow/pull/41730
- AIP-69: Breeze adjustments for introduction of Edge Executor: https://github.com/apache/airflow/pull/41731
- AIP-69: Adding Empty Edge Provider Package: https://github.com/apache/airflow/pull/42046
- AIP-69: Adding Edge Provider DB Models: https://github.com/apache/airflow/pull/42047
- AIP-69: Add Executor to Edge Provider: https://github.com/apache/airflow/pull/42048
- AIP-69: Add API and Plugin to Edge Provider: https://github.com/apache/airflow/pull/42049
- AIP-69: Add CLI to Edge Provider: https://github.com/apache/airflow/pull/42050
- AIP-69: Add leftover glue of all pieces to Edge Provider: https://github.com/apache/airflow/pull/42051
- AIP-69: Breeze adjustments for introduction of Edge Executor (#41731): https://github.com/apache/airflow/pull/43139
(Maybe I do not have all of that right - but this is more or less the idea @dabla )
Thanks Jarek, this is indeed a nice split and a good example. I was thinking of first creating a PR defining the dialect in the providers manager and defining the types, but without actually implementing them. Anyway, I will try to split it up into as many small PRs as possible.
I now have an error regarding an invalid import, although other test cases import the same name without any issues:
________ ERROR collecting providers/tests/common/sql/hooks/test_sql.py _________
ImportError while importing test module '/opt/airflow/providers/tests/common/sql/hooks/test_sql.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/local/lib/python3.9/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
providers/tests/common/sql/hooks/test_sql.py:37: in <module>
from tests_common.test_utils.compat import AIRFLOW_V_2_8_PLUS
E ImportError: cannot import name 'AIRFLOW_V_2_8_PLUS' from 'tests_common.test_utils.compat' (/opt/airflow/tests_common/test_utils/compat.py)
Needs rebase now.
Needs rebase now.
Do you think an additional PR will be needed once 43747 is merged? Because then only the dialects and the adapted corresponding hooks will be left here. We could split it up per dialect implementation (default/mssql/postgres/...).
@potiuk Regarding the backward compatibility, I was wondering whether maybe we should ditch it here and only support this for Airflow 3.x. The dialects will work anyway for native hooks like mssql, mysql and postgres, but they wouldn't for odbc/jdbc unless you use Airflow 3.x. It would mean everything works as it worked before, unless you want dialect support with odbc/jdbc, in which case you'll need Airflow 3.x. What do you think? That way we could remove the ugly backward-compat code for Airflow 2.x, and it would also be a good reason to migrate to Airflow 3.x. I'm referring to this code, which I personally don't like very much.
Yes, it could be an Airflow 3-only feature. As long as the providers as a whole work for Airflow 2 as well and we document it well, I agree this could be a nice feature of the provider, only enabled by migration to Airflow 3. We should have quite a few such incentives.
Yes, it could be an Airflow 3-only feature. As long as the providers as a whole work for Airflow 2 as well and we document it well, I agree this could be a nice feature of the provider, only enabled by migration to Airflow 3. We should have quite a few such incentives.
I was wrong about the last part; it can stay as is. I thought the resolve_dialect methods directly returned the dialect, but they don't; they return a DialectInfo dataclass (essentially a dict). So I think it's ok as it is, if it's ok for you.
This is really nice - finally had a chance to review it. @dabla - can you rebase it please :)
This is really nice - finally had a chance to review it. @dabla - can you rebase it please :)
Thanks @potiuk just rebased it ;) and have a good new year!!!
Well. We have the saying ... what happens for you on New Year's Eve - you will do the whole year :)
Happy New Year
@dabla it would be great to add docs to explain what dialects are, what problem they solve and how to use them. Most of this information is already mentioned in the PR, so it just needs to be added to the common.sql docs.
@dabla it would be great to add docs to explain what dialects are, what problem they solve and how to use them. Most of this information is already mentioned in the PR, so it just needs to be added to the common.sql docs.
Good point!
@dabla it would be great to add docs to explain what dialects are, what problem they solve and how to use them. Most of this information is already mentioned in the PR, so it just needs to be added to the common.sql docs.
Yeah, I will create a new PR for that; I also need to explain the new Airflow config parameter (disabled by default) which automatically fills the target_fields if not specified.
Could one of you guys also check my question in the other PR regarding the GenericTransfer operator?
Could one of you guys also check my question in the other PR regarding the GenericTransfer operator?
Can you rebase it and mention us there :) ? Then it will be bumped on top of my list :)