Status of testing Providers that were prepared on August 10, 2022
Body
I have a kind request for all the contributors to the latest provider packages release. Could you please help us to test the RC versions of the providers?
Let us know in the comments whether the issue is addressed.
These are the providers that require testing, as some substantial changes were introduced:
Provider amazon: 5.0.0rc3
- [ ] Avoid requirement that AWS Secret Manager JSON values be urlencoded. (#25432): @dwreeves
- [ ] Remove Amazon deprecated modules (#25543): @vincbeck
- [ ] Resolve Amazon Hook's `region_name` and `config` in wrapper (#25336): @Taragolis
- [ ] Resolve and validate AWS Connection parameters in wrapper (#25256): @Taragolis
- [ ] Standardize AwsLambda (#25100): @eladkal
- [ ] Refactor monolithic ECS Operator into Operators, Sensors, and a Hook (#25413): @ferruzzi
- [ ] Remove deprecated modules from Amazon provider package (#25609): @vincbeck
- [ ] Add EMR Serverless Operators and Hooks (#25324): @syedahsn
- [ ] Hide unused fields for Amazon Web Services connection (#25416): @Taragolis
- [ ] Enable Auto-incrementing Transform job name in SageMakerTransformOperator (#25263): @celeriev
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
- [ ] SQSPublishOperator should allow sending messages to a FIFO Queue (#25171): @dbarrundiag
- [ ] Glue Job Driver logging (#25142): @ferruzzi
- [ ] Bump typing-extensions and mypy for ParamSpec (#25088): @uranusjr
- [ ] Enable multiple query execution in RedshiftDataOperator (#25619): @pankajastro
- [ ] Fix S3Hook transfer config arguments validation (#25544): @Taragolis
- [ ] Fix BatchOperator links on wait_for_completion = True (#25228): @Taragolis
- [ ] SqlToS3Operator: change column type from object to str in dataframe (#25083): @pastanton
- [x] Deprecate usage of `extra['host']` in AWS's connection (#25494): @gmcrocetti
- [ ] Get boto3.session.Session by appropriate method (#25569): @Taragolis
Provider apache.drill: 2.2.0rc3
Provider apache.druid: 3.2.0rc3
Provider apache.hdfs: 3.1.0rc3
- [ ] Adding Authentication to webhdfs sensor (#25110): @ankurbajaj9
Provider apache.hive: 4.0.0rc3
- [ ] Deprecate hql parameters and synchronize DBApiHook method APIs (#25299): @potiuk
- [x] Remove Smart Sensors (#25507): @ashb
Provider apache.livy: 3.1.0rc3
- [ ] Add auth_type to LivyHook (#25183): @bdsoha
Provider apache.pinot: 3.2.0rc3
- [ ] Deprecate hql parameters and synchronize DBApiHook method APIs (#25299): @potiuk
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider cncf.kubernetes: 4.3.0rc3
- [ ] Improve taskflow type hints with ParamSpec (#25173): @uranusjr
- [ ] Fix xcom_sidecar stuck problem (#24993): @MaksYermak
Provider common.sql: 1.1.0rc3
- [ ] Improve taskflow type hints with ParamSpec (#25173): @uranusjr
- [ ] Move all "old" SQL operators to common.sql providers (#25350): @potiuk
- [ ] Deprecate hql parameters and synchronize DBApiHook method APIs (#25299): @potiuk
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
- [ ] Common SQLCheckOperators Various Functionality Update (#25164): @denimalpaca
- [ ] Allow Legacy SqlSensor to use the common.sql providers (#25293): @potiuk
- [ ] Fix common sql DbApiHook fetch_all_handler (#25430): @FanatoniQ
- [ ] Align Common SQL provider logo location (#25538): @josh-fell
Provider databricks: 3.2.0rc3
- [ ] Databricks: update user-agent string (#25578): @alexott
- [ ] More improvements in the Databricks operators (#25260): @alexott
- [ ] Improved telemetry for Databricks provider (#25115): @alexott
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
- [ ] Databricks: fix test_connection implementation (#25114): @alexott
- [ ] Do not convert boolean values to string in deep_string_coerce function (#25394): @jgr-trackunit
- [ ] Correctly handle output of the failed tasks (#25427): @alexott
Provider dbt.cloud: 2.1.0rc3
- [ ] Improve taskflow type hints with ParamSpec (#25173): @uranusjr
Provider elasticsearch: 4.2.0rc3
- [ ] Improve ElasticsearchTaskHandler (#21942): @millin
Provider exasol: 4.0.0rc3
- [ ] Deprecate hql parameters and synchronize DBApiHook method APIs (#25299): @potiuk
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider google: 8.3.0rc3
- [ ] add description method in BigQueryCursor class (#25366): @sophiely
- [ ] Add project_id as a templated variable in two BQ operators (#24768): @leahecole
- [ ] Remove Amazon deprecated modules (#25543): @vincbeck
- [ ] Move all "old" SQL operators to common.sql providers (#25350): @potiuk
- [ ] Improve taskflow type hints with ParamSpec (#25173): @uranusjr
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
- [ ] Bump typing-extensions and mypy for ParamSpec (#25088): @uranusjr
- [ ] Deprecate hql parameters and synchronize DBApiHook method APIs (#25299): @potiuk
- [ ] New Operators for the Google Cloud Dataform service (#25587): @lwyszomi
- [ ] Fix an issue on document (#25614): @Corea
- [ ] Fix BigQueryInsertJobOperator cancel_on_kill (#25342): @lidalei
- [ ] Fix BaseSQLToGCSOperator approx_max_file_size_bytes (#25469): @dclandau
- [ ] Fix PostgresToGCSOperator bool dtype (#25475): @dclandau
- [ ] Fix Vertex AI Custom Job training issue (#25367): @MaksYermak
- [ ] Fix Flask Login user setting for Flask 2.2 and Flask-Login 0.6.2 (#25318): @potiuk
Provider hashicorp: 3.1.0rc3
Provider jdbc: 3.2.0rc3
- [ ] Fixing JdbcOperator non-SELECT statement run (#25412): @kazanzhy
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider microsoft.azure: 4.2.0rc3
- [ ] Add `test_connection` method to AzureContainerInstanceHook (#25362): @phanikumv
- [ ] Add test_connection to Azure Batch hook (#25235): @phanikumv
- [ ] Bump typing-extensions and mypy for ParamSpec (#25088): @uranusjr
- [ ] Implement Azure Service Bus (Update and Receive) Subscription Operator (#25029): @bharanidharan14
- [ ] Set default wasb Azure http logging level to warning; fixes #16224 (#18896): @havocbane
Provider microsoft.mssql: 3.2.0rc3
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
- [ ] Fix MsSqlHook get_uri, adding pymssql driver to scheme (25092) (#25185): @FanatoniQ
Provider mysql: 3.2.0rc3
Provider neo4j: 3.1.0rc3
- [ ] Add documentation for July 2022 Provider's release (#25030): @potiuk
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider odbc: 3.1.1rc3
- [ ] Fix odbc hook sqlalchemy_scheme docstring (#25421): @FanatoniQ
Provider oracle: 3.3.0rc3
Provider postgres: 5.2.0rc3
- [ ] Use only public AwsHook's methods during IAM authorization (#25424): @Taragolis
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider presto: 4.0.0rc3
- [ ] Remove `PrestoToSlackOperator` (#25425): @eladkal
- [ ] Deprecate hql parameters and synchronize DBApiHook method APIs (#25299): @potiuk
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider qubole: 3.2.0rc3
- [ ] Make extra link work in UI (#25500): @uranusjr
- [ ] Move all "old" SQL operators to common.sql providers (#25350): @potiuk
- [ ] Improve taskflow type hints with ParamSpec (#25173): @uranusjr
- [ ] Correctly render `results_parser_callable` parameter in Qubole operator docs (#25514): @josh-fell
Provider salesforce: 5.1.0rc3
- [ ] Improve taskflow type hints with ParamSpec (#25173): @uranusjr
Provider snowflake: 3.2.0rc3
- [ ] Move all "old" SQL operators to common.sql providers (#25350): @potiuk
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider sqlite: 3.2.0rc3
Provider trino: 4.0.0rc3
- [ ] Deprecate hql parameters and synchronize DBApiHook method APIs (#25299): @potiuk
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider vertica: 3.2.0rc3
- [ ] Optimize log when using VerticaOperator (#25566): @sudohainguyen
- [ ] Unify DbApiHook.run() method with the methods which override it (#23971): @kazanzhy
Provider yandex: 3.1.0rc3
The guidelines on how to test providers can be found in Verify providers by contributors.
Committer
- [X] I acknowledge that I am a maintainer/committer of the Apache Airflow project.
All good regarding extra['host'] deprecation for amazon
Tested the below two and they work fine with microsoft.azure: 4.2.0rc3
https://github.com/apache/airflow/pull/25235: @phanikumv

https://github.com/apache/airflow/pull/25362: @phanikumv

Tested both Azure Service Bus (Update and Receive) Subscription Operators, working fine 👍
https://github.com/apache/airflow/pull/25029: @bharanidharan14
Hi!
I've found an issue with the Databricks provider; at first glance it looks like it's related to: https://github.com/apache/airflow/pull/25115
@alexott I think it might be interesting to you.
More info below:
```
[2022-08-11, 11:10:43 UTC] {{standard_task_runner.py:91}} ERROR - Failed to execute job 34817 for task xxxxx
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
    args.func(args, dag=self.dag)
  File "/usr/local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 48, in command
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 92, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 292, in task_run
    _run_task_by_selected_method(args, dag, ti)
  File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 107, in _run_task_by_selected_method
    _run_raw_task(args, ti)
  File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 184, in _run_raw_task
    error_file=args.error_file,
  File "/usr/local/lib/python3.7/site-packages/airflow/utils/session.py", line 70, in wrapper
    return func(*args, session=session, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
    self._execute_task_with_callbacks(context)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
    result = self._execute_task(context, self.task)
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1514, in _execute_task
    result = execute_callable(context=context)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/databricks/operators/databricks.py", line 374, in execute
    self.run_id = self._hook.submit_run(self.json)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/databricks/hooks/databricks.py", line 152, in submit_run
    response = self._do_api_call(SUBMIT_RUN_ENDPOINT, json)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/databricks/hooks/databricks_base.py", line 493, in _do_api_call
    headers = {**self.user_agent_header, **aad_headers}
  File "/usr/local/lib/python3.7/site-packages/cached_property.py", line 36, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/databricks/hooks/databricks_base.py", line 136, in user_agent_header
    return {'user-agent': self.user_agent_value}
  File "/usr/local/lib/python3.7/site-packages/cached_property.py", line 36, in __get__
    value = obj.__dict__[self.func.__name__] = self.func(obj)
  File "/usr/local/airflow/.local/lib/python3.7/site-packages/airflow/providers/databricks/hooks/databricks_base.py", line 144, in user_agent_value
    if provider.is_source:
AttributeError: 'ProviderInfo' object has no attribute 'is_source'
```
I ran it on MWAA == 2.2.2 with below configuration:
```python
new_cluster = {
    "autoscale": {"min_workers": 1, "max_workers": 2},
    "cluster_name": "",
    "spark_version": get_spark_version(),
    "spark_conf": Variable.get("SPARK_CONF", deserialize_json=True, default_var="{}"),
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "instance_profile_arn": Variable.get("E2_INSTANCE_PROFILE_ARN", default_var=""),
        "spot_bid_price_percent": 100,
    },
    "enable_elastic_disk": True,
    "node_type_id": "r5a.xlarge",
    "ssh_public_keys": [],
    "custom_tags": {"Application": "databricks", "env": env, "AnalyticsTask": "task name"},
    "spark_env_vars": {},
    "cluster_source": "JOB",
    "init_scripts": [],
}

with DAG(
    dag_id="dag id",
    description="desc",
    default_args=default_args,
    schedule_interval="0 2 * * *",  # Every night at 02:00
    catchup=False,
    max_active_runs=1,
    concurrency=1,
    is_paused_upon_creation=dag_is_paused_upon_creation,
) as dag:
    task = DatabricksSubmitRunOperator(
        task_id="task-name",
        databricks_conn_id="connection-name",
        new_cluster=new_cluster,
        notebook_task="notebook task",
        timeout_seconds=3600 * 4,  # 4 hours
        polling_period_seconds=30,
        retries=1,
    )
```
Tell me if you need more detail.
@jgr-trackunit oh, this field was introduced in 2.3.0 :-( I think that I need to fix it before releasing
@potiuk unfortunately, the Databricks provider became incompatible with 2.2. I'm preparing a fix for it, but it will be a separate release. Sorry for adding more work for you :-(
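For context, the failing line checks `provider.is_source`, which `ProviderInfo` only gained in Airflow 2.3, so on 2.2 (e.g. MWAA) it raises `AttributeError`. A backwards-compatible guard would look roughly like this - just a sketch, not necessarily what the actual fix will do:

```python
def _is_source_provider(provider) -> bool:
    # ProviderInfo.is_source does not exist on Airflow 2.2, so fall back to
    # False instead of raising AttributeError when building the user-agent.
    return getattr(provider, "is_source", False)
```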
No problem. Good to know :). This is what testing is about @alexott :).
Thanks @jgr-trackunit for spotting it.
@potiuk I am not sure if this is the right place to put it, but here is the deal:
https://github.com/apache/airflow/pull/24554 added the num_batches parameter to SQSSensor in the Amazon provider.
You asked me, as the contributor of the feature, to test it for provider amazon: 4.1.0rc1 in https://github.com/apache/airflow/issues/25037#event-7006630694.
Unfortunately I did not have time to do so before the release.
Taking advantage of this 5.0.0 release of the Amazon provider, I tested the feature.
What has been tested
Given SQSSensor without `num_batches` and the same sensor with `num_batches`, ensure that:
- Both sensors can get messages normally from SQS
- SQSSensor with `num_batches` actually gets batches of messages instead of a single poke
- Version of AWS provider

Result
- The sensor:

```python
read_from_queue = SqsSensor(
    aws_conn_id="aws_sqs_test",
    task_id="read_from_queue",
    sqs_queue=sqs_queue,
)

# Retrieve multiple batches of messages from SQS.
# The SQS API only returns a maximum of 10 messages per poll.
read_from_queue_in_batch = SqsSensor(
    aws_conn_id="aws_sqs_test",
    task_id="read_from_queue_in_batch",
    sqs_queue=sqs_queue,
    # Get a maximum of 3 messages each poll
    max_messages=3,
    # Combine 3 polls before returning results
    num_batches=3,
)
```
- The result of task execution (success)

- The result of SQSSensor without `num_batches` enabled. Only a few messages are available in `xcom`

- The result of SQSSensor with `num_batches=3` and `max_messages=3`. It does get 3 x 3 = 9 messages for each execution.
Opened #25674 to fix the issue with the DB provider.
@LaPetiteSouris - thank you! This is cool to get it confirmed even now!
Found another issue with the Databricks provider - the DBSQL operator doesn't work anymore, most probably caused by #23971 - it looks like `split_sql_string` doesn't handle simple SQL queries correctly (like `select * from default.a_events limit 10`, which I'm using in tests). For this one I need more time to debug it.
> Found another issue with the Databricks provider - the DBSQL operator doesn't work anymore, most probably caused by #23971 - it looks like `split_sql_string` doesn't handle simple SQL queries correctly (like `select * from default.a_events limit 10`, which I'm using in tests). For this one I need more time to debug it.
No worries @alexott - I will anyhow have to wait with rc4 for Databricks till after this voting completes.
My concern is that this change in common-sql may affect other packages - I see it in the Drill, Exasol, Presto,
Hashicorp provider change appears to be working as expected for me.
> My concern is that this change in common-sql may affect other packages - I see it in the Drill, Exasol, Presto,
If there is a dag/operator you want to verify with Presto you can add it here and I'll check
I don't have anything to test, but I'm concerned that if it broke Databricks SQL, might it break others as well?
Webhdfs worked for me
> I don't have anything to test, but I'm concerned that if it broke Databricks SQL, might it break others as well?
@alexott Can you make a PR fixing it in Databricks so that we can see how the problem manifests? I can take a look at the others and assess whether there is potential for breaking other providers.
Yes, will do, most probably on Saturday...
Just looked it up @alexott -> I do not think it is breaking other providers (@kazanzhy to confirm).
Only the Databricks and Snowflake hooks have split_statements set to True by default (because historically they were doing the split). All the others have split_statements = False as the default, so the change is not breaking for them. The common method introduced by Dmytro might not split some statements correctly for some DBs, but that is not breaking either, as splitting is a new feature for them. The Snowflake hook actually does not use the new method at all - it continues to use its own snowflake-internal util.split_statements, so it is unaffected. The Databricks hook is the only one with a default of split_statements = True that uses this common method.
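To make that concrete, here is a rough sketch (not the actual provider code) of how the `split_statements` flag gates the new splitting path, and why only a hook that defaults it to True trips over the issue:

```python
from typing import List

import sqlparse


def split_sql_string(sql: str) -> List[str]:
    # The common.sql splitter from #23971 (quoted in full below).
    splits = sqlparse.split(sqlparse.format(sql, strip_comments=True))
    return [s.rstrip(";") for s in splits if s.endswith(";")]


def run(sql: str, split_statements: bool = False) -> None:
    # Sketch only: the real DbApiHook.run() also handles parameters, handlers,
    # autocommit, etc. The point is that splitting happens only when asked for.
    statements = split_sql_string(sql) if split_statements else [sql]
    for statement in statements:
        print(f"would execute: {statement!r}")


run("select 1")                         # -> would execute: 'select 1'
run("select 1", split_statements=True)  # -> nothing executed: the reported bug
```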
BTW, looking at the change I think the problem might be when the query contains `;` followed by whitespace and EOL. The old regexp and .strip() would remove such an "empty" statement, where the new one would likely not.
This is the method introduced:
```python
@staticmethod
def split_sql_string(sql: str) -> List[str]:
    """
    Splits string into multiple SQL expressions

    :param sql: SQL string potentially consisting of multiple expressions
    :return: list of individual expressions
    """
    splits = sqlparse.split(sqlparse.format(sql, strip_comments=True))
    statements = [s.rstrip(';') for s in splits if s.endswith(';')]
    return statements
```
Thank you for looking into it. I'll debug it. My query is just a single select without `;`, and split_sql_string returns an empty list for it.
Right - I see. I think the mistake is that it should be (@kazanzhy?):

`statements = [s.rstrip(';') if s.endswith(';') else s.strip() for s in splits if s.strip() != ""]`
That actually makes me think we should remove common.sql and all the dependent packages and release rc4 together, because indeed any query without ";" passed with "split_statements" will not work, which makes it quite problematic.
Update: added handling of whitespace that "potentially" might be returned (though this is just defensive -> sqlparse.split() should handle it, but better to be safe than sorry). Also, whether this is a bug or not depends a bit on sqlparse's behaviour.
Yep. Confirmed this looks like a bug for all SQL - probably safer to make rc4 for all of them. Thanks @alexott for being vigilant :) - @kazanzhy - will you have time to take a look, double-check my findings, and fix it before Monday?
```python
>>> sqlparse.split(sqlparse.format('select * from 1', strip_comments=True))
['select * from 1']
>>> splits = sqlparse.split(sqlparse.format('select * from 1', strip_comments=True))
>>> print(splits)
['select * from 1']
>>> [s.rstrip(';') for s in splits if s.endswith(';')]
[]
>>> [s.rstrip(';') if s.endswith(';') else s.strip() for s in splits if s.strip() != ""]
['select * from 1']
>>>
```
I tested all my changes (mostly checking if the code I moved around is there). Looking for more tests :)
@kazanzhy -> I will remove the common.sql and related providers to get RC4, and if I don't hear from you by Monday, I will attempt to fix the problem found by @alexott and prepare an RC4.
@potiuk is the PR open for it? If yes, I can test it tomorrow morning...
Tested https://github.com/apache/airflow/pull/25619 - working as expected.
I'm currently testing my implementation for the Amazon Provider package change #25432.
The one thing I have noticed so far is that the Connection object returned is missing a conn_id. Oops! I will implement a PR that adds it in + tests that it's there.
I didn't notice any other issues.
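For anyone else verifying this, here is a rough illustration of what the change means in practice - the connection fields below are made up, and this is my own sketch rather than anything from the provider docs:

```python
import json
from urllib.parse import quote

# Hypothetical connection fields stored as a JSON secret in AWS Secrets Manager.
fields = {
    "conn_type": "postgres",
    "login": "svc_user",
    "password": "p@ss word/with:odd chars",
    "host": "db.example.com",
    "port": "5432",
}

# Before 5.0.0 the string values had to be URL-encoded before storing the secret:
print(json.dumps({k: quote(v) for k, v in fields.items()}))

# With 5.0.0 the plain, un-encoded values can be stored as-is:
print(json.dumps(fields))
```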
Another thing I learned while testing: MWAA (AWS's managed Airflow service) pins the provider package to version 2.4.0. It may be worth double-checking that the documentation clarifies that not having to URL-encode is a new 5.0.0 feature. There may be a lot of overlap between people using the provider package and people using MWAA, and they may find it confusing to see the latest version of the documentation mention something that doesn't work in their environment.
> Another thing I learned while testing: MWAA (AWS's managed Airflow service) pins the provider package to version 2.4.0. It may be worth double-checking that the documentation clarifies that not having to URL-encode is a new 5.0.0 feature. There may be a lot of overlap between people using the provider package and people using MWAA, and they may find it confusing to see the latest version of the documentation mention something that doesn't work in their environment.
The nice thing is that the provider docs are nicely linked in the UI to the version that is installed. I think we also have a nice changelog describing the differences; I am not sure if we need to do more. But any docs clarifications are welcome :)