Configure HttpHook's auth_type from Connection
Hello,
This PR makes it possible to set up and parameterize the auth_type of HttpHook and HttpAsyncHook
from the Connection UI.
Concretely, this PR:
- Adds two extra settings to the Http 'Extra' Connection (see the sketch after this list):
  - The reserved `auth_type` field, to define an auth class.
  - The reserved `auth_kwargs` field, to provide a dict of extra parameters for the `auth_type` class.
- The `auth_type` is validated against a list of Auth classes, to protect against code injection.
- The list of Auth classes can be customized:
  - Via the airflow_local_settings.
  - Via the new `AIRFLOW__HTTP__EXTRA_AUTH_TYPES` config.
- The changes are applied to both HttpHook and HttpAsyncHook, via a new common mixin class.
- The UI of the Http Connection is refactored:
  - To make those Auth classes more discoverable, with a SelectField for `auth_type`.
  - To have a convenient dedicated field for `auth_kwargs`.
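As an illustration, here is a hedged sketch (not code from this PR; the connection id, host and auth class are made-up assumptions) of a Connection carrying the two new fields in its extra:

```python
# Hedged sketch: a Connection whose extra selects the auth class and its extra kwargs.
# The conn_id, host and auth class below are illustrative assumptions, not from the PR.
import json

from airflow.models import Connection

conn = Connection(
    conn_id="my_http_api",              # hypothetical connection id
    conn_type="http",
    host="https://api.example.com",
    extra=json.dumps(
        {
            # Dotted path of the auth class; it must be in the allowed list of Auth classes.
            "auth_type": "my_company.auth.BearerAuth",      # hypothetical custom class
            # Extra keyword arguments passed to that class on instantiation.
            "auth_kwargs": {"token": "t0ken-value"},
        }
    ),
)
```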
Side effect of the UI changes: Until now, the Extra field was used to pass params to the Headers (anything in the Extra was passed to the Headers). But now, `auth_kwargs` and `auth_type` are also written there, which I don't find very convenient. Furthermore, this PR adds logic to exclude those keys from the Headers (IMO this starts to be a bit of tech debt). And finally, users cannot pass a header named like those keys (it is unlikely, but it could happen).
Thus, I propose to deprecate headers passed directly in the 'Extra' field, and to pass them via a dedicated "Headers" field.
UI:
Side effect:
I also tried to add a CodeMirrorField (for Headers and Auth kwargs), and a CollapsibleField (to hide Extra), but it was a bit too much compared to the initial goal of this PR. Maybe in a future one.
Use-case:
The `auth_type` is typically a subclass of `requests.auth.AuthBase`. Many custom Auth classes exist for many different protocols. Sometimes, passing only the two hard-coded `conn.username` and `conn.password` is not enough: the Auth class expects more than two arguments.
Examples (see the sketch after this list):
- OAuth2 from requests-oauthlib, which can take a token as third argument
- HTTPKerberosAuth from requests-kerberos, which can take much more than two arguments
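As a hedged illustration of the second example (not code from this PR), an auth class like HTTPKerberosAuth takes keyword arguments that simply cannot be expressed as a login/password pair:

```python
# Illustration only: an Auth class whose configuration goes beyond username/password.
# Assumes the third-party requests-kerberos package is installed.
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

auth = HTTPKerberosAuth(
    mutual_authentication=OPTIONAL,     # extra keyword argument
    force_preemptive=True,              # extra keyword argument
    principal="airflow@EXAMPLE.COM",    # hypothetical Kerberos principal
)
```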
Right now, to deal with those cases, there are three possibilities:
- Using `functools.partial`, like mentioned in this PR, in the dag file / in the operator declaration (see the sketch after this list).
  Opinion: The dag developer should not have to care about handling the connection. They just want a working connection_id to call an endpoint (especially if they are a beginner / low-experienced dev). Furthermore, some parameters are sensitive and cannot be written in a dag.
- Writing a custom Hook, which dispatches the parameters from the Connection correctly (eventually using partial).
  Opinion: This is not okay. Other hooks are doing better. Take the ODBCHook, which allows every aspect of the connection to be parameterized without subclassing anything:
  - Defining which driver has to be used
  - Defining the connection schema for SQLAlchemy
  - Parameterizing the driver via extra driver-specific parameters
  - Parameterizing the behavior of pyodbc (`connect_kwargs`).
- Misusing the "username" and "password" fields of a Connection, to stack and pass multiple parameters in them + implementing a thin layer on top of an Auth class to re-dispatch the parameters.
  Opinion: This is definitely a bad workaround. I'm mentioning it because this PR won't entirely solve the issue, and this may (continue to) happen.
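For reference, a hedged sketch of the first workaround (the auth class and values are illustrative assumptions, not code from the linked PR):

```python
# Hedged sketch of workaround 1: binding the extra auth parameters with functools.partial
# in the dag file, then handing the resulting callable to the operator/hook as auth_type.
# The auth class and its arguments are illustrative assumptions.
from functools import partial

from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# The dag author has to know and hard-code auth details here, next to the task definition:
kerberos_auth = partial(HTTPKerberosAuth, mutual_authentication=OPTIONAL, force_preemptive=True)

# e.g. HttpOperator(..., auth_type=kerberos_auth) or HttpHook(..., auth_type=kerberos_auth)
```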
Coming to this PR, I propose to add two reserved fields: "auth_type" and "auth_kwargs", which are passed to the underlying Auth class. No breaking change. This solves most of the issues: a partial is not needed anymore, a subclass is not needed anymore, and there are fewer cases where conn.username and conn.password will be misused.
This looks good in principle (small and simple, yet powerful), but it would require more complete documentation and examples in order to be mergeable.
@potiuk What about a more aggressive PR, most likely with a breaking change? (Sorry for the big chunk of text, here is the important part:)
This PR could copy further what the ODBCHook does, to:
- Implement a way to set a custom Auth class from the Connection UI / conn.extra_json.
- Remove the two hard-coded conn.username and conn.password from class instantiation and replace them with keyword arguments, eventually customizable in conn.extra_json too.
- Implement a way to customize proxy, ssl verify, and all other extra requests parameters from the Connection UI / conn.extra_json.
Would that be okay?
I'd say I am not so thrilled by the other option, and I certainly would not comment on it unless you show the code rather than explain in words what you really mean by changing it. I am not sure you can convey in words what you want to do without actually trying it and implementing a POC where you would show the code and we could assess how "breaking" it is. Airflow is used by tens of thousands of enterprises and we cannot afford breaking changes that will break everyone's workflows when migrating. We can break a few people's workflows (this is inevitable), but not everyone's.
So before even attempting that, you should answer yourself a few questions, and decide if you want to go there at all.
- Will that make everyone's connections stop working and require them to manually modify their connections? -> certainly NOT - probably not even in Airflow 3 (which we do not even plan yet).
- Will there be a change in any of the Public APIs that you can use to access connections? -> absolutely not in Airflow 2. This is forbidden by SemVer and we cannot change the API (including the Connection Python object that is retrieved by Hooks and operators) deliberately in Airflow 2.
- Will there be any migration (automated or not) for the users? Will it handle all the cases?
- What exactly do you want to achieve and why do you think it is beneficial - i.e. is the change cost and potential problems justified by the benefit?
- How will it impact the UI and ways of defining connections via other mechanisms (env vars, secrets)?
- ...
I think most of those questions are only worth looking at in detail when there is at least a Proof-Of-Concept, where the discussion can be done over the code rather than an abstract concept of the change :) . Otherwise it will take too much time of those who review it to understand what you really want to do - having code to look at is pretty much the starting point for someone looking at a change touching this part - a part that is pretty much the "core" of Airflow and part of the Public Interface of Airflow: https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html
Thanks for the detailed answer!
I won't go for proxy settings and extra parameters in this PR. Just for info, adding a tool like a forwarder globally solves all proxy configuration issues, and disabling ssl "verify" can be done via the `CURL_CA_BUNDLE` trick instead of adding parameters to control that in the Connection.
For the rest, currently this PR:
- Allows configuring the Auth class from the Connection
- Allows passing extra parameters to instantiate the Auth class from the Connection
There is no breaking change in Airflow. It may be a breaking change for users relying on the previous logic of the property (see the code review below). But as the property was introduced recently, that won't break many users' workflows.
I just realised there is likely one big problem here - security.
While we cannot prevent it completely for some kinds of connections (this is why the Connection-editing user should be highly privileged), introducing RCE deliberately is another thing.
If I understand correctly, someone who edits the connection can decide which arbitrary class will be instantiated and executed when an HTTP connection is established via the HTTP Hook? Which - if I understand correctly - is basically a "no-go" - we removed a number of cases like that from a number of providers in the past precisely for that reason.
Is there any way we can make the UI connection "declarative" for that? For example, we could limit the list of predefined auth types we can choose from. Does it make sense at all?
Makes total sense! I added an `AIRFLOW__HTTP__EXTRA_AUTH_TYPES` parameter for the Airflow 'Deployment manager' to control which classes can be set as `auth_type`. By default, only classes from `requests.auth` and the ones added via this parameter can be used. If any other class is set up, an error message will appear in the logs when executing the task.
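For illustration, a hedged sketch of how that allow-list could be populated via the environment variable (the class below and the comma-separated value format are my assumptions, not confirmed by this thread):

```python
# Hedged sketch: allow-listing an extra Auth class for the HTTP connection.
# The variable name comes from this PR; the value format and class are assumptions.
import os

os.environ["AIRFLOW__HTTP__EXTRA_AUTH_TYPES"] = "requests_kerberos.HTTPKerberosAuth"
```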
Still WIP: I'm looking into customizing the UI to have special fields based on the extra params.
Alright, aside from a last issue with the non-DB test, this PR is ready for review. I updated the first post with current status.
Yes. Conceptually this is much better. I would need to take a closer look, but at first glance it looks good. Can you fix the last test, and I can take a look then.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Here is a screenshot of the updated HTTP (and Livy) connection form. I had to update the `connection_form.js` javascript file to enable CodeMirror not only on the extra field but also on the other possible user-defined textarea fields.
The extra field will be hidden; here I kept it visible for debugging purposes. The description textarea field also became a CodeMirror widget, but in the final version that one is skipped and stays as a regular textarea.
I think this one is finally ready
can you undraft it then?
Unfortunately I can't, as it's not my pull request; @Joffreybvn has to do it. But I'm wondering: since I made a change in the Connection class, will this have an impact on the Airflow dependency for the http provider?
It's quite a big change. Generally speaking, reviewing a 700+ line PR will take weeks (if people have time). I wonder if there is a way to split out parts of it into smaller PRs?
Usually what we do in such a case - we take bits of such a PR, make a new PR, review, merge, rebase (rinse and repeat).
It’s a bit hard to split as all modifications are linked to each other. I was already happy I was able to pull it off imho ☺️
I know, but maybe it's worth trying. There is a lot to review here and someone will have to find quite a big chunk of free time to review it. I am going to PyCon next week and for sure will have no time to review this one, and since it is a crucial HTTP hook it needs more than one person and a quality review, so I dismissed my past review and hopefully there will be someone who can review it promptly enough.
I will already create a new PR for the change I did concerning the extra_dejson in the Connection. Of course that addition in the Connection class won't be used in Airflow yet (apart from the test), but then it would already be a separate PR, which could make this one a bit simpler.
@potiuk What about this one? I've extracted 2 PRs from it; the rest of the modifications are more difficult to extract into other PRs as they are all linked...
A few providers are failing due to the get_extra_dejson method from the above PR (https://github.com/apache/airflow/pull/39811), which means those providers would be dependent on a newer Airflow version.
We should definitely avoid that and implement compatibility code for them. And possibly that's going to be rather easy to solve when we introduce the `common.compat` provider: https://lists.apache.org/thread/ydx1zn6jwojxxcgfc1x5bg59p87l8lc1 This means that you will have to introduce (and use) compatibility code in the compat provider - until the min Airflow version is set to 2.10.
Hello @potiuk, now that Airflow is at version 2.10, I presume it will be easier to finish this PR, no?
Ok, I was too fast - you meant when providers would depend on at least 2.10. I've searched through the code but there isn't a compat provider yet, is there?
Yes. There is: `providers/common/compat`
@dabla - for this one we have to make sure that the tests are "compatible" with older versions of Airflow, because that is how we test compatibility of providers with older versions of Airflow: we run the tests after installing old Airflow versions. So some assumptions that work in the `main` version of Airflow might not work. Which means that sometimes you might need to add `tests/test_utils/compat.py` code to make such a test compatible with Airflow 2.7+ and import / run some common test code from there.
Details here: https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#compatibility-provider-unit-tests-against-older-airflow-releases
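A minimal sketch of the kind of version gate this implies (the helper name `AIRFLOW_V_2_10_PLUS` and the test body are assumptions for illustration, not code from this PR):

```python
# Hedged sketch: skipping a provider test on Airflow versions that lack the new
# Connection.get_extra_dejson(nested=...) behaviour. The compat helper name below
# is assumed for illustration.
import pytest

from tests.test_utils.compat import AIRFLOW_V_2_10_PLUS  # assumed helper


@pytest.mark.skipif(not AIRFLOW_V_2_10_PLUS, reason="nested extra deserialization needs Airflow 2.10+")
def test_nested_extra_is_deserialized():
    ...
```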
Hello @potiuk, I know this is a tough one on different levels. I will check the compat part and come back to it. For the backward compat with 2.7, I did the following in the Hook itself; maybe that's the thing that needs to be changed then:
```python
import json
from contextlib import suppress
from functools import cached_property
from json import JSONDecodeError

from airflow.hooks.base import BaseHook
from airflow.models import Connection
from airflow.utils.log.secrets_masker import mask_secret


class ConnectionWithExtra(Connection):
    """
    Patched Connection class added for backward compatibility.

    Implements the `get_extra_dejson` method which was added in the Connection class in Airflow 2.10.0.
    This patched class must be removed once the http provider depends on Airflow 2.10.0 or higher.
    """

    def __init__(
        self,
        conn_id: str | None = None,
        conn_type: str | None = None,
        description: str | None = None,
        host: str | None = None,
        login: str | None = None,
        password: str | None = None,
        schema: str | None = None,
        port: int | None = None,
        extra: str | dict | None = None,
        uri: str | None = None,
    ):
        super().__init__(conn_id, conn_type, description, host, login, password, schema, port, extra, uri)

    def get_extra_dejson(self, nested: bool = False) -> dict:
        """
        Deserialize extra property to JSON.

        :param nested: Determines whether nested structures are also deserialized into JSON (default False).
        """
        extra = {}

        if self.extra:
            try:
                if nested:
                    # Deserialize the top level, then try to deserialize string values one level deeper.
                    for key, value in json.loads(self.extra).items():
                        extra[key] = value
                        if isinstance(value, str):
                            with suppress(JSONDecodeError):
                                extra[key] = json.loads(value)
                else:
                    extra = json.loads(self.extra)
            except JSONDecodeError:
                self.log.exception("Failed parsing the json for conn_id %s", self.conn_id)

            # Mask sensitive keys from this list
            mask_secret(extra)

        return extra


# Then in HttpHook
@cached_property
def connection(self) -> Connection:
    """
    Return a cached connection property.

    This method calls the original get_connection method from the BaseHook and returns a patched version
    of the Connection class which also has the `get_extra_dejson` method that has been added in Apache
    Airflow since 2.10.0. Once the provider depends on Airflow version 2.10.0 or higher, this method and
    the ConnectionWithExtra class can be removed.
    """
    conn = BaseHook.get_connection(conn_id=self.http_conn_id)
    return ConnectionWithExtra(
        conn_id=self.http_conn_id,
        conn_type=conn.conn_type,
        description=conn.description,
        host=conn.host,
        login=conn.login,
        password=conn.password,
        schema=conn.schema,
        port=conn.port,
        extra=conn.extra,
    )
```
I've also added a test on it to check when it has to be removed, once the http provider depends on Airflow 2.9.3 or higher.
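For clarity, a quick usage sketch of the `ConnectionWithExtra` shim above (the connection id and extra payload are illustrative assumptions, not from the PR):

```python
# Usage sketch of the shim above: nested=True additionally deserializes JSON strings
# found one level deep in the extra, which the plain deserialization does not do.
import json

nested_extra = json.dumps({"auth_kwargs": json.dumps({"force_preemptive": True})})
conn = ConnectionWithExtra(conn_id="my_http_api", conn_type="http", extra=nested_extra)

conn.get_extra_dejson()             # {'auth_kwargs': '{"force_preemptive": true}'}
conn.get_extra_dejson(nested=True)  # {'auth_kwargs': {'force_preemptive': True}}
```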