
Configure HttpHook's auth_type from Connection

Joffreybvn opened this issue 1 year ago • 30 comments

Hello,

This PR makes it possible to set up and parameterize the auth_type of HttpHook and HttpAsyncHook from the Connection UI.

Concretely, this PR:

  • Adds two extra settings in the Http 'Extra' Connection:
    • The reserved auth_type field to define an auth class
    • The reserved auth_kwargs field to provide a dict of extra parameters to the auth_type class.
  • The auth_type is validated against a list of Auth classes, to protect against code injection.
  • The list of Auth classes can be customized:
    • Via the airflow_local_settings.
    • Via the new AIRFLOW__HTTP__EXTRA_AUTH_TYPES config
  • The changes are applied to both HttpHook and HttpAsyncHook, via a new common mixin class.
  • The UI of the Http Connection is refactored:
    • To make those Auth classes more discoverable, with a SelectField for auth_type
    • To have a convenient dedicated field for auth_kwargs
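As a sketch of how such a connection could look, and how the hook might separate the reserved keys from the headers: the JSON shape and the helper below are illustrative assumptions, not the actual PR code.

```python
import json

# Hypothetical 'Extra' payload for an HTTP connection under this proposal:
# the reserved keys "auth_type" and "auth_kwargs" configure authentication,
# everything else is still forwarded as request headers.
extra = json.dumps({
    "auth_type": "requests.auth.HTTPDigestAuth",
    "auth_kwargs": {"username": "user", "password": "pass"},
    "X-Tenant-Id": "42",  # a plain key: passed through as a header
})

RESERVED_KEYS = {"auth_type", "auth_kwargs"}

def split_extra(raw: str) -> tuple[dict, dict]:
    """Separate reserved auth settings from header entries (sketch)."""
    fields = json.loads(raw)
    auth_settings = {k: v for k, v in fields.items() if k in RESERVED_KEYS}
    headers = {k: v for k, v in fields.items() if k not in RESERVED_KEYS}
    return auth_settings, headers

auth_settings, headers = split_extra(extra)
print(auth_settings["auth_type"])  # requests.auth.HTTPDigestAuth
print(headers)                     # {'X-Tenant-Id': '42'}
```

This also makes the side effect discussed below visible: the reserved keys must be filtered out before the remaining Extra entries are used as headers.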

Side effect of the UI changes: the Extra field was until now used to pass params to the Headers (anything in the Extra was passed to the Headers). But now, auth_kwargs and auth_type are also written there, which I don't find very convenient. Furthermore, this PR adds logic to exclude those keys from the Headers (IMO this starts to become a bit of tech debt). And finally, users cannot pass a header with the same name as those keys (it is unlikely, but it could happen).

Thus, I propose to deprecate header parameters passed directly in the 'Extra' field, and to pass them via a dedicated "Headers" field instead.

UI: Screenshot from 2024-01-05 09-05-03

Screenshot from 2024-01-05 09-04-59

Side effect: Screenshot from 2024-01-05 09-11-34

I also tried to add a CodeMirrorField (for Headers and Auth kwargs), and a CollapsibleField (to hide Extra), but it was a bit too much compared to the initial goal of this PR. Maybe in a future one.

Use-case:

The auth_type is typically a subclass of requests.auth.AuthBase. Many custom Auth classes exist for many different protocols. Sometimes, passing only the two hard-coded conn.username and conn.password is not enough: the Auth class expects more than two arguments.

Examples:
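As an illustration, here is a minimal custom Auth class that needs a third parameter on top of username and password — exactly the case the two hard-coded arguments cannot cover. This is a hypothetical sketch; the class and header names are invented.

```python
import requests
from requests.auth import AuthBase

class ApiKeyTenantAuth(AuthBase):
    """Hypothetical auth scheme needing three parameters, not two."""

    def __init__(self, username: str, password: str, tenant_id: str):
        self.username = username
        self.password = password
        self.tenant_id = tenant_id

    def __call__(self, r: requests.PreparedRequest) -> requests.PreparedRequest:
        # Attach the credentials and the extra tenant parameter as headers.
        r.headers["X-Api-User"] = self.username
        r.headers["X-Api-Key"] = self.password
        r.headers["X-Tenant-Id"] = self.tenant_id
        return r

# The hook only passes (username, password), so tenant_id has nowhere to come from.
req = requests.Request("GET", "https://example.com", auth=ApiKeyTenantAuth("u", "p", "t-42"))
prepared = req.prepare()
print(prepared.headers["X-Tenant-Id"])  # t-42
```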

Right now, to deal with those cases, there are three possibilities:

  • Using functools.partial, as mentioned in this PR, in the dag file / in the operator declaration.
    Opinion: The dag developer should not have to care about handling the connection. They just want a working connection_id to call an endpoint (especially a beginner / less experienced dev). Furthermore, some parameters are sensitive and cannot be written in a dag.
  • Writing a custom Hook, which dispatches the parameters from the Connection correctly (possibly using partial).
    Opinion: This is not okay. Other hooks do better. Take the OdbcHook, which allows parameterizing every aspect of the connection without subclassing anything:
    • Defining which driver has to be used
    • Defining the connection schema for SQLAlchemy
    • Parameterizing the driver via extra driver-specific parameters
    • Parameterizing the behavior of pyodbc ("connect_kwargs")
    Everything can be controlled in the Connection UI! I'd expect the HttpHook to behave similarly and let me configure everything, including the underlying authentication.
  • Misusing the "username" and "password" fields of a Connection to stack and pass multiple parameters in them, plus implementing a thin layer on top of an Auth class to re-dispatch the parameters.
    Opinion: This is definitely a bad workaround. I'm mentioning it because this PR won't entirely solve the issue, and this may (continue to) happen.
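To make the first workaround above concrete, here is a pure-Python sketch of the functools.partial approach. ThreeArgAuth is a made-up stand-in for any Auth class; the final two-argument call mimics what the hook does with conn.username and conn.password.

```python
from functools import partial

class ThreeArgAuth:
    """Stand-in for an Auth class that needs a third, extra argument."""

    def __init__(self, username, password, api_region):
        self.username = username
        self.password = password
        self.api_region = api_region

# In the DAG file, the developer must pre-bind the extra argument...
auth_type = partial(ThreeArgAuth, api_region="eu-west-1")

# ...so that the hook's hard-coded two-argument call still works:
auth = auth_type("user", "secret")
print(auth.api_region)  # eu-west-1
```

The downside is exactly the one stated above: the binding lives in the DAG file, so sensitive values would have to be written there too.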

Coming to this PR, I propose to add two reserved fields, "auth_type" and "auth_kwargs", which are passed to the underlying Auth class. No breaking change. This solves most of the issues: a partial is not needed anymore, a subclass is not needed anymore, and there are fewer cases where conn.username and conn.password will be misused.



Joffreybvn avatar Nov 12 '23 21:11 Joffreybvn

This looks good in principle (small and simple, yet powerful), but it would require more complete documentation and examples in order to be mergeable.

potiuk avatar Nov 15 '23 19:11 potiuk

@potiuk What about a more aggressive PR, most likely with a breaking change? (Sorry for the big chunk of text; here is the important part:)

This PR could copy further what the ODBCHook does, to:

  • Implement a way to set a custom Auth class from the Connection UI / conn.extra_json.
  • Remove the two hard-coded conn.username and conn.password from class instantiation and replace them with keyword arguments, possibly customizable in conn.extra_json too.
  • Implement a way to customize proxy, ssl verify, and all other extra requests parameters from the Connection UI / conn.extra_json.

Would that be okay?

Joffreybvn avatar Nov 15 '23 19:11 Joffreybvn

  • Implement a way to set a custom Auth class from the Connection UI / conn.extra_json.

I'd say I am not so thrilled by the other option, and I certainly would not comment on it unless you show the code rather than explain in words what you really mean by changing it. I am not sure you can convey in words what you want to do without actually trying it and implementing a POC, where you would show the code and we could assess how "breaking" it is. Airflow is used by tens of thousands of enterprises and we cannot afford breaking changes that will break everyone's workflows when migrating. We can break a few people's workflows (this is inevitable), but not everyone's.

So before even attempting that, you should answer yourself a few questions. And decide if you want to go there at all.

  • Will that make everyone's connection stop working and require them to manually modify their connections? -> certainly NOT - probably not even in Airflow 3 (which we do not even plan yet).

  • Will there be a change in any of the Public APIs that you can use to access connections? -> absolutely not in Airflow 2. This is forbidden by SemVer and we cannot change the API (including the Connection Python object that is retrieved by Hooks and operators) deliberately in Airflow 2.

  • Will there be any migration (automated or not) for the users? Will it handle all the cases?

  • What exactly do you want to achieve, and why do you think it is beneficial - i.e. are the change's cost and potential problems justified by the benefit?

  • How will it impact the UI and the ways of defining connections via other mechanisms (env vars, secrets)?

....

I think most of those questions are only worth looking at in detail when there is at least a Proof-of-Concept, where the discussion can happen over the code rather than over an abstract concept of the change :). Otherwise it will take too much time for those who review it to understand what you really want to do - having code to look at is pretty much the starting point for someone reviewing a proposed change touching this part - a part that is pretty much the "core" of Airflow and part of the Public Interface of Airflow: https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html

potiuk avatar Nov 15 '23 20:11 potiuk

Thanks for the detailed answer !

I won't go for proxy settings and extra parameters in this PR. Just for info, adding a tool like forwarder globally solves all proxy configuration issues, and disabling SSL "verify" can be done via the CURL_CA_BUNDLE trick instead of adding parameters to control that in the Connection.


For the rest, currently this PR:

  • Allows configuring the Auth class from the Connection
  • Allows passing extra parameters to instantiate the Auth class from the Connection

There is no breaking change in Airflow. It may be a breaking change for users relying on the previous logic of the property (see the code review below). But as the property was introduced recently, that won't break many users' workflows.

Joffreybvn avatar Nov 24 '23 00:11 Joffreybvn

I just realised there is likely one big problem here - security.

While we cannot prevent it completely for some kinds of connections (this is why the user editing connections should be highly privileged), introducing RCE deliberately is another thing.

If I understand correctly, someone who edits a connection can decide which arbitrary class will be instantiated and executed when an HTTP connection is established via the HTTP Hook? Which - if I understand correctly - is basically a "no-go" - we removed a number of cases like that from a number of providers in the past precisely for that reason.

Is there any way we can make the UI connection "declarative" for that? For example, we could limit the choice to a list of predefined auth types. Does it make sense at all?

potiuk avatar Nov 29 '23 01:11 potiuk

Makes total sense! I added an AIRFLOW__HTTP__EXTRA_AUTH_TYPES parameter for the Airflow 'Deployment manager' to control which classes can be set as "auth_type". By default, only classes from requests.auth and the ones added via this parameter can be used. If any other class is set, the following message will appear in the logs when executing the task: MicrosoftTeams-image
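The allowlist check could look roughly like this. This is an illustrative sketch, not the actual PR code: only the AIRFLOW__HTTP__EXTRA_AUTH_TYPES name comes from the comment above; the helper functions and error message are invented.

```python
import os

# Classes from requests.auth are always allowed; the deployment manager can
# extend the allowlist via AIRFLOW__HTTP__EXTRA_AUTH_TYPES (comma-separated).
DEFAULT_AUTH_TYPES = [
    "requests.auth.HTTPBasicAuth",
    "requests.auth.HTTPProxyAuth",
    "requests.auth.HTTPDigestAuth",
]

def allowed_auth_types() -> list:
    extra = os.environ.get("AIRFLOW__HTTP__EXTRA_AUTH_TYPES", "")
    return DEFAULT_AUTH_TYPES + [t.strip() for t in extra.split(",") if t.strip()]

def validate_auth_type(auth_type: str) -> str:
    """Reject any class not explicitly allowed, to avoid arbitrary code execution."""
    if auth_type not in allowed_auth_types():
        raise ValueError(
            f"auth_type {auth_type!r} is not allowed; "
            "add it to AIRFLOW__HTTP__EXTRA_AUTH_TYPES to use it."
        )
    return auth_type

os.environ["AIRFLOW__HTTP__EXTRA_AUTH_TYPES"] = "my_company.auth.CustomAuth"
print(validate_auth_type("my_company.auth.CustomAuth"))
```

Validating the dotted path against an allowlist before importing anything is what turns the free-form "instantiate any class" behavior into a declarative choice.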


Still WIP: I'm looking into customizing the UI to have special fields based on the extra params.

Joffreybvn avatar Dec 06 '23 21:12 Joffreybvn

Alright, aside from one last issue with the non-DB test, this PR is ready for review. I updated the first post with the current status.

Joffreybvn avatar Jan 09 '24 20:01 Joffreybvn

Yes. Conceptually this is much better. I would need to take a closer look, but at first glance it looks good. Can you fix the last test, and I can take a look then.

potiuk avatar Jan 14 '24 23:01 potiuk

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 07 '24 00:04 github-actions[bot]

Here is a screenshot of the updated HTTP (and Livy) connection form. I had to update connection_form.js to enable CodeMirror not only on the extra field but also on the other possible user-defined textarea fields. image The extra field will be hidden; here I kept it visible for debugging purposes. Also, the description textarea field became a CodeMirror widget, but in the final version that one is skipped and stays as a regular textarea.

dabla avatar May 07 '24 16:05 dabla

I think this one is finally ready

dabla avatar May 08 '24 19:05 dabla

I think this one is finally ready

can you undraft it then?

potiuk avatar May 09 '24 06:05 potiuk

I think this one is finally ready

can you undraft it then?

Unfortunately I can't as it's not my pull request, @Joffreybvn has to do it. But I'm wondering: since I made a change in the Connection class, will this have an impact on the Airflow dependency for the http provider?

dabla avatar May 09 '24 06:05 dabla

It's quite a big change. Generally speaking, reviewing a 700+ line PR will take weeks (if people have time). I wonder if there is a way to split out parts of it into smaller PRs?

potiuk avatar May 09 '24 12:05 potiuk

Usually what we do in such cases - we take bits of such a PR, make a new PR, review, merge, rebase (rinse and repeat).

potiuk avatar May 09 '24 12:05 potiuk

It’s a bit hard to split as all modifications are linked to each other. I was already happy I was able to pull it off imho ☺️

dabla avatar May 09 '24 16:05 dabla

It’s a bit hard to split as all modifications are linked to each other. I was already happy I was able to pull it off imho ☺️

I know, but maybe it's worth trying. There is a lot to review here, and someone will have to find quite a chunk of free time to review it. I am going to PyCon next week and for sure will have no time to review this one, and since it is a crucial HTTP hook it needs more than one person and a quality review, so I dismissed my past review and hopefully there will be someone who can review it promptly enough.

potiuk avatar May 09 '24 18:05 potiuk

It’s a bit hard to split as all modifications are linked to each other. I was already happy I was able to pull it off imho ☺️

I know, but maybe it's worth trying. There is a lot to review here, and someone will have to find quite a chunk of free time to review it. I am going to PyCon next week and for sure will have no time to review this one, and since it is a crucial HTTP hook it needs more than one person and a quality review, so I dismissed my past review and hopefully there will be someone who can review it promptly enough.

I will already create a new PR for the change I made concerning the extra_dejson in the Connection. Of course that addition in the Connection class won't be used in Airflow yet (apart from the test), but then it would already be a separate PR, which could make this one a bit simpler.

dabla avatar May 15 '24 12:05 dabla

@potiuk What about this one? I've extracted 2 PRs from it; the rest of the modifications are more difficult to extract into other PRs as they are all linked...

A few providers are failing due to the get_extra_json method from the above PR, which means those providers would depend on a newer Airflow version.

dabla avatar Jun 19 '24 13:06 dabla

A few providers are failing due to the get_extra_json method from the above https://github.com/apache/airflow/pull/39811, which means those providers would depend on a newer Airflow version.

We should definitely avoid that and implement compatibility code for them. And possibly that's going to be rather easy to solve when we introduce common.compat provider. https://lists.apache.org/thread/ydx1zn6jwojxxcgfc1x5bg59p87l8lc1 This means that you will have to introduce (and use) a compatibility code in the compat provider - until min airflow version is set to 2.10.
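Such compatibility code usually boils down to a small shim that prefers the new API when present and falls back to the old one. Below is a hypothetical sketch: the helper name and the fallback to the extra_dejson property are assumptions, not the actual common.compat provider code.

```python
def get_extra_dejson_compat(conn, nested: bool = False) -> dict:
    """Use Connection.get_extra_dejson when available (newer Airflow),
    otherwise fall back to the older extra_dejson property."""
    if hasattr(conn, "get_extra_dejson"):
        return conn.get_extra_dejson(nested=nested)
    # Older Connection objects only expose the flat property.
    return conn.extra_dejson

class OldConnection:
    """Stand-in mimicking an older Connection, for demonstration only."""
    extra_dejson = {"timeout": 30}

print(get_extra_dejson_compat(OldConnection()))  # {'timeout': 30}
```

Providers then call the shim instead of the Connection method directly, until their minimum Airflow version guarantees the method exists.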

potiuk avatar Jun 20 '24 12:06 potiuk

A few providers are failing due to the get_extra_json method from the above #39811, which means those providers would depend on a newer Airflow version.

We should definitely avoid that and implement compatibility code for them. And possibly that's going to be rather easy to solve when we introduce common.compat provider. https://lists.apache.org/thread/ydx1zn6jwojxxcgfc1x5bg59p87l8lc1 This means that you will have to introduce (and use) a compatibility code in the compat provider - until min airflow version is set to 2.10.

Hello @potiuk , now that Airflow is at version 2.10, I presume it will be easier to finish this PR no?

dabla avatar Jul 18 '24 13:07 dabla

A few providers are failing due to the get_extra_json method from the above #39811, which means those providers would depend on a newer Airflow version.

We should definitely avoid that and implement compatibility code for them. And possibly that's going to be rather easy to solve when we introduce common.compat provider. https://lists.apache.org/thread/ydx1zn6jwojxxcgfc1x5bg59p87l8lc1 This means that you will have to introduce (and use) a compatibility code in the compat provider - until min airflow version is set to 2.10.

Hello @potiuk , now that Airflow is at version 2.10, I presume it will be easier to finish this PR no?

Ok, I was too fast - you meant when providers would depend on at least 2.10. I've searched through the code but there isn't a compat provider yet, is there?

dabla avatar Jul 18 '24 13:07 dabla

A few providers are failing due to the get_extra_json method from the above #39811, which means those providers would depend on a newer Airflow version.

We should definitely avoid that and implement compatibility code for them. And possibly that's going to be rather easy to solve when we introduce common.compat provider. https://lists.apache.org/thread/ydx1zn6jwojxxcgfc1x5bg59p87l8lc1 This means that you will have to introduce (and use) a compatibility code in the compat provider - until min airflow version is set to 2.10.

Hello @potiuk , now that Airflow is at version 2.10, I presume it will be easier to finish this PR no?

Ok, I was too fast - you meant when providers would depend on at least 2.10. I've searched through the code but there isn't a compat provider yet, is there?

Yes. There is. providers/common/compat

potiuk avatar Jul 18 '24 14:07 potiuk

@dabla - for this one we have to make sure that the tests are "compatible" with older versions of Airflow - because that is how we test compatibility of providers with older versions of Airflow: we run the tests after installing old Airflow versions. So some assumptions that work in the main version of Airflow might not work. Which means that sometimes you might need to add tests/test_utils/compat.py code to make such tests compatible with Airflow 2.7+ and import / run some common test code from there.

Details here: https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#compatibility-provider-unit-tests-against-older-airflow-releases

potiuk avatar Aug 02 '24 16:08 potiuk

@dabla - for this one we have to make sure that the tests are "compatible" with older versions of Airflow - because that is how we test compatibility of providers with older versions of Airflow: we run the tests after installing old Airflow versions. So some assumptions that work in the main version of Airflow might not work. Which means that sometimes you might need to add tests/test_utils/compat.py code to make such tests compatible with Airflow 2.7+ and import / run some common test code from there.

Details here: https://github.com/apache/airflow/blob/main/contributing-docs/testing/unit_tests.rst#compatibility-provider-unit-tests-against-older-airflow-releases

Hello @potiuk, I know this is a tough one on different levels. I will check the compat part and come back to it. For the backward compat with 2.7 I did the following in the Hook itself; maybe that's the thing that needs to be changed:

from __future__ import annotations

import json
from contextlib import suppress
from functools import cached_property
from json import JSONDecodeError

from airflow.hooks.base import BaseHook
from airflow.models import Connection
from airflow.utils.log.secrets_masker import mask_secret


class ConnectionWithExtra(Connection):
    """
    Patched Connection class added for backward compatibility.

    Implements the `get_extra_dejson` method which was added to the Connection class in Airflow 2.10.0.
    This patched class must be removed once the http provider depends on Airflow 2.10.0 or higher.
    """

    def __init__(
        self,
        conn_id: str | None = None,
        conn_type: str | None = None,
        description: str | None = None,
        host: str | None = None,
        login: str | None = None,
        password: str | None = None,
        schema: str | None = None,
        port: int | None = None,
        extra: str | dict | None = None,
        uri: str | None = None,
    ):
        super().__init__(conn_id, conn_type, description, host, login, password, schema, port, extra, uri)

    def get_extra_dejson(self, nested: bool = False) -> dict:
        """
        Deserialize the extra property to JSON.

        :param nested: Determines whether nested structures are also deserialized into JSON (default False).
        """
        extra = {}

        if self.extra:
            try:
                if nested:
                    for key, value in json.loads(self.extra).items():
                        extra[key] = value
                        if isinstance(value, str):
                            with suppress(JSONDecodeError):
                                extra[key] = json.loads(value)
                else:
                    extra = json.loads(self.extra)
            except JSONDecodeError:
                self.log.exception("Failed parsing the json for conn_id %s", self.conn_id)

            # Mask sensitive keys from this dict
            mask_secret(extra)

        return extra

    # Then in HttpHook

    @cached_property
    def connection(self) -> Connection:
        """
        Return a cached connection property.

        This method calls the original get_connection method from the BaseHook and returns a patched version
        of the Connection class which also has the `get_extra_dejson` method that has been added in Apache
        Airflow since 2.10.0. Once the provider depends on Airflow version 2.10.0 or higher, this method and
        the ConnectionWithExtra class can be removed.
        """
        conn = BaseHook.get_connection(conn_id=self.http_conn_id)

        return ConnectionWithExtra(
            conn_id=self.http_conn_id,
            conn_type=conn.conn_type,
            description=conn.description,
            host=conn.host,
            login=conn.login,
            password=conn.password,
            schema=conn.schema,
            port=conn.port,
            extra=conn.extra,
        )

I've also added a test on it that flags when it has to be removed, i.e. once the http provider depends on Airflow 2.9.3 or higher.

dabla avatar Aug 03 '24 08:08 dabla