
DockerHook: obtain credentials and login to Amazon ECR

Open Taragolis opened this issue 3 years ago • 18 comments

Amazon ECR private registries do not have permanent credentials. Users are required to obtain credentials manually and renew them every 12 hours, or to use amazon-ecr-credential-helper.

This PR allows DockerHook to get credentials not only from an Airflow Connection but also through an internal BaseDockerCredentialHelper class, which obtains credentials for private registries that do not support permanent/service credentials.
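For context, obtaining these short-lived credentials boils down to a single boto3 ECR call. A minimal sketch of what such a helper could do (the class and method names below are illustrative only, not the ones from this PR):

import base64

import boto3


class EcrCredentialSketch:
    """Illustrative only: fetch short-lived docker login credentials for a private ECR registry."""

    def __init__(self, region_name=None):
        self.client = boto3.client("ecr", region_name=region_name)

    def get_credentials(self):
        # ECR authorization tokens are valid for 12 hours and come back
        # base64-encoded in the form "AWS:<password>".
        auth_data = self.client.get_authorization_token()["authorizationData"][0]
        username, password = base64.b64decode(auth_data["authorizationToken"]).decode().split(":", 1)
        registry = auth_data["proxyEndpoint"].replace("https://", "", 1)
        return username, password, registry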

Taragolis avatar Sep 05 '22 14:09 Taragolis

I think this is a good idea, but the ECRDockerCredentialHelper should be part of the Amazon provider, and likely there should be an ECRDockerOperator that uses it by default. This is very similar to what we have for example in GKEStartPodOperator, which is based on the generic KubernetesPodOperator and pretty much only does GCP-specific authentication.

I think this is a very similar case.

potiuk avatar Sep 05 '22 19:09 potiuk

@potiuk FYI: The same problem exists in Postgres hook: https://github.com/apache/airflow/blob/5b216e9480e965c7c1919cb241668beca53ab521/airflow/providers/postgres/hooks/postgres.py#L171

mik-laj avatar Sep 06 '22 00:09 mik-laj

I think this is a good idea, but the ECRDockerCredentialHelper should be part of the Amazon provider

Yeah, it could be. I put it into the docker provider to reduce the complexity of the dependencies.

and likely there should be an ECRDockerOperator that uses it by default

Why I moved this part inside the DockerHook:

  1. docker_conn_id is not only used in DockerOperator; it is also used in DockerSwarmOperator and the @task.docker task-flow decorator, which would not work with an EcrDockerOperator
  2. This is an optional parameter

Taragolis avatar Sep 06 '22 02:09 Taragolis

@potiuk FYI: The same problem exists in Postgres hook:

And AWS RDS IAM auth also exists in MySqlHook. However, I don't think it even works in the MySQL case:

https://github.com/apache/airflow/blob/5b216e9480e965c7c1919cb241668beca53ab521/airflow/providers/mysql/hooks/mysql.py#L215-L220
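For reference, both hooks build on the same boto3 call: they mint a short-lived token and use it as the database password. Roughly (the endpoint and user below are placeholders):

import boto3

# Placeholder values; in the hooks these come from the Airflow Connection.
rds_client = boto3.client("rds", region_name="eu-west-1")
token = rds_client.generate_db_auth_token(
    DBHostname="my-db.abc123.eu-west-1.rds.amazonaws.com",
    Port=3306,
    DBUsername="airflow_user",
)
# The token is then used as the password when opening the connection;
# it expires after 15 minutes.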

Taragolis avatar Sep 06 '22 03:09 Taragolis

@potiuk FYI: The same problem exists in Postgres hook:

Yeah, it could be. I put it into the docker provider to reduce the complexity of the dependencies. And AWS RDS IAM auth also exists in MySqlHook. However, I don't think it even works in the MySQL case

I think this should be changed in Postgres and MySQL too :). The approach where "all AWS" goes to AWS actually makes dependencies less, not more, complex for the user. I think we should avoid any kind of circular dependencies, and we won't avoid them if the "mysql" provider refers to the "aws" one.

Both the Postgres and MySQL implementations with local imports make the dependencies equally (or even more) complex, but they are worse because they are "hidden". If you want to reason, for example, about "which version of the AWS provider we should have in order to make Postgres work with AWS" - it is impossible to answer, because by using local imports we are hiding the fact that there is a dependency (it is there, but implicit and hidden).

I think making an explicit dependency chain AWS -> POSTGRES, MYSQL -> COMMON.SQL is the only reasonable approach, and "AWS-specific Postgres-related classes" should be moved into AWS.

  • COMMON.SQL -> should have only generic SQL
  • POSTGRES -> should add Postgres functionality on top
  • AWS -> should add AWS-specific Postgres extensions

IMHO this is our ultimate goal, even if there are cases that do not follow it yet.

potiuk avatar Sep 06 '22 10:09 potiuk

Why I moved this part inside the DockerHook:

docker_conn_id is not only used in DockerOperator; it is also used in DockerSwarmOperator and the @task.docker task-flow decorator, which would not work with an EcrDockerOperator. This is an optional parameter

It should be done in a discoverable way - the very same way that in "common.sql" we instantiate a Hook based on a "connection_id" (without knowing the type of the hook upfront). The only requirement is that it extends DBApiHook. We should have a common authentication interface implemented by the Docker operator, and some functionality in the Hook to allow the "authentication" to be discovered and used.
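For illustration, the discovery pattern I mean is roughly what Connection.get_hook() already gives us today (the connection id below is a placeholder):

from airflow.hooks.base import BaseHook

# The concrete hook class is resolved from the connection's type at runtime,
# so the caller only needs a connection id, not the hook class itself.
conn = BaseHook.get_connection("my_sql_conn")
hook = conn.get_hook()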

potiuk avatar Sep 06 '22 11:09 potiuk

I think this should be changed in Postgres and MySQL too :).

Let's open Pandora's box. What should we do with these?

No dependencies in the provider (this has existed for a couple of years): https://github.com/apache/airflow/blob/6931fbf8f7c0e3dfe96ce51ef03f2b1502baef07/airflow/providers/amazon/aws/hooks/base_aws.py#L240-L245

Local imports from the google provider: https://github.com/apache/airflow/blob/6931fbf8f7c0e3dfe96ce51ef03f2b1502baef07/airflow/providers/amazon/aws/hooks/base_aws.py#L303-L310

Those are the only ones I know about; there might be similar stuff elsewhere.

We should have a common authentication interface implemented by Docker Operator

I think we should have some common authentication interface for all Hooks that implement connection ability.

Taragolis avatar Sep 06 '22 12:09 Taragolis

Wrong box to open :). This is not part of it.

Both of the examples you mentioned @Taragolis are clearly AWS ones; they need no fix, as they are in the right place.

They are (and should be) in AWS. In both cases the "user" of the authentication (AWS) has a dependency on the functionality it uses. Having such functionality for an optional AWS <> Google dependency is perfectly fine, and it is even "blessed" in our package system: the amazon provider has a [google] extra and the google provider has an [amazon] extra. We even have an exception specifically foreseen for that - not yet heavily used, but once we split providers into separate repos I was planning to apply it consistently in the places that do not have it yet.

class AirflowOptionalProviderFeatureException(AirflowException):
    """Raise by providers when imports are missing for optional provider features."""

Those two stories are different from the one here, which is actually quite easy (even if not "technical" - this is more about looking at the landscape of our providers from the business side of things than at pure interface/API, and it is based more on the likelihood of being well maintained than anything else). The decision of where to put code that sits in-between is bound more to whether there are clear stakeholders behind the provider who mostly maintain it and are interested in having this functionality included.

This is captured in a few places already. We have the release process for providers, but we also used the same line of thought when we decided where to put transfer operators. See AIP-21 Changes in import paths - also touched upon in AIP-8 Split Provider packages for Airflow 2.0. One of the decisions we made there is that "transfer" operators are put in the "target" of the transfer when there is a clear stakeholder behind the target. And in that case circular references are unavoidable.

To sum up the line of thought in one sentence: the code should be put in the "provider" where there is a stakeholder who is most likely to maintain it.

And 'Postgres' and 'MySQL' are different. They have no clear stakeholders behind them who are interested in maintaining those providers. Even though they are commercial databases with companies behind them, they are 'commodity' APIs, and those companies are not interested in maintaining the services. Neither Postgres nor MySQL is interested in adding interfacing with AWS/GCP to their provider. But both AWS and GCP are respectively interested in allowing AWS/GCP authentication WITH the Postgres/MySQL providers they expose in their own service offerings.

This was discussed at length when AIP-21 was being debated - there are different kinds of providers. Simply speaking, SAML/GSSAPI are "protocols", there are also "databases" (like Postgres/MySQL), and then there are "services", which are a higher layer of abstraction. While "services" can use "protocols" and "databases", neither "protocols" nor "databases" should use "services" - they can only be "used" by services.

This will be much more visible and obvious when we split providers into separate repos. The ideal situation would be that people from AWS are subscribed to that single "amazon" provider repo to get notifications about all the AWS-specific changes they are interested in. And this authentication code clearly falls into that bucket.

The Google federated identity code is also a very good example here - while it is clearly GCP-related, it is the AWS people who are interested in getting it working and making sure they can connect to GCP (for example to be able to pull data from it). They will be maintaining the code, not the Google people.

potiuk avatar Sep 06 '22 13:09 potiuk

Wrong box to open :). This is not part of it.

But it does almost the same thing as PostgresHook.get_iam_token and MySqlHook.get_iam_token - granting access using the AWS IAM auth method. These methods have existed for a very long time (maybe even before providers existed).

In the case of airflow.providers.amazon.aws.hooks.base_aws.BaseSessionFactory._get_google_identity_token_loader, it uses GCP to auth into AWS. So if we apply the same logic, this method should live in the Google provider rather than the Amazon provider.

Taragolis avatar Sep 06 '22 14:09 Taragolis

In an ideal world, auth for different hooks would be pluggable. In that case a user could auth to Postgres in different ways (explicit credentials, pgpass, GSSAPI, AWS IAM, etc.); the same could also work for Docker registries (explicit credentials, ECR get-token) as well as for additional methods to auth into the AWS cloud.

Just as an example, take DataGrip or the JetBrains Database Tools and SQL plugin. As soon as I installed the third-party AWS Toolkit plugin, it enabled additional auth methods for Postgres and MySQL.

(screenshot: the additional authentication options in DataGrip)

The comparison might not be entirely fair: DataGrip, like the Pro versions of the JB IDEs, is a commercial product, while Airflow is OSS.

Taragolis avatar Sep 06 '22 15:09 Taragolis

Wrong box to open :). This is not part of it.

But it does almost the same thing as PostgresHook.get_iam_token and MySqlHook.get_iam_token - granting access using the AWS IAM auth method. These methods have existed for a very long time (maybe even before providers existed).

In the case of airflow.providers.amazon.aws.hooks.base_aws.BaseSessionFactory._get_google_identity_token_loader, it uses GCP to auth into AWS. So if we apply the same logic, this method should live in the Google provider rather than the Amazon provider.

Yes. The difference is WHO is doing it. If the code is under the stewardship of AWS, they choose to do it and maintain it. In Postgres there is no one to maintain it. This is the main criterion. You are using the wrong comparison. It's not about what code it is, but about who is interested in maintaining it and reacting when there is a problem.

To put it simply: if in the future a problem with a Google library causes a problem with this operator, the AWS people will have to fix it. And if - for example - it causes a conflict with other providers (say Google), we might choose not to release the AWS provider until the AWS people fix the conflict. Plain and simple. AWS is impacted, they fix it. With the splitting of providers, I want AWS people to subscribe to the aws repo and be aware of all the things they will need to fix.

When the Google provider uses AWS for authentication - it still holds. It's the Google team that should fix any problems that might undermine their way of interfacing with AWS.

With Postgres, the situation is different. If the same problem happens with the Postgres provider, it will hold the Postgres provider back from being released. So a problem with AWS would hold up a new release of the Postgres provider. This is not something we want.

Again - your parallels are not entirely correct - it's not about "what code is needed". I agree it is similar code. But IMHO that does not matter. What matters is "who has the incentive to fix it if it is broken". This is the fundamental difference. Think "human", "team", "responsibility", "governance", not "code".

In an ideal world, auth for different hooks would be pluggable. In that case a user could auth to Postgres in different ways (explicit credentials, pgpass, GSSAPI, AWS IAM, etc.); the same could also work for Docker registries (explicit credentials, ECR get-token) as well as for additional methods to auth into the AWS cloud.

I don't think it's an "ideal world". It's just one of the choices. Right now in Airflow, authentication is very closely connected to the Connection and the corresponding Hook. Basically, auth is THE ONLY functionality that a Hook provides on top of most of the integrations. You give it a connection id and it gives you back some form of a client that you can use. Other than that, hooks are not very useful - they just pass parameters through to the underlying "clients". Quite often a particular Hook returns you directly a Python client that you could instantiate and use yourself. The actual "benefit" of using Hooks is that you can pass them a connection id. That's pretty much it. And this is a big and important feature.

Currently, the way to implement different authentications is through inheritance (if we want to do the handling at the Hook level). If you need an extended Hook with new authentication, you extend the one that is there and add the authentication. This fits Airflow's model very well - a Hook is a class, it is extendable, and it can be passed a connection id. A Postgres Hook should get a postgres connection id, and so on. If you want to implement a Postgres Hook with a connection id from Google, you should implement a GooglePostgresHook "child" that can take a Google Connection as an initialization param. Or if you want to use AWS authentication - implement an AWSPostgresHook "child" that can take an "AWS connection" as an initialization param. This is the model used currently. It is a different model from "pluggable" authentication - you "extend" rather than "inject" the authentication.
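As a rough illustration of the extension model (the child class below is hypothetical, not an existing Airflow class):

from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook
from airflow.providers.postgres.hooks.postgres import PostgresHook


class AwsIamPostgresHook(PostgresHook):
    """Hypothetical child: swaps the static password for a short-lived RDS IAM token."""

    def __init__(self, *args, aws_conn_id="aws_default", **kwargs):
        super().__init__(*args, **kwargs)
        self.aws_conn_id = aws_conn_id

    def get_iam_token(self, conn):
        # Generate the token through the AWS connection instead of the
        # in-tree implementation that builds its own AWS session.
        rds_client = AwsBaseHook(aws_conn_id=self.aws_conn_id, client_type="rds").get_conn()
        port = conn.port or 5432
        token = rds_client.generate_db_auth_token(conn.host, port, conn.login)
        return conn.login, token, port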

We might have different views on which model is better. I personally think "pluggable authentication" is far more difficult to turn into a generic solution applicable to multiple types of hooks. I'd say Airflow's model - where you extend functionality and add authentication as you need it in a "child" class - is much better, as it saves you from defining a common "authentication" layer that would have to handle a lot of complexity. A good example is the impersonation chain from GCP, which is very specific to GCP; implementing a common "authentication" model that would handle GCP, Azure, AWS, Databricks, and a big number of other options - via HTTP, proprietary database connection types, OpenID, OAuth, etc. - is far more complex and error-prone.

But no matter what our thinking on more/less complex is - the "extension" model is what is currently used in Airflow.

potiuk avatar Sep 11 '22 23:09 potiuk

Again - your parallels are not entirely correct - it's not about "what code is needed". I agree it is similar code. But IMHO that does not matter. What matters is "who has the incentive to fix it if it is broken". This is the fundamental difference. Think "human", "team", "responsibility", "governance", not "code".

It depends which way you look at it. To me, both look the same: additional methods added years ago and not well maintained or documented:

  • IAM auth in Postgres Hook #5223
  • Google federation for AWS #12079

The reason why IAM auth exists in the postgres provider and not in amazon, and why Google federation exists in the amazon provider and not in google, is clear to me - no other way to extend authentication exists.

To be clear, I agree that IAM auth should move out of the postgres/mysql providers. I'm just thinking about an appropriate replacement which would be easy to maintain and would make it easy to add additional integrations with other cloud providers.

Currently, the way to implement different authentications is through inheritance

The issue is that right now, if you extend a hook, you also need to re-implement all the operators. So if we create an AwsPostgresHook, we would then also need to implement, in addition, all the existing operators and task-flow decorators, like an AwsPostgresToGCSOperator.

It could be done the same way it is done for DBApiHook; however, DBApiHook is not only about auth - it is also about supporting the specific protocol of a specific database.

However, for Postgres, Docker, MySQL, and maybe others, it is only about how to obtain credentials and pass them to the specific client. That might be a matter of executing some pluggable class. In that case it might require some changes in core, so it would not be available in providers until the minimum supported Airflow version meets the requirement.

And it might be more appropriate to create an AIP and discuss the different approaches and their benefits. WDYT?

Taragolis avatar Sep 15 '22 12:09 Taragolis

@mik-laj @potiuk Let's finalise the discussion. We are not ready to add auth to Amazon ECR the way it is described in this PR, for different reasons:

  1. We would need to move the auth-to-AWS code from the postgres provider to amazon
  2. We cannot use import_string for untrusted modules, but right now, if we move EcrDockerCredentialHelper to aws, I do not see any way to do that without specifying it somewhere in the postgres hook.

So maybe there is some additional way to help users integrate with ECR if they need it? It could be a page in the documentation, like this snippet https://github.com/apache/airflow/issues/9707#issuecomment-773353852, or an Operator which could update a specific docker connection with ECR credentials.

WDYT guys?

Taragolis avatar Sep 15 '22 12:09 Taragolis

Why not use the opportunity to add the auth mechanism now? I do not think this needs a super-generic solution that will work for all hooks. I think there are a few operators that really need this kind of pluggable authentication, and this DockerHook change is a good opportunity to add it.

I would say we need something very similar to what the requests library uses, where you can pass an auth object https://requests.readthedocs.io/en/latest/user/authentication/ - you could add an "auth" class to pass to DockerHook and the Docker operators: a "DockerAuth" Protocol / abstract class that would handle the authentication.
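A minimal sketch of what passing such an auth object could look like (DockerRegistryAuth and the registry_auth parameter are made-up names for illustration, not existing Airflow APIs):

from typing import NamedTuple, Protocol


class DockerRegistryCredentials(NamedTuple):
    username: str
    password: str
    registry: str


class DockerRegistryAuth(Protocol):
    """Hypothetical interface the hook would call right before docker login."""

    def get_credentials(self) -> DockerRegistryCredentials:
        ...


# Hypothetical wiring - an Amazon-provider implementation would fetch a fresh
# 12-hour ECR token inside get_credentials():
#   DockerHook(docker_conn_id=None, registry_auth=EcrRegistryAuth(aws_conn_id="aws_default"))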

Once we get it implemented here, we could apply a similar approach to Postgres and fix the current behaviour, and to any other class that does it in a similar way. I think there could be various auth classes in "Amazon" - one for Docker, one for Postgres, etc. - and they could share some common code if needed.

Do you see any drawbacks of this approach @Taragolis ?

potiuk avatar Sep 19 '22 06:09 potiuk

@potiuk Right now I think this is the only approach which could cover:

  • Auth living in provider packages or user-defined scripts
  • Not using import_string

It only requires small changes in the current operators, and there is no need to re-implement everything when only the auth method differs. This PR is also still in draft for one reason: I don't have a good idea of the best place to write the documentation on how to use the Auth.

It might be:

  • In the docker provider: information on how to implement a custom class and what DockerHook actually sends to this class
  • In the amazon provider: information about what the user should provide to the ECR-related class

And I would also like to add some information about other Hooks and things which might (or might not) be nice to implement in the future. This is not related to the current PR.


HttpHook

Right now my daily workload does not require it, but a couple of years ago we used it very actively. I don't remember, but maybe in 1.10.4 HttpHook did not even have this parameter for requests Auth (today I'm too lazy to check).

But the major issue currently is that you need to create a new hook and overwrite the get_conn method if you need to provide anything other than the login and password from the connection. So maybe it is also a good idea to implement some generic way to grab credentials from a Connection and provide them to HttpHook, so it wouldn't require re-creating the hook and operators in cases where only a custom Auth is required.

I found this code in an old project (note: the code was written for 1.10.x):

Custom Auth

from datetime import datetime, timedelta

import jwt  # PyJWT, used by AppStoreAuth below
from requests.auth import AuthBase


class BearerAuth(AuthBase):
    """ Bearer Authorization implementation for requests """

    def __init__(self, token):
        self._token = token

    def __call__(self, r):
        r.headers["Authorization"] = "Bearer %s" % self._token
        return r


class SomeDumbAuth(AuthBase):
    """ Authorization for some private API """

    def __init__(self, key, secret):
        self._key = key
        self._secret = secret

    def __call__(self, r):
        if not r.body:
            r.body = ""
        if r.body:
            b = {x.split("=")[0]: x.split("=")[1] for x in r.body.split("&")}
        else:
            b = {}

        b.update({
            "some_key": self._key,
            "some_secret": self._secret,
        })
        r.body = "&".join("%s=%s" % (k, v) for k,v in b.items())

        return r


class AppStoreAuth(AuthBase):
    """ AppStore Authorization implementation for requests """

    def __init__(self, private_key, key_id, issuer_id):
        self._private_key = private_key
        self._key_id = key_id
        self._issuer_id = issuer_id

    def __call__(self, r):
        headers = {
            "alg": 'ES256',
            "kid": self._key_id,
            "typ": "JWT"
        }
        payload = {
            "iss": self._issuer_id,
            "exp": int((datetime.now() + timedelta(minutes=20)).strftime("%s")),
            "aud": "appstoreconnect-v1"
        }

        token = jwt.encode(
            payload=payload,
            key=self._private_key,
            algorithm='ES256',
            headers=headers
        ).decode(encoding="utf-8")

        r.headers["Authorization"] = "Bearer %s" % token
        return r
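For completeness, attaching one of the classes above to a plain requests session (which is also what the hook below returns from get_conn) looks like this; the token value is a placeholder:

import requests

session = requests.Session()
session.auth = BearerAuth("my-api-token")
response = session.get("https://api.example.com/resource")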

Hook which uses the custom auth from the connection

import requests

from airflow.hooks.http_hook import HttpHook  # Airflow 1.10.x import path


class AppStoreSalesHttpHook(HttpHook):
    """ HTTP Hook for AppStore connect API """

    def __init__(self, endpoint, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.endpoint = endpoint

    def get_conn(self, headers: dict = None):
        """ Returns http requests session for AppStore connect use with requests

        Args:
            headers: additional headers to be passed through as a dictionary
        """
        session = requests.Session()
        if self.http_conn_id:
            conn = self.get_connection(self.http_conn_id)

            if conn.password:
                private_key = conn.password.replace("\\n", "\n")
                key_id = conn.extra_dejson.get('KeyId')
                issuer_id = conn.extra_dejson.get('IssuerId')

                session.auth = AppStoreAuth(
                    private_key=private_key,
                    key_id=key_id,
                    issuer_id=issuer_id,
                )
            else:
                raise ValueError("Missing extra parameters for connection %r" % self.http_conn_id)

            if conn.host and "://" in conn.host:
                self.base_url = conn.host
            else:
                # schema defaults to HTTP
                schema = conn.schema if conn.schema else "http"
                host = conn.host if conn.host else ""
                self.base_url = schema + "://" + host

            if conn.port:
                self.base_url = self.base_url + ":" + str(conn.port)

        if headers:
            session.headers.update(headers)

        return session

PostgresHook

It is not a big deal to change the hook to use a custom Auth. One thing I would change is dropping Redshift support from PostgresHook:

https://github.com/apache/airflow/blob/6045f7ad697e2bdb934add1a8aeae5a817306b22/airflow/providers/postgres/hooks/postgres.py#L191-L205

In the Amazon provider there are two different ways (and two different hooks) to interact with Redshift:

  1. DB-API by redshift-connector
  2. By AWS API and boto3

Even though Redshift uses the PostgreSQL protocol, I'm not sure whether psycopg2 officially supports Redshift, but psycopg (v3) does not officially support it - https://github.com/psycopg/psycopg/issues/122 .


MySQL

I'm still not sure that the current implementation actually allows using AWS IAM (I need to check). But also, for MySQL the Auth class needs to be able to specify which driver the user actually uses:

https://github.com/apache/airflow/blob/6045f7ad697e2bdb934add1a8aeae5a817306b22/airflow/providers/mysql/hooks/mysql.py#L169-L179


Extensible Connection / Pluggable Auth

In my head, this part is currently a bit more than just Auth. An Airflow Connection might be used for different things:

  • For authentication
  • For configuration client
  • Parameters for an API call (like the Amazon EMR Connection, which is actually not a connection and is used in only one place)

The main problem is how it is shown in the UI and how it is stored. A good example is the AWS Connection:

  • Quite a few different methods and extra parameters which actually won't work together
  • boto3 configurations

An extensible Connection might show a different UI depending on the selected Auth Type; it would also be nice to have separate tabs for Auth and Configuration.

But again, it is just an idea which in my head might be brilliant but in the real world would require a lot of changes. I haven't even tried to sketch on paper how it would work and integrate with Airflow and its components.

Taragolis avatar Sep 19 '22 11:09 Taragolis

Yep. I want to see how your proposal looks after the rebase, but yeah - the smaller the set of changes, the better for now.

potiuk avatar Sep 20 '22 12:09 potiuk

@potiuk I think it would require only small changes compared to the current PR:

  1. Rename BaseDockerCredentialHelper to something like BaseDockerRegistryAuth - "credential helper" might confuse some users, since utilities with the same name already exist in the docker stack.
  2. Move the ECR-related part to the amazon provider
  3. Assign it as a Hook parameter rather than via import_string
  4. Some small changes in the operators to allow providing these classes

But yeah, better to have a look after I make the changes. I hope it will happen within a couple of days.

Taragolis avatar Sep 20 '22 12:09 Taragolis

Fantastic!

potiuk avatar Sep 20 '22 13:09 potiuk

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 05 '22 00:11 github-actions[bot]

Will rebase and make changes. Totally forgot about this PR

Taragolis avatar Nov 05 '22 17:11 Taragolis

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 21 '22 00:12 github-actions[bot]