
SPIKE: Investigate what should be shared between server & client and determine if a common package is needed

kaxil opened this issue 6 months ago · 21 comments

With server-client separation, we need to determine what should be shared vs duplicated vs kept separate.

Should the providers with executors be split so that the executors can be included on the server side?

Categories of Potentially Shared Components

Core Interfaces and Protocols

Exception Hierarchies

  • AirflowException: Base exception class
  • AirflowSkipException: Task skipping exceptions
  • AirflowSensorTimeout: Sensor-specific exceptions
  • Common error types: Used by both server and SDK

Core Utilities

  • Module loading: import_string, import_module_safely
  • Trigger rules: TriggerRule enum and logic
  • Date/time utilities: Timezone handling, date parsing
  • String utilities: Template processing, formatting

Serialization Components

  • Schema definitions: JSON schemas for DAG representation
  • Schema versioning: Version management and compatibility
  • Validation logic: Schema validation and error handling

Configuration Components

  • Configuration interfaces: Protocol definitions for config access
  • Default values: Shared configuration defaults
  • Validation logic: Configuration validation rules
  • Environment handling: Environment variable processing

Shared Package Strategy Options

Option 1: Single Shared Package (airflow-protocols / airflow-commons)

  • Approach: All shared components in one package
  • Pros: Simple dependency management, single source of truth
  • Cons: Package may become large, mixed concerns
  • Use case: Clean separation with minimal package proliferation

Option 2: No Shared Package (Duplication)

  • Approach: Each package implements its own versions
  • Pros: Complete independence, no circular dependencies
  • Cons: Code duplication, maintenance overhead
  • Use case: When separation is more important than efficiency

Key Questions to Answer

  • Should the providers with executors be split so that the executors can be included on the server side?
  • How many packages is optimal? Trade-off between focus and complexity
  • What naming convention is clearest? For users and developers
  • How to handle backward compatibility? Migration path for existing users
  • What about provider dependencies? Which packages should providers depend on?
  • Testing strategy? How to test package combinations efficiently

Other open questions

  • Where should Triggers be? They are inherited in providers for various deferrable operators.

Definition of Done

  • All package structure options evaluated
  • Dependency and testing implications understood
  • Clear recommendation with migration path

kaxil avatar Jun 09 '25 18:06 kaxil

We are bound to need at least one separate package if we want to entirely separate dependencies. For example, the WeightRule enum is fundamentally a user-facing type that should live in the sdk, but the default value in BaseOperator if not given explicitly is configurable, and core needs the class to validate the configuration. There are multiple such things on both dag and task level.
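To make the WeightRule tension concrete, here is a minimal sketch (names and locations simplified for illustration; the real enum lives elsewhere in the Airflow codebase): the enum is a user-facing SDK type, yet core also needs it to validate the configured default.

```python
# Sketch only: illustrates why WeightRule is needed on both sides.
# The class names and the validation helper below are illustrative,
# not the actual Airflow implementation.
from enum import Enum


class WeightRule(str, Enum):
    """User-facing type: users pass it to BaseOperator(weight_rule=...)."""
    DOWNSTREAM = "downstream"
    UPSTREAM = "upstream"
    ABSOLUTE = "absolute"


def validate_default_weight_rule(configured: str) -> WeightRule:
    """Core-side use: validate the configurable default for BaseOperator."""
    try:
        return WeightRule(configured)
    except ValueError:
        valid = [r.value for r in WeightRule]
        raise ValueError(
            f"Invalid weight rule {configured!r}; expected one of {valid}"
        )
```

If the enum lives only in the SDK, core must import from the SDK to run this validation; if it lives only in core, the SDK must import from core. A shared bottom-layer package dissolves that cycle.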

uranusjr avatar Jun 17 '25 05:06 uranusjr

Yes. I think there are far too many things that we share for us to avoid a common package. Moreover, we should invest in making it a loose collection of independent things that do not depend on each other - for example, we should untangle all the conf + settings circular dependencies that we have currently.

While I understand being averse to all kinds of common/util things, I believe that if they are done right and they are:

  • small
  • "low layer" (i.e. they are only used by core they never use anything else from the "utils" or other parts of the code).
  • separated
  • not depending on each other

It's cool.

Moreover, with a uv workspace we can actually enforce most of that. Simply put, the "common" package should be at the "bottom" - i.e. "core" and "sdk" and whatever else needs it should have it as a workspace dependency, but it should not have a dependency on any other parts of the system.

Then - if we follow what I've implemented for all the other packages - if that common distribution has its own tests, they will be executed after `uv sync` is run in "common" - which means that no other distributions will be installed when running the tests. This is the enforcement we need, so that no circular dependencies, or otherwise depending on "stuff" from core in the "common" package, can creep in.
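A minimal sketch of what that enforcement could look like in a uv workspace (the distribution names and paths here are hypothetical, not the actual repo layout):

```toml
# Root pyproject.toml (hypothetical layout): declares the workspace members.
[tool.uv.workspace]
members = ["airflow-common", "airflow-core", "task-sdk"]

# airflow-core/pyproject.toml: "core" sits above "common" and pulls it in
# as a workspace dependency...
[project]
name = "apache-airflow-core"
version = "0.1.0"
dependencies = ["apache-airflow-common"]

[tool.uv.sources]
apache-airflow-common = { workspace = true }

# ...while airflow-common/pyproject.toml declares no workspace dependencies
# at all, so `uv sync` run inside it installs nothing else from the repo,
# and its tests fail fast on any accidental import from core.
```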

This has been my proposal from the beginning on the discussion and it has not changed.

potiuk avatar Jun 17 '25 08:06 potiuk

And I do not think we need a spike. For me it's just "common sense" (pun intended).

potiuk avatar Jun 17 '25 08:06 potiuk

As an example of a successful implementation of this pattern you can look at breeze https://github.com/apache/airflow/tree/main/dev/breeze/src/airflow_breeze/utils -> this is essentially "common" - a set of loose modules with various low-level functionalities that are used across multiple breeze commands. Not a separate distribution, but it shows the pattern I think makes sense.

potiuk avatar Jun 17 '25 08:06 potiuk

Also, a few opinionated answers from my side:

  • Should the providers with executors be split so that the executors can be included on the server side?

Yes

  • How many packages is optimal? Trade-off between focus and complexity

One common package with all the "small" utils that can be used, likely with some optional dependencies for utils that might or might not be needed. Usually the complexity and overhead comes from the dependencies that a given "common util" brings - not from the code of the util itself.

I'd say airflow-*: common-util, scheduler, api-server, triggerer, task-sdk (for all kinds of workers) is probably the best approach. And "apache-airflow" as a meta-package installing all of them together.

  • What naming convention is clearest? For users and developers

I'd vote for apache-airflow-* from the above list.

  • How to handle backward compatibility? Migration path for existing users

I think we should stick with releasing all those packages together in the same version (except task-sdk, which should always be backwards compatible until Airflow 4). This will help with testing - because we will not introduce a matrix of versions. At the same time, the package split will allow us to have far fewer dependencies in the different packages.

  • What about provider dependencies? Which packages should providers depend on?

task-sdk only. Task-sdk should not depend on common-util. Some duplication is likely there, but we should handle it at the pre-commit level (where task-sdk code - serialization, for example - is copied to common-util).
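A sketch of what that pre-commit-level copying could look like (the helper name and paths are made up for illustration): the hook re-copies the shared module into the other distribution and fails when the committed copy had drifted.

```python
# Hypothetical pre-commit helper: keep a vendored copy of a task-sdk module
# in sync inside common-util, instead of creating a runtime dependency.
import filecmp
import shutil
from pathlib import Path


def sync_vendored_module(src: Path, dst: Path) -> bool:
    """Copy src over dst. Return True if dst changed (the hook should then
    exit non-zero so the freshly synced copy gets committed)."""
    if dst.exists() and filecmp.cmp(src, dst, shallow=False):
        return False  # already in sync, nothing to do
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dst)
    return True
```

A pre-commit hook would call this once per shared module (serialization, for example) and report failure whenever any call returns `True`.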

  • Testing strategy? How to test package combinations efficiently

There should not be combinations. All packages should have the same version and should be released together - even if there are no changes in particular distribution. They should have == pinned dependencies to other apache-airflow things.

potiuk avatar Jun 17 '25 10:06 potiuk

We are bound to need at least one separate package if we want to entirely separate dependencies. For example, the WeightRule enum is fundamentally a user-facing type that should live in the sdk, but the default value in BaseOperator if not given explicitly is configurable, and core needs the class to validate the configuration. There are multiple such things on both dag and task level.

This one I think is doable by using the OpenAPI spec -- we already include some extra types that are not used in the schema, but exist purely for the purpose of having the type available automatically in the SDK

ashb avatar Jun 17 '25 11:06 ashb

  • Should the providers with executors be split so that the executors can be included on the server side?

Yes

In my head, separate providers and executors means we'd have distributions like apache-airflow-providers-cncf-kubernetes and apache-airflow-executor-kubernetes. Naming pattern not fixed, but something like that.

(This particular example is a little bit messy, as there is actually an awful lot shared between the KubePodOperator and the KubeExecutor. But right now I'm ignoring that point)

Is this also what you had in mind @potiuk ?

ashb avatar Jun 17 '25 11:06 ashb

  • Testing strategy? How to test package combinations efficiently

There should not be combinations. All packages should have the same version and should be released together - even if there are no changes in particular distribution. They should have == pinned dependencies to other apache-airflow things.

A tight coupling between task-sdk and any "server side" component is the opposite of one of the goals of AIP-72 (I'm not sure we ever explicitly said this, but the first point of motivation for the AIP says "Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or python packages")

In short, my goal with TaskSDK, and the reason for introducing CalVer and Cadwyn with the execution API, is to end up in a world where you can upgrade the Airflow Scheduler/API server independently of any worker nodes (with the exception that the server must be at least as new as the clients)

This ability to have version skew is pretty much non-negotiable to me and is (aside from support for other languages) one of the primary benefits of AIP-72

ashb avatar Jun 17 '25 11:06 ashb

A tight coupling between task-sdk and any "server side" component is the opposite of one of the goals of AIP-72 (I'm not sure we ever explicitly said this, but the first point of motivation for the AIP says "Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or python packages")

I think we agree here for task-sdk. My == comment was only for non-task-sdk packages (i.e. all packages resulting from splitting airflow-core).

As explained above - task-sdk should be fully standalone, should duplicate code if needed, and should be independently versioned. And providers should depend exclusively on task-sdk. Task SDK should have everything needed to run any kind of worker and should not depend on any parts of airflow-core, including "airflow-common-utils".

But all the rest of packages (core, common-util, api-server if we have it etc.) should be all pinned with the same version - i.e. tested and released together.

potiuk avatar Jun 17 '25 12:06 potiuk

And following what we already do with providers (but the other way round) - we should be running the complete suite of tests of all previously released task-sdk versions that we want to support against the new "airflow-core" suite of packages, i.e. run the tests for those different task-sdk versions against the "main" version of the airflow-core suite of packages. This is the only way we can actually be more-or-less sure we have not broken something accidentally. There should be a suite of tests on Task SDK that actually runs against a locally running "api-server", where the "api-server" code comes from "main" but "task-sdk" and "task-sdk-tests" come from past versions.

This is also why "task-sdk" should be a completely standalone package. It should be everything that is needed, for example, by the celery worker to run tasks, or everything that should be installed in the container image for the "kubernetes" executor, or everything needed to be installed by the "edge executor". Generally, providers should only ever import from task.sdk and never import anything from other airflow distributions.

In my head, separate providers and executors means we'd have distributions like apache-airflow-providers-cncf-kubernetes and apache-airflow-executor-kubernetes. Naming pattern not fixed, but something like that.

Yes. And that's fine - "apache-airflow-executor-cncf-kubernetes" can still depend on the code from "apache-airflow-providers-cncf-kubernetes", which will depend only (and exclusively) on task-sdk. This means that yes - you should install "task-sdk" and "apache-airflow-providers-cncf-kubernetes" (transitively) on the scheduler - but this is IMHO completely no problem.

potiuk avatar Jun 17 '25 12:06 potiuk

And I do not think we need a spike. For me it's just "common sense" (pun intended).

Determining what is shared is an important part of the spike, as the title mentions :)

task-sdk only. Task-sdk should not depend on common-util. Some duplication is likely there, but we should handle it at the pre-commit level (where task-sdk code - serialization, for example - is copied to common-util).

That's the whole point of this discussion though! The client is the Task SDK here: do we want to duplicate the serialization code and module between common & task SDK? I don't think so; I disagree.

I'd say airflow-*: common-util, scheduler, api-server, triggerer, task-sdk (for all kinds of workers) is probably the best approach. And "apache-airflow" as a meta-package installing all of them together.

That's where we disagree too. The server IMO should consist of the Scheduler & API-server. They don't need to be separate packages.

There should not be combinations. All packages should have the same version and should be released together - even if there are no changes in particular distribution. They should have == pinned dependencies to other apache-airflow things.

If all packages have pinned versions, you can't independently update client or server. "Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or python packages" like Ash mentioned

kaxil avatar Jun 17 '25 12:06 kaxil

If all packages have pinned versions, you can't independently update client or server. "Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or python packages" like Ash mentioned

I think you misunderstood my description. I think only task-sdk should be the client, and it should be "all" you need to run the client - even if it means duplication of some code (serialization). And you can do that duplication smartly, by symbolic linking or copying relevant code between distributions in pre-commit.

IMHO 'task.sdk' should be a full, standalone, separately versioned thing, the only thing imported by providers, and the only thing that is needed to run any kind of worker. You should not need any package other than "airflow-task-sdk" to be installed to make a "celery worker" or "kubernetes image" work. That's the client. Versioned separately. With all "supported" versions (going back a few releases following the agreed deprecation schedule) tested against the current "main" version of the server automatically in CI.

I also think the only reason we should split "airflow-core" (and that's the motivation I have) is to shard dependencies - and make different parts of the system (scheduler, api-server, triggerer) use potentially different sets of dependencies - especially api-server, which should be split from the rest. So introducing "common" is for me the way to let all those separate "distributions" share common utils like logging, conf, etc. - but all of them should always come from the same "commit" when installed together (i.e. should be pinned to the same version that they were tested at in the "main" and "v3_*" branches).

Otherwise it's going to be a mess.

potiuk avatar Jun 17 '25 13:06 potiuk

I think you misunderstood my description. I think only task-sdk should be the client, and it should be "all" you need to run the client - even if it means duplication of some code (serialization). And you can do that duplication smartly, by symbolic linking or copying relevant code between distributions in pre-commit.

Yup, on board with that. We might not even need to duplicate the serialization module between server and client, but we will need to prove/POC that - I might have created a GH issue for it.

Agreed on providers relying on Task SDK alone, apart from the executors. We can make those independent too, but we just need to think through the BaseExecutor interface, which is used in the Scheduler too.


I also think the only reason we should split "airflow-core" (and that's the motivation I have) is to shard dependencies - and make different parts of the system (scheduler, api-server, triggerer) use potentially different sets of dependencies - especially api-server, which should be split from the rest. So introducing "common" is for me the way to let all those separate "distributions" share common utils like logging, conf, etc. - but all of them should always come from the same "commit" when installed together (i.e. should be pinned to the same version that they were tested at in the "main" and "v3_*" branches).

Could we maybe tackle and discuss that separately, if you agree? The primary reason for my request is that by keeping the current "core" as server and the Task SDK as client - once we truly decouple them - we can achieve one user goal: independent upgrades for server & client. From the release POV, that won't be much different than how we do it today, since they are separate packages. Splitting the API-server, Scheduler, and other packages would be more involved, and we can do that separately, in my humble opinion, if you agree.

kaxil avatar Jun 17 '25 13:06 kaxil

And yes - I think we want to split packages for different reasons :) - that's where the confusion comes from. You want to split "client vs. server" - but introducing a "common" distribution there (even for serialization) is going to create a lot of problems - mostly because of the conflicting dependencies that will inevitably happen when we try to have a "common" package working with separately versioned "client" and "server".

So I think we should do everything to make "task-sdk" (client) fully standalone and separate, not sharing any "installable" code with api-server (server). They might share some code by duplication, but they should never "import" common code from a shared location. That's a recipe for disaster.

So my motivation for splitting "core" is completely different. If you want to have one binary covering "api-server", "scheduler" (and let's not forget "triggerer" and "dag processor") - then it makes almost zero sense to have a "common" distribution. It only makes sense IMHO if we decide to have several distributions - one for each of the "components" - and each of those distributions will use a subset of "common" and have a separate resulting set of dependencies to rely on.

Other than that "task.sdk" should never use that "common code".

One of the problems we are going to have is "triggerer" and "dag processor", because effectively both of them have to have both "core" parts and "task-sdk" parts installed. If you version task-sdk separately and "core" separately, have common code used by both, and then try to have "triggerer" and "dag processor" effectively use:

  • latest core
  • some version of common
  • whatever version of task-sdk - identical and compatible with the "worker" task-sdk.

This is going to spin out of control very very very quickly.

potiuk avatar Jun 17 '25 13:06 potiuk

Agreed on Providers relying on Task SDK alone apart from the Executors one. Can make it independent too but just need to think through the BaseExecutor interface which is used in the Scheduler too.

I think we should have a completely separate executor base distribution - and executor distributions should depend on it as well as on providers/task.sdk. I think this can also be achieved by having... a cncf-kubernetes-common distribution, for example:

  • cncf.kubernetes.provider -> cncf.kubernetes.common + task.sdk
  • cncf.kubernetes.executor -> cncf.kubernetes.common (no task.sdk dependency).

Splitting API-server and Scheduler and other packages would be more involved and we can do that separately in my humble opinion, if you agree.

I do - but I think we should design the target architecture we want to achieve even if we get there in steps; it will make it easier to make proper decisions if we know the "north star" we aspire to. I am a big fan of at least imagining where I am going when I take the first step out of the door; that allows me to decide where to go next knowing whether it puts me closer to the goal or farther away.

potiuk avatar Jun 17 '25 13:06 potiuk

One of the problems we are going to have is "triggerer" and "dag processor", because effectively both of them have to have both "core" parts and "task-sdk" parts installed. If you version task-sdk separately and "core" separately, have common code used by both, and then try to have "triggerer" and "dag processor" effectively use:

Yup, hasn't been forgotten https://github.com/apache/airflow/issues/51552 :)

kaxil avatar Jun 17 '25 14:06 kaxil

I think we should have a completely separate executor base distribution - and executor distributions should depend on it as well as on providers/task.sdk. I think this can also be achieved by having... a cncf-kubernetes-common distribution, for example:

Yeah, agreed

kaxil avatar Jun 17 '25 14:06 kaxil

Yup, hasn't been forgotten https://github.com/apache/airflow/issues/51552 :)

Yeah - but you can't solve the knot "partially" if you do not know how you are going to solve #51552.

Just to show you how I imagine the target architecture (simplified dependency tree):

* triggerer
         - common
         - task-sdk

* dag-processor
         - common
         - task-sdk

* scheduler
         - common
         - executors (all of them, potentially)

* api-server
         - common

* celery.executor
         - celery.common
         - common (including executor base)

* kubernetes.executor
         - kubernetes.common
         - common (including executor base)

* kubernetes.provider (implements worker side)
         - kubernetes.common
         - task.sdk

* celery.provider (implements worker side)
         - celery.common 
         - task.sdk

Each of those distributions will use a subset of things from common, and will use optional extras from common as dependencies for the things it uses (for example, executors will depend on "common[executor, logging, config]", etc.).
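For illustration, that extras mechanism could look roughly like this in the common distribution's metadata (the names, version, and extra groups are hypothetical):

```toml
# Hypothetical common/pyproject.toml fragment: each extra groups the
# third-party dependencies needed by one area of the common code.
[project]
name = "apache-airflow-common"
version = "3.1.0"
dependencies = []  # the base of "common" stays dependency-light

[project.optional-dependencies]
executor = []   # deps needed only by the executor base classes
logging = []    # deps needed only by the shared logging utils
config = []     # deps needed only by the shared config handling

# A component distribution then selects only what it uses, pinned to the
# same release, e.g.:
#   dependencies = ["apache-airflow-common[executor,logging,config]==3.1.0"]
```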

I believe this is the target setup we should have, and if we can agree on it, we can then define the first, second, and third steps to get there.

But we need to agree where we are going first. The final architecture cannot be designed incrementally without imagining the final setup.

potiuk avatar Jun 17 '25 14:06 potiuk

Yeah - but you can't solve the knot "partially" if you do not know how you are going to solve https://github.com/apache/airflow/issues/51552.

It is part of the same parent GH issue, check the GH dependency.

kaxil avatar Jun 17 '25 14:06 kaxil

It is part of the same parent GH issue, check the GH dependency.

Sure. But I am saying that splitting dag-processor and triggerer out of the common core is necessary, and you need to know what your target split should look like and what makes sense as a "target". And splitting out common code is IMHO part of the process.

And yes, you can try to do it incrementally, without discussing what you really need to do to untangle the knot we have - but I think it's part of the same design process.

potiuk avatar Jun 17 '25 15:06 potiuk

Yup, no disagreement on that

Sure. But I am saying that splitting dag-processor and triggerer out of the common core is necessary, and you need to know what your target split should look like and what makes sense as a "target". And splitting out common code is IMHO part of the process.

kaxil avatar Jun 17 '25 16:06 kaxil

BTW, is there a reason we are not discussing this on the devlist? I think this is a very important topic - how Airflow will be split into distributions - and something that will haunt us if we do it wrong. While yes, it could be that "whoever takes on the task" will make it and apply their own understanding, anyone doing it might miss a number of consequences - so I think we should design the split, discuss it, and possibly even create an AIP for it. There was a document started by @ashb that we initially commented on.

Is there any reason we do not want to share that document and start discussing it on the devlist? I believe it's very clear (at least to me) what problems are still to be solved with task.sdk - mainly circular dependencies of various sorts. And I think when we apply more people's wisdom and ideas we might come up with a better design, and we can discuss the long-term consequences - especially since there are a number of people (including me, but also @eladkal @bugraoz93 @gopidesupavan and @amoghrajesh and others) who have been experiencing, witnessing, and solving problems with common packages and cross dependencies.

This is not a simple thing, and if we do it wrong, it will really haunt us. So I'd say we should start a devlist discussion on it.

What's stopping us from that?

potiuk avatar Jun 18 '25 08:06 potiuk

Because this discussion has mostly gone down a different path than we first imagined.

We'll get something to the dev list "shortly"

ashb avatar Jun 18 '25 15:06 ashb

I think we know what we are doing here and have a pattern to follow.

Closing this issue.

ashb avatar Sep 01 '25 13:09 ashb