dolphinscheduler [DSIP-91][Api] Workflow debugging and release optimization

Search before asking

[x] I had searched in the DSIP and found no similar DSIP.

Motivation

Current Workflow Issues

Operational Process:

Development Process: When developing a workflow, the following steps must be executed: 1. Return to the workflow list -> 2. Deploy the workflow -> 3. Run the workflow -> 4. Check the runtime logs.
Modification Process: Adjusting workflow content is more complex: 1. Return to the workflow list -> 2. Undeploy the workflow -> 3. Enter workflow editing -> 4. Return to the workflow list -> 5. Deploy the workflow -> 6. Run the workflow -> 7. Check the runtime logs.

Technical/Design Factors:

The workflow lacks a development state. While this simplifies logic and reduces bugs, it shifts complexity to the user.
The design prioritizes production stability, requiring users to be more cautious when making adjustments.

Desired:

Operational Process Optimization: Introduce a dedicated development workflow, separate from the production workflow (which can only be deployed from the development environment and cannot be created directly).

Development Workflow Process: Only requires: 1. Modify workflow content -> 2. Run the workflow -> 3. Check the runtime logs.
Deployment Process: Publish a development workflow to the production workflow (overwriting if previously deployed).

Technical/Design Optimization:

Development and production workflows are stored in separate tables with completely independent data.
Development workflows lack a deployment state, allowing debugging. Running workflows and task instances are isolated from production.
During deployment, the current version of the workflow is copied to the production workflow record, with an association logged (for subsequent overwrites).
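Below is a minimal in-memory sketch of the publish step just described. All names are hypothetical and are not part of DolphinScheduler's API: `DevWorkflow`, `ProdWorkflow`, `Store`, `publish`, and the `associations` map. The first publish creates a production record and logs the dev-to-prod association. Every later publish follows that association and overwrites the same record.

```python
import copy
import itertools
from dataclasses import dataclass, field


@dataclass
class DevWorkflow:
    code: int
    name: str
    version: int
    tasks: dict  # task name -> simplified task definition


@dataclass
class ProdWorkflow:
    code: int
    name: str
    source_version: int  # dev version this production copy was taken from
    tasks: dict


@dataclass
class Store:
    prod: dict = field(default_factory=dict)          # prod code -> ProdWorkflow
    associations: dict = field(default_factory=dict)  # dev code -> prod code
    _codes: itertools.count = field(default_factory=lambda: itertools.count(1000))


def publish(store: Store, dev: DevWorkflow) -> ProdWorkflow:
    """Copy the current dev version into production, creating or overwriting."""
    prod_code = store.associations.get(dev.code)
    if prod_code is None:
        # First publish: allocate a production record and log the association.
        prod_code = next(store._codes)
        store.associations[dev.code] = prod_code
    # Deep copy so later edits to the dev workflow never leak into production.
    prod = ProdWorkflow(prod_code, dev.name, dev.version, copy.deepcopy(dev.tasks))
    store.prod[prod_code] = prod
    return prod
```

Because the association is keyed by the dev workflow's code, renaming the dev workflow does not break later overwrites.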

Design Detail

Add a label field to workflows to carry additional information, such as whether a workflow is a development or a production workflow. Labels can also be user-defined.

Add an overwrite (replace) operation to the workflow actions: click it and select a target workflow that is not online (the target may be in another project). The operation supports comparing the differences between corresponding nodes of the two workflows, and the overwrite applies to the entire workflow.
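The sketch below shows the rules of the overwrite operation under stated assumptions. The target must not be online, the source may come from any project, and the whole task set is replaced. `ReleaseState`, `Workflow`, `OverwriteError`, and `overwrite` are hypothetical stand-ins, not DolphinScheduler classes.

```python
from dataclasses import dataclass
from enum import Enum


class ReleaseState(Enum):
    OFFLINE = "OFFLINE"
    ONLINE = "ONLINE"


@dataclass
class Workflow:
    project: str
    name: str
    release_state: ReleaseState
    tasks: dict  # task name -> simplified task definition


class OverwriteError(Exception):
    pass


def overwrite(source: Workflow, target: Workflow) -> Workflow:
    """Replace the entire content of `target` with `source` (any project)."""
    if target.release_state is ReleaseState.ONLINE:
        # Only non-online targets may be overwritten; take the target offline first.
        raise OverwriteError(f"{target.project}/{target.name} is online")
    # Whole-workflow replacement: every task comes from the source (shallow copy
    # per task), and tasks that exist only in the target are dropped.
    # The target keeps its own project and name.
    target.tasks = {name: dict(task) for name, task in source.tasks.items()}
    return target
```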

Support filtering workflows by label.
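Here is a small sketch of label filtering. It assumes each workflow carries a single optional label, as in the label-field design above. `WorkflowSummary`, `filter_by_label`, and `labels_in_project` are hypothetical names. The second helper collects the distinct labels in a project, which is what a label drop-down would need.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class WorkflowSummary:
    project: str
    name: str
    label: Optional[str]  # hypothetical single label, e.g. "dev" or "prod"


def filter_by_label(workflows: Iterable[WorkflowSummary],
                    label: Optional[str]) -> List[WorkflowSummary]:
    """Return workflows whose label equals `label`; None means no filter."""
    if label is None:
        return list(workflows)
    return [w for w in workflows if w.label == label]


def labels_in_project(workflows: Iterable[WorkflowSummary], project: str) -> List[str]:
    """Distinct non-empty labels used in one project, sorted."""
    return sorted({w.label for w in workflows if w.project == project and w.label})
```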

Compatibility, Deprecation, and Migration Plan

This change has compatibility implications: the new table is required, existing data must be migrated, and the init-job needs supplementary processing to handle it.
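One way to handle the migration and init-job step is sketched below. It uses SQLite in memory and assumes the label ends up as a column on the workflow definition table, as the later discussion suggests. The table name `workflow_definition` is a simplified, hypothetical stand-in. The backfill value for pre-existing workflows is a parameter because that choice is still open. The migration checks for the column first, so running it twice is safe.

```python
import sqlite3

TABLE = "workflow_definition"  # hypothetical stand-in, not the real DolphinScheduler table


def has_column(conn: sqlite3.Connection, table: str, column: str) -> bool:
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))


def migrate_add_label(conn: sqlite3.Connection, default_label: str = "") -> None:
    """Idempotently add a `label` column and backfill rows created before the upgrade."""
    if not has_column(conn, TABLE, "label"):
        conn.execute(f"ALTER TABLE {TABLE} ADD COLUMN label VARCHAR(64)")
    # Existing rows have NULL in the new column; give them the default label.
    conn.execute(f"UPDATE {TABLE} SET label = ? WHERE label IS NULL", (default_label,))
    conn.commit()
```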

Test Plan

Upgrade verification: workflows created in an older version, together with existing (stock) data, still run after the upgrade.
Functional testing.
Unit tests (UT) and updates to the E2E tests.

Code of Conduct

[x] I agree to follow this project's Code of Conduct

Jun 12 '25 09:06 det101

Please use English in screenshots. @det101

Jun 12 '25 09:06 SbloodyS

Please use English in screenshots. @det101

Okay.

Jun 12 '25 09:06 det101

Development and debugging are inconvenient.

From my understanding, you want to test workflows and tasks in a test environment and then release them to a production environment? Feel free to correct me if I'm wrong.

Jun 12 '25 09:06 SbloodyS

Development and debugging are inconvenient.

From my understanding, you want to test workflows and tasks in a test environment and then release them to a production environment? Feel free to correct me if I'm wrong.

Yes. I would like to avoid repeatedly taking the workflow offline and back online during debugging, and to release it to production once debugging is complete.

Jun 12 '25 09:06 det101

Yes. I would like to avoid repeatedly taking the workflow offline and back online during debugging, and to release it to production once debugging is complete.

In that case, I think we should introduce namespaces to isolate environments instead of modifying the existing workflows and tasks. Workflows and tasks are not the only things that need isolation: the data source center, resource center, queues, tenants, worker groups, permissions, and so on also need it. We should then provide a way to publish from one namespace to another.

Jun 12 '25 10:06 SbloodyS

Yes. I would like to avoid repeatedly taking the workflow offline and back online during debugging, and to release it to production once debugging is complete.

In that case, I think we should introduce namespaces to isolate environments instead of modifying the existing workflows and tasks. Workflows and tasks are not the only things that need isolation: the data source center, resource center, queues, tenants, worker groups, permissions, and so on also need it. We should then provide a way to publish from one namespace to another.

My proposal is mainly about optimizing the debugging and release process within the existing system. Of course, introducing namespaces would make the isolation more thorough, and isolation of the data warehouse environment might also need to be considered. That is a large project that would require more contributors.

Jun 12 '25 10:06 det101

My proposal is mainly about optimizing the debugging and release process within the existing system. Of course, introducing namespaces would make the isolation more thorough, and isolation of the data warehouse environment might also need to be considered. That is a large project that would require more contributors.

We should consider a more comprehensive design, so that as we keep extending related features we don't have to carry out a complete rework. I don't think it is wise to leave a pitfall for the future just to introduce one feature.

Jun 13 '25 01:06 SbloodyS

My proposal is mainly about optimizing the debugging and release process within the existing system. Of course, introducing namespaces would make the isolation more thorough, and isolation of the data warehouse environment might also need to be considered. That is a large project that would require more contributors.

We should consider a more comprehensive design, so that as we keep extending related features we don't have to carry out a complete rework. I don't think it is wise to leave a pitfall for the future just to introduce one feature.

I looked into DataWorks. It has no concept of projects; workflows sit directly under its namespace. In addition, its project spaces are physically isolated, which is reasonable on a public cloud. If we add namespaces on top of projects, will the benefits justify the added complexity? We also cannot achieve physical isolation, and DS does not yet seem to have a public-cloud deployment model. (Of course, we have adapted DS for our cloud so that workers run in the user's VPC. We have not introduced workspaces, and so far that has not caused any problems.)

Jun 13 '25 03:06 det101

Yes. I would like to avoid repeatedly taking the workflow offline and back online during debugging, and to release it to production once debugging is complete.

In that case, I think we should introduce namespaces to isolate environments instead of modifying the existing workflows and tasks. Workflows and tasks are not the only things that need isolation: the data source center, resource center, queues, tenants, worker groups, permissions, and so on also need it. We should then provide a way to publish from one namespace to another.

We can first implement a simple debug execution channel for workflows, which works in the same way as running workflows after deployment. It should reuse existing parameters such as data sources and YARN queues. A distinction between debug-mode workflows and production-mode workflows should be made in the database tables.
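The debug execution channel can be sketched as follows, assuming a hypothetical `run_mode` marker stored on each workflow instance. `RunMode`, `WorkflowInstance`, and `InstanceStore` are illustrative names only. Debug runs use the same start path as production runs but skip the online check. Because each instance records its mode, debug and production instances can be listed separately.

```python
import itertools
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class RunMode(Enum):
    DEBUG = "DEBUG"
    PRODUCTION = "PRODUCTION"


@dataclass
class WorkflowInstance:
    id: int
    workflow_code: int
    run_mode: RunMode  # hypothetical column distinguishing debug from production runs


@dataclass
class InstanceStore:
    instances: List[WorkflowInstance] = field(default_factory=list)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def start(self, workflow_code: int, online: bool, debug: bool) -> WorkflowInstance:
        # Production runs still require the workflow to be online; debug runs do not.
        if not debug and not online:
            raise PermissionError("workflow must be online for a production run")
        mode = RunMode.DEBUG if debug else RunMode.PRODUCTION
        inst = WorkflowInstance(next(self._ids), workflow_code, mode)
        self.instances.append(inst)
        return inst

    def by_mode(self, mode: RunMode) -> List[WorkflowInstance]:
        """Instances of one mode, so debug runs stay out of production views."""
        return [i for i in self.instances if i.run_mode is mode]
```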

As a future enhancement, during debug executions, we can choose to use test-specific resources — for example, replacing the production database with a dev database in Hive data sources within the task scripts of the workflow, or switching the YARN queue to a dev queue.
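A minimal sketch of that future enhancement follows. It assumes a task's runtime settings can be modeled as a flat key/value map. The keys `hive.database` and `yarn.queue` and the function `resolve_runtime_config` are hypothetical. In a debug run, any key present in the dev overrides replaces the production value. A production run uses the task's own settings unchanged.

```python
from typing import Dict, Mapping


def resolve_runtime_config(task_config: Mapping[str, str],
                           dev_overrides: Mapping[str, str],
                           debug: bool) -> Dict[str, str]:
    """Return the settings a task should run with.

    Debug runs apply `dev_overrides` on top of the task's settings;
    production runs use the task's settings as-is. Inputs are never mutated.
    """
    resolved = dict(task_config)
    if debug:
        resolved.update(dev_overrides)
    return resolved
```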

Jun 16 '25 10:06 dill21yu

Yes. I would like to avoid repeatedly taking the workflow offline and back online during debugging, and to release it to production once debugging is complete.

In that case, I think we should introduce namespaces to isolate environments instead of modifying the existing workflows and tasks. Workflows and tasks are not the only things that need isolation: the data source center, resource center, queues, tenants, worker groups, permissions, and so on also need it. We should then provide a way to publish from one namespace to another.

@SbloodyS @det101

Resource isolation represents a complex project that likely exceeds the intrinsic capabilities of DolphinScheduler itself. As a workflow scheduling system, it may not be the optimal component for this responsibility, or resource isolation might not inherently fall within the core purview of a scheduler. DolphinScheduler should remain focused on its primary target scenarios.

Regarding the optimization for debugging scenarios mentioned in the Issue:

Disregarding resource isolation considerations, would providing a synchronized replication and overwrite functionality for workflows suffice?

Consider these practical deployment scenarios:

Create a production (prod) project containing production workflows.

Create a development (dev) project containing development workflows.

Upon successful validation of a development workflow, enable one-click synchronization to overwrite its corresponding workflow in the production project.

Alternatively:

Within the same project, create both a development (dev) workflow and a production (prod) workflow.

Upon successful validation of the development workflow, enable one-click synchronization to overwrite the production workflow within the same project.

(Note: The terms "production (prod)" and "development (dev)" here serve solely as environmental labels for identification purposes and do not imply resource isolation.)

Jun 16 '25 11:06 Gallardot

Disregarding resource isolation considerations, would providing a synchronized replication and overwrite functionality for workflows suffice?

Consider these practical deployment scenarios:

Create a production (prod) project containing production workflows.

Create a development (dev) project containing development workflows.

Upon successful validation of a development workflow, enable one-click synchronization to overwrite its corresponding workflow in the production project.

I think this is a simpler and more maintainable approach. We could also add cross-project synchronization to user permission management, so that access to this function can be granted selectively.
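The permission idea above could be checked roughly as follows. The permission key `workflow:sync-cross-project` and the names `User` and `can_sync` are hypothetical. The sketch also assumes the user can already read the source workflow. Under this sketch, syncing within one project needs only write access to the target project. Syncing across projects also requires the extra permission.

```python
from dataclasses import dataclass, field
from typing import Set

SYNC_CROSS_PROJECT = "workflow:sync-cross-project"  # hypothetical permission key


@dataclass
class User:
    name: str
    permissions: Set[str] = field(default_factory=set)
    writable_projects: Set[str] = field(default_factory=set)


def can_sync(user: User, source_project: str, target_project: str) -> bool:
    """Whether `user` may overwrite a workflow in `target_project` from `source_project`."""
    if target_project not in user.writable_projects:
        return False
    if source_project == target_project:
        # Same-project sync: write access to the project is enough.
        return True
    # Cross-project sync additionally needs the dedicated permission.
    return SYNC_CROSS_PROJECT in user.permissions
```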

Jun 16 '25 13:06 SbloodyS


@Gallardot @SbloodyS I think this is a good plan. I also want to confirm whether (dev) and (prod) workflows always come in pairs. Does a (dev) workflow mean one that can be debugged without being put into production? And compared with the current workflow, is a (prod) workflow simply a workflow with a label added?

Jun 17 '25 01:06 det101

I also want to confirm whether (dev) and (prod) workflows always come in pairs. Does a (dev) workflow mean one that can be debugged without being put into production? And compared with the current workflow, is a (prod) workflow simply a workflow with a label added?

What we would provide is simply a one-click function to copy workflows across projects. Users can name and use each project however they like, and workflows within a project stay the same as they are today.

Jun 17 '25 02:06 SbloodyS

I also want to confirm whether (dev) and (prod) workflows always come in pairs. Does a (dev) workflow mean one that can be debugged without being put into production? And compared with the current workflow, is a (prod) workflow simply a workflow with a label added?

What we would provide is simply a one-click function to copy workflows across projects. Users can name and use each project however they like, and workflows within a project stay the same as they are today.

If the (dev) workflow does not support quick debugging, I don't think having two workflows will be very useful.

Jun 17 '25 03:06 det101

If the (dev) workflow does not support quick debugging, I don't think having two workflows will be very useful.

DolphinScheduler is a workflow scheduling tool, not a code IDE. It provides workflow-level debugging, not debugging of the code inside tasks.

Jun 17 '25 03:06 SbloodyS

DolphinScheduler is a workflow scheduling tool, not a code IDE. It provides workflow-level debugging, not debugging of the code inside tasks.

Currently, the debugging process for workflows is somewhat cumbersome. It involves taking the workflow offline, editing the workflow and node content, putting the workflow back online, and only then running it to debug. Could we optimize the debugging process for the (dev) workflow? Specifically, we could streamline it so that a workflow can be debugged directly after editing, without the offline/online transitions. This would improve debugging efficiency. What are your thoughts?

Jun 17 '25 03:06 det101


To focus more on the problem to be solved, I have updated the problem analysis and the solution description. Please take a look. @SbloodyS @Gallardot @danielfree @dill21yu @ruanwenjun

Jul 03 '25 11:07 det101

Add a label field to workflows to carry additional information, such as whether a workflow is a development or a production workflow. Labels can also be user-defined.

My understanding is that a label is bound to a workflow and has nothing to do with tasks. Is that right?

Add an overwrite (replace) operation to the workflow actions: click it and select a target workflow that is not online (the target may be in another project). The operation supports comparing the differences between corresponding nodes of the two workflows, and the overwrite applies to the entire workflow.

When performing this overwrite, if the destination workflow does not exist, how should the new workflow be named, and what is the user interaction flow?

Please elaborate on what exactly the Compare function compares.

Support filtering workflows by label.

Maybe we should add a new search drop-down box instead of reusing the current filter box, and create a workflow label management menu to manage all workflow labels at the project level.

Development and production workflows are stored in separate tables with completely independent data.

Why not just add a label column to the existing table, since labels are user-defined and serve only to distinguish development from production environments?

Jul 11 '25 02:07 SbloodyS

My understanding is that a label is bound to a workflow and has nothing to do with tasks. Is that right?

That's correct.

When performing this overwrite, if the destination workflow does not exist, how should the new workflow be named, and what is the user interaction flow?

That's a great question. We could add a "Create New" option in the dropdown menu. This would allow users to quickly create a new workflow with a name based on the current workflow and a label like "Production."
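The "Create New" naming rule could look like the sketch below. It assumes workflow names must be unique within the target project. The `<source>_<label>` pattern and the function `new_workflow_name` are hypothetical choices, not an agreed convention. If the suggested name is already taken, a numeric suffix is appended until the name is free.

```python
from typing import Set


def new_workflow_name(source_name: str, label: str, existing: Set[str]) -> str:
    """Suggest '<source>_<label>', adding _2, _3, ... if the name is already taken."""
    base = f"{source_name}_{label.lower()}"
    if base not in existing:
        return base
    n = 2
    while f"{base}_{n}" in existing:
        n += 1
    return f"{base}_{n}"
```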

Maybe we should add a new search drop-down box instead of reusing the current filter box, and create a workflow label management menu to manage all workflow labels at the project level.

Adding a new dropdown menu for filtering is a good idea.
I just want to confirm that the label management you mentioned is to be added within the project management section, right?

Why not just add a label column to the existing table, since labels are user-defined and serve only to distinguish development from production environments?

The current design doesn't require creating a new table; just adding a field will do.

Please elaborate on what exactly the Compare function compares.

When the user selects the workflow to be overwritten, the nodes of both workflows are listed and, by default, paired by name; the user can then select a node on each side to compare their contents.

Ideally, the list would directly show which nodes have changed, but that may be somewhat complicated to implement (perhaps it could be raised as a separate issue). If there is a better approach, I would appreciate your guidance.
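A possible way to flag changed nodes directly in the list is sketched below. It assumes each workflow is represented as a map from node name to a comparable definition, with nodes paired by name as described above. `pair_nodes` and the status strings are hypothetical. Every name from either side is classified as `added` (only in the source), `removed` (only in the target, so it would be dropped by the overwrite), `changed`, or `unchanged`.

```python
from typing import Dict, List, Optional, Tuple

Task = Dict[str, object]  # simplified node definition


def pair_nodes(source: Dict[str, Task], target: Dict[str, Task]
               ) -> List[Tuple[str, Optional[Task], Optional[Task], str]]:
    """Pair nodes by name and classify each pair, sorted by node name."""
    result = []
    for name in sorted(source.keys() | target.keys()):
        s, t = source.get(name), target.get(name)
        if t is None:
            status = "added"      # exists only in the source workflow
        elif s is None:
            status = "removed"    # exists only in the target; the overwrite drops it
        elif s != t:
            status = "changed"
        else:
            status = "unchanged"
        result.append((name, s, t, status))
    return result
```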

Jul 11 '25 06:07 det101

Adding a new dropdown menu for filtering is a good idea. I just want to confirm that the label management you mentioned is to be added within the project management section, right?

Yes. This is just my personal opinion and needs further discussion. @Gallardot @ruanwenjun @zhongjiajie

When the user selects the workflow to be overwritten, the nodes of both workflows are listed and, by default, paired by name; the user can then select a node on each side to compare their contents.

Ideally, the list would directly show which nodes have changed, but that may be somewhat complicated to implement (perhaps it could be raised as a separate issue). If there is a better approach, I would appreciate your guidance.

I don't think we need to compare the contents of the two workflows. Simply creating or updating the target workflow definition is enough.

Jul 11 '25 06:07 SbloodyS

I don't think we need to compare the contents of the two workflows. Simply creating or updating the target workflow definition is enough.

OK.

Jul 14 '25 02:07 det101