
Argo Workflows template exceeds max size with preceding large foreach

Open saikonen opened this issue 1 year ago • 8 comments

The task IDs of the foreach steps preceding a join appear multiple times in the ARGO_TEMPLATE environment variable of the Argo Workflow init containers, which bloats the template size significantly. With wide foreaches the template exceeds the maximum size, leading to a broken flow.

Possibly a regression bug. Some discussion on the initial report here: https://outerbounds-community.slack.com/archives/C02116BBNTU/p1694529680541219
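As a rough illustration of the failure mode (this is a hypothetical sketch, not Metaflow internals): the join step's input-paths value enumerates every preceding task ID, and that value is materialised into ARGO_TEMPLATE more than once, so the template grows linearly in both the foreach width and the number of duplicated copies. The path format and duplication factor below are assumptions for illustration only.

```python
# Hypothetical sketch of why a wide foreach blows up the Argo template size.
# The join's input-paths value lists every preceding task id; if that value
# is duplicated in ARGO_TEMPLATE, the template size multiplies accordingly.

def input_paths(n_tasks: int) -> str:
    # Illustrative compact join input: "flow/run/step/:id1,id2,..."
    task_ids = ",".join(f"task-{i}" for i in range(n_tasks))
    return f"MyFlow/argo-run-1/process/:{task_ids}"

def template_size(n_tasks: int, copies: int) -> int:
    # 'copies' models how many times the value appears in the template.
    return copies * len(input_paths(n_tasks))

# A 2000-wide foreach with the value duplicated 4 times is 4x the size of
# a single, deduplicated copy.
single = template_size(2000, 1)
bloated = template_size(2000, 4)
```

Under these assumptions, deduplicating the value buys a constant factor of headroom but does not change the linear growth in foreach width, which matches the behaviour discussed below.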

saikonen avatar Sep 14 '23 15:09 saikonen

Hi, by any chance do you have an update about this one? 🙏

alexflorezr avatar Sep 27 '23 09:09 alexflorezr

What is the status on this issue?

tslott avatar Dec 01 '23 12:12 tslott

Sorry for the long delay; looking into a fix for this now.

saikonen avatar Dec 07 '23 20:12 saikonen

Opened a first attempt at remedying the duplication of the input-paths parameter in ARGO_TEMPLATE. It manages to remove the duplication, but it does not solve the core issue: Argo materialises the value of a Parameter into the template environment variable. In the meantime, removing the duplicates should raise the maximum number of foreach splits significantly; previously, flows were failing at joins of ~2k tasks.

As a future improvement that would solve the issue completely, I'm going to look into passing the input-paths through the datastore instead. This is a bigger overhaul in general, though, as it needs to work across cloud providers.

Alternative solutions and their shortcomings: I looked into changing the input-paths to work as an input Artifact instead of a Parameter. At first this was promising, especially since Artifacts support having their value set inline as raw data. This, however, behaves the same way as Parameters: the value gets materialised into the template environment variable.

Another option would have been to use a storage backend for the artifacts, for example S3. This requires extra configuration on the Argo infrastructure side, however, and complicates the setup unnecessarily. Setting up artifact storage might also not be possible for some deployments, which would completely break existing functionality.
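The datastore idea could look roughly like the following sketch: store the full input-paths string in a content-addressed store and put only the short, fixed-size key into the Argo template. The class and method names here are illustrative assumptions, not Metaflow's actual datastore API.

```python
# Hypothetical sketch of passing input-paths through a datastore: the template
# carries only a fixed-size digest key, regardless of the foreach width.
import hashlib
import os

class ContentAddressedStore:
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, value: str) -> str:
        # The key is a fixed-size digest, independent of the value's length.
        key = hashlib.sha256(value.encode()).hexdigest()
        with open(os.path.join(self.root, key), "w") as f:
            f.write(value)
        return key

    def get(self, key: str) -> str:
        with open(os.path.join(self.root, key)) as f:
            return f.read()
```

With this approach the template size stays constant (a 64-character key) no matter how many tasks feed the join, at the cost of an extra datastore round trip per task, which is why it has to work uniformly across cloud providers.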

saikonen avatar Dec 14 '23 19:12 saikonen

Thx for looking into the issue 👌

> In the mean time, the removal of the duplicates should bump the maximum number of foreach splits significantly, where previously the flows were failing at joins of ~2k tasks

I wonder what the new upper limit is? Just an estimate.

tslott avatar Dec 15 '23 10:12 tslott

The upper limit seems to be between 3500 and 4500 tasks with the changes: 3500 tasks pass but 4500 fail. This is a slight improvement over what was previously supported, but there are some concerns, as the implementation now reads directly from the ARGO_TEMPLATE environment variable.

  • the approach does not solve the scaling issues completely
  • there are proposals in the Argo project to offload the environment variable entirely, which would break this approach in the future
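Reading the value back out of ARGO_TEMPLATE could be sketched as follows: Argo injects the template as JSON into each container's environment, so the parameter can be parsed from it. The parameter name and nesting follow Argo's template schema, but this is an illustrative assumption, not the actual Metaflow implementation.

```python
# Hypothetical sketch of recovering input-paths from the ARGO_TEMPLATE
# environment variable that Argo injects into each container.
import json
import os

def input_paths_from_template(env=os.environ):
    # ARGO_TEMPLATE holds the template spec as a JSON string.
    template = json.loads(env.get("ARGO_TEMPLATE", "{}"))
    # "input-paths" is an illustrative parameter name.
    for param in template.get("inputs", {}).get("parameters", []):
        if param.get("name") == "input-paths":
            return param.get("value")
    return None
```

The fragility noted above follows directly from this: if Argo ever stops populating ARGO_TEMPLATE (as proposed upstream), any code parsing it breaks.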

saikonen avatar Dec 15 '23 14:12 saikonen

In case it is of interest, it looks like the original issue in Argo Workflows has now been fixed - https://github.com/argoproj/argo-workflows/pull/12325

roofurmston avatar Jan 31 '24 07:01 roofurmston

Hi, are there any plans to release this? I saw that there is a merged PR in Argo which should solve this issue. However, it was not included in the latest release and I am not sure whether they have any plan to release it soon 🤔

alexflorezr avatar Mar 04 '24 15:03 alexflorezr