3.5 ListWorkflows causes server to hang when there are lots of archived workflows
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with `:latest`
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
We had >200,000 rows in the workflow archive table, and when trying to view the new combined workflow/archived workflow list page in the UI, the server times out.
Scanning the code, it looks like the `LoadWorkflows` code loads all rows from the archive table, combines them with the k8s results, and then applies sorting and limiting.
As a workaround, we've reduced the archive TTL from 14 days to 1 day, and the endpoint now responds before timing out, but is still pretty slow.
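For illustration, the shape of the problem described above looks roughly like this (a paraphrased sketch, not the actual `LoadWorkflows` code; all types and names here are invented):

```go
package list

import "sort"

type wf struct {
	Name      string
	StartedAt int64 // unix seconds
}

// loadWorkflows paraphrases the reported behavior: every page request pulls
// the entire archive table plus every live workflow, merges and sorts them
// in memory, and only then applies offset/limit -- O(total) work per request
// just to return ~20 rows.
func loadWorkflows(archived, live []wf, offset, limit int) []wf {
	all := append(append([]wf{}, archived...), live...) // copy, then concatenate
	sort.Slice(all, func(i, j int) bool { return all[i].StartedAt > all[j].StartedAt })
	if offset >= len(all) {
		return nil
	}
	end := offset + limit
	if end > len(all) {
		end = len(all)
	}
	return all[offset:end]
}
```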
Version
v3.5.0
--- edits below by agilgur5 to add updates since this is a (very) popular issue ---
Updates
- Most of the performance regression part of this issue should have been solved by https://github.com/argoproj/argo-workflows/pull/12068 (which did re-instate a different bug: https://github.com/argoproj/argo-workflows/issues/11715), which was released in v3.5.1
- Another performance regression was fixed in #12912, which was released in v3.5.6
- Discussion continues below on other regressions and thoughts on the general merge of the Archived + Live UI in 3.5
- Please help test the new in-memory SQLite DB from #12736 and report your results/feedback here!
Thanks for this issue. This is a known issue if you have a lot of archived workflows. It's caused by the pagination method that first loads all live workflows and archived workflows and then performs pagination. cc @sunyeongchoi who worked on this in https://github.com/argoproj/argo-workflows/pull/11761.
Hello. I will test the issue as soon as possible and think about a solution. Thank you.
For posterity, this was actually discussed yesterday at the Contributors Meeting. @jmeridth had been looking into it as it is blocking his team from upgrading as well and had eventually traced it to this PR discussion: https://github.com/argoproj/argo-workflows/pull/11761#discussion_r1317888160.
(I was involved as I made substantial refactors to the UI for 3.5 -- #11891 in particular -- in case those were the cause, but the UI is actually unrelated in this case, and the refactor actually decreased the number of network requests. Also #11840 removed a default date filter, but that was entirely client-side anyway, so did not impact any networking.)
@sjhewitt thank you for filing this. I'm having the same issue. 30+ second page loads. I'm gathering data now and will post here once obtained.
@sunyeongchoi thanks for taking a look.
Can confirm this isn't related to the "past month" default on the UI being removed from the search box like @agilgur5 states. That is only client side.
There are some optimizations we can do, e.g. https://github.com/argoproj/argo-workflows/issues/12030.
However, this issue is more specific to argo-server/backend. The root cause is that the pagination method we implemented for `ListWorkflows()` requires retrieval of the entire list of workflows at once.
@sjhewitt @jmeridth Hello. I have a question about the circumstances under which this bug occurs. Does it happen as soon as you first enter the Workflows list page, or only when moving between pages?
@suryaoruganti every visit.
@jmeridth Ok! Thanks for your answer.
@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).
I propose that we add a dropdown to allow users to select whether to display:
- Only live workflows;
- Only archived workflows;
- Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);
Additional requirements:
- This dropdown box is only available if there are both live and archived workflows. The UI should be smart enough to figure this out.
- We should also consider the ability for an admin to disable the third option, to avoid degrading cluster performance.
- The "archived" column should only be displayed when appropriate. For example, it's evident that a workflow is archived if the user is only viewing archived workflows.
Motivations for this proposal:
- Some users only care about one of these types of workflows;
- Since there are performance issues that we cannot get around when viewing both types of workflows, we should only provide this option with caution;
- Keep using the original pagination implementation for live-only or archived-only workflows, where the logic is much more precise, while keeping the front-end codebase simple;
- The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.
Any thoughts?
Just reverted back to `v3.5.0-rc1` due to this. Perhaps we can add the default started time back?
I am reverting related change https://github.com/argoproj/argo-workflows/pull/12068 for now since otherwise the UI is not usable when there are many workflows. In the meantime, we can continue discussing proposal https://github.com/argoproj/argo-workflows/issues/12025#issuecomment-1774346636 here. We can probably release a patch version next week https://github.com/argoproj/argo-workflows/issues/11997#issuecomment-1775681491.
> Just reverted back to `v3.5.0-rc1` due to this. Perhaps we can add the default started time back?
If you're referring to #11840, it is mentioned above that it is unrelated to this issue, as that filter is entirely client-side (the k8s API server has no native way of doing date filtering, though that logic also pre-dates me and that PR)
> @sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).
What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).
Did the previous Archived Workflows page not have pagination? If so, I would think it would have been similarly susceptible to this, just not as frequently hit since it was a separate page.
> The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.
I feel like separate pages is a better UX than a drop-down. If the APIs are identical, some careful refactoring could make them share a UI implementation.
> > @sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).
>
> What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).
I'm similarly curious...
I wonder if it would be possible to use a cursor that encodes 2 offsets, one for the k8s API and one for the DB, then fetch `limit` rows from both sources at the given offsets, merge the results together, and apply the limit to the combined list.
something like:

```
orderBy = ...
filters = ...
limit = 20
cursor = (0, 0)

k8sResults = fetchK8s(cursor[0], limit, filters, orderBy)
dbResults = fetchDB(cursor[1], limit, filters, orderBy)
results = mergeResults(k8sResults, dbResults).slice(0, limit)

newK8sOffset = getLastK8sResult(results)
newDBOffset = getLastDbResult(results)
newCursor = (newK8sOffset, newDBOffset)
```
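For illustration, a Go sketch of what `mergeResults` could look like under this scheme (the types and field names are invented for the example):

```go
package list

import "sort"

// Row stands in for a workflow summary from either source.
type Row struct {
	Name      string
	StartedAt int64 // the shared sort key, as unix seconds
	Archived  bool
}

// mergeResults combines one page of k8s results with one page of archive DB
// results, deduplicates by name (a workflow may be both live and archived),
// sorts by the shared key, and truncates to the page size.
func mergeResults(k8sResults, dbResults []Row, limit int) []Row {
	seen := make(map[string]bool)
	merged := make([]Row, 0, len(k8sResults)+len(dbResults))
	for _, src := range [][]Row{k8sResults, dbResults} {
		for _, r := range src {
			if !seen[r.Name] {
				seen[r.Name] = true
				merged = append(merged, r)
			}
		}
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i].StartedAt < merged[j].StartedAt })
	if len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}
```

The new cursor would then record how far into each source the returned page reached, which is what `getLastK8sResult`/`getLastDbResult` stand for in the pseudocode above.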
@agilgur5 @sjhewitt
Hello. I kept thinking of a way to merge `Archived` and `Workflows` into one page instead of splitting them into two pages. However, two problems continue to get in the way.

First, `Archived` and `Workflows` may overlap with each other, so deduplication is necessary.

Second, we have no control over the pagination of Kubernetes resources (`Workflows`).

Before this issue occurred, the logic removed duplicates by fetching 20 `Archived Workflows` and 20 `Workflows` each time 20 items were needed on one page. However, in that case, some `Workflows` data may be missing when viewing the next page. This is because Kubernetes' own pagination does not allow us to decide which data to start the next page with.

This problem could be solved if Kubernetes' own pagination logic could be used as-is (Kubernetes code analysis is required). But if that's not possible, I don't know of any better way.

In conclusion: to know what data to start with on the next page, we cannot control this unless we know exactly the `continue` logic (for pagination) of Kubernetes and can implement it the same way, or retrieve the entire `Workflows` data.
Do you have any other good suggestions?
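For context on the `continue` logic mentioned above: the k8s API paginates with an opaque token that only the API server can produce, so a caller cannot re-anchor a page after merging with another source. A minimal client-go sketch (assuming the generated Argo Workflows clientset and an in-cluster config):

```go
package main

import (
	"context"
	"fmt"

	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	wfClient := wfclientset.NewForConfigOrDie(cfg)

	opts := metav1.ListOptions{Limit: 20}
	for {
		page, err := wfClient.ArgoprojV1alpha1().Workflows("argo").List(context.TODO(), opts)
		if err != nil {
			panic(err)
		}
		for _, w := range page.Items {
			fmt.Println(w.Name)
		}
		// The continue token is opaque: the API server alone decides where the
		// next page starts, which is why a merged list cannot steer it.
		if page.Continue == "" {
			break
		}
		opts.Continue = page.Continue
	}
}
```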
Ahh, I see. I didn't have much knowledge of the k8s API, so I didn't realize it doesn't really support filtering/ordering/pagination.
The 3 options I see are:
- separate the k8s backed api/ui from the db backed api/ui (reverting to previous behaviour)
- fetch the whole k8s data and merge it with a subset of the db data
- persist workflows in the db from the moment they are submitted, updating their state as they are scheduled/change state. Then (if the db archive is enabled) make the UI solely reliant on querying the db. In this case, the data could be augmented with data from the k8s API if the workflows are still running...
An idea suddenly came to mind while writing this comment.

I think this problem can be solved by knowing which data to start with on the next page. In that case, I could solve it using the `resourceVersion`-based pagination function that I implemented previously.

Existing: use the `cursorPaginationByResourceVersion` function after merging all `Archived Workflows` and all `Workflows` data.

Suggestion: if 20 items are needed on one page, fetch 20 `Archived Workflows`, fetch 20 `Workflows`, and then merge them. Use the `cursorPaginationByResourceVersion` function when fetching `Workflows` data for the next page.

However, even with this method, all `Workflows` will be fetched on every page (but not all `Archived Workflows`).
I think it will be more efficient than the previous method.
I'll listen to your opinions and if you think it's a good method, I'll try it and share it with you.
It is not only breaking the UI and making it unusable on v3.5.0; it is also crashing badly with OOM on a pod deployed with 3200 MB guaranteed memory, with a lot of archived workflows (PostgreSQL) and few in etcd (6-hour TTL on workflows, with working GC).

The issue is on the main view where all workflows are listed.

Also, as part of this pagination work, it would be useful to change the default time range shown. Currently it is one month, but a default of 1 or 2 days would probably be better, to lighten the argo-server list-workflows call. This, together with the "show archived" option you are discussing, would help a lot.
> Suggestion: if 20 items are needed on one page, fetch 20 `Archived Workflows`, fetch 20 `Workflows`, and then merge them. Use the `cursorPaginationByResourceVersion` function when fetching `Workflows` data for the next page.
@sunyeongchoi That might be a good optimization for the third option I listed in https://github.com/argoproj/argo-workflows/issues/12025#issuecomment-1774346636. It could help a lot when there are a lot more archived workflows vs live workflows. Although we still need to fetch the entire list of live workflows on all pages.
> Also, as part of this pagination work, it would be useful to change the default time range shown. Currently it is one month, but a default of 1 or 2 days would probably be better, to lighten the argo-server list-workflows call. This, together with the "show archived" option you are discussing, would help a lot.
@Guillermogsjc Thanks for the suggestion. However, we still need to make the same list query in the backend and then filter in the front-end.
With the unified workflow list API for both live + archived workflows we have to do the following:
- we need all workflows in the cluster, because the Kubernetes API server does not support sorting by creation timestamp, but argo-server does.
- we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we currently query everything from the archive is nonsensical.

Number 1 is a challenge because performing a LIST of all workflows for every request would be taxing on the K8s API server as users use the UI. Listing all workflows for the purpose of showing only 20 is extremely inefficient. To get around this, I propose a technique we use in Argo CD:
Since LIST all workflows is the requirement and also the problem, we can use an informer cache on Argo API server to avoid the additional LIST calls to k8s and have all workflows ready to be returned by API server from in-memory informer cache. When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.
Note that this does balloon the memory requirements of argo-server because of the informer cache. But I feel this is manageable with archive policy and gc. And the benefits of a unified live+archive API, as well as reducing load on K8s API server, outweigh the extra memory requirements of argo-server.
If we are worried about expensive processing / sorting of the results of List() calls we make against the informer cache, we could consider maintaining our own ordered workflow list (by creationTimestamp) that is automatically updated with UpdateFunc, AddFunc, DeleteFunc registered to the informer. But I consider this an optimization.
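To make the idea concrete, here is a minimal sketch of serving `List()` from an in-memory informer cache using client-go's dynamic informer (the wiring is illustrative, not the actual argo-server code):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(dyn, 10*time.Minute, "argo", nil)
	informer := factory.ForResource(gvr)

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// This List() is served from the in-memory cache, not the K8s API server;
	// the API server only sees the initial LIST and the ongoing WATCH.
	objs, err := informer.Lister().List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d live workflows in cache\n", len(objs))
}
```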
Would a daily partition label on etcd for the workflow objects be a mad idea?

```
workflows.argoproj.io/daily_index: {{workflow.creationTimestamp.%Y-%m-%d}}
```
Partitioning indexes to allow performant queries on database objects is often the way to go to avoid devastatingly monolithic queries, like the one you describe against the k8s API/etcd, which does not support sorting and limiting by creation timestamp.

With a daily partition label, the Argo server could query only the relevant partitions when filtering against the k8s API.

Anyway, the problem must lie in that monolithic query against the backend database: if it brings back everything instead of filtering, the load on the index page can be enormous.

In our case at least (following the documentation's recommended good practice of short TTLs for workflow objects), where we have tons and tons of archived workflows in PostgreSQL and only the last 6 hours of workflows in etcd, it is clear that the described full query against the database backend is the killer.
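If such a label existed (to be clear, the `daily_index` label above is hypothetical, not something Argo sets today), a date-bounded list could select only the touched partitions instead of everything, since label selectors are evaluated server-side. A sketch with the dynamic client:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)
	gvr := schema.GroupVersionResource{Group: "argoproj.io", Version: "v1alpha1", Resource: "workflows"}

	// Only the workflows in the selected day partitions are returned;
	// the daily_index label is hypothetical.
	list, err := dyn.Resource(gvr).Namespace("argo").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "workflows.argoproj.io/daily_index in (2023-10-23,2023-10-24)",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d workflows in the selected partitions\n", len(list.Items))
}
```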
Thank you so much to everyone who suggested good ideas.

First, I will start by optimizing `Archived Workflows`:

> we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we currently query everything from the archive is nonsensical.

After that, I will investigate the informer cache :)

> When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.
@sunyeongchoi Great! Let me know if you need any help.
Hello. While solving this issue, I encountered a new problem.

The direction we want is to paginate `Archived` and `Workflows` separately, then remove duplicates and merge them. However, in this case, when moving to another page, data that has already appeared on a previous page cannot be excluded, so duplicates may appear.

For example, say the `Archived` list is 3 items (a, b, c) and the `Workflows` list is 5 items (d, e, a, b, c), and one page requests 5 items.

Then the first page will show the 3 `Archived` items (a, b, c) and 2 `Workflows` (d, e). And when you click to the next page, you will see the three `Workflows` (a, b, c). As a result, a, b, and c are duplicated.
But I can't think of a good way to solve this problem, so I'm leaving a comment.
If anyone has any good ideas, please feel free to write them down.
> I propose that we add a dropdown to allow users to select whether to display:
>
> - Only live workflows;
> - Only archived workflows;
> - Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);
I think this is the best way.
However, instead of paginating `Archived` and `Workflows` separately in the third option, fetch all `Archived` and `Workflows` just once at first, combine them, and remove duplicates. Then store that data in a cache, and when going to the next page, use the data from the cache.

It is not yet clear whether storing it in a cache is possible, and I should also investigate the informer cache suggested above.
But if possible, I think this is the best way.
Are there any other good opinions?
> I think this is the best way.
I mentioned above that I feel a drop-down seems like clunkier UX than a separate page (the previous behavior) to me.
If this is the primary option we are considering, I would ask if I could take some time to look through it in-depth myself (I was not previously involved with this change). I've built lists with complex sorting, filtering, and pagination with efficient implementations in the past. If needed, the worst-case scenario is usually that you have to send the API a list of all IDs to exclude (e.g. duplicates).
> However, instead of paginating `Archived` and `Workflows` separately in the third option, fetch all `Archived` and `Workflows` just once at first, combine them, and remove duplicates.
Fetching all is pretty much never an option. There is always an amount at which it can overload a server and is also a big load spike.
Even with an Informer cache (which only handles in-cluster Workflows), it can cause spikiness when the Informer periodically rebuilds the cache. Spikiness is generally a behavior to avoid from an operations perspective (my day-to-day work these days is primarily SRE-focused), if possible. Informers are a bit more transparent than generic caches, but even then, they add a layer of complexity (i.e. more potential for bugs, of which we have quite a few, some due to Informer misconfigurations) to the Server that we should avoid if possible.
> Note that this does balloon the memory requirements of argo-server because of the informer cache.
I think it's worth noting that this balloons the memory requirements of each Argo Server Pod. If you have a standard, 3 replica HA set-up, your memory usage is going to go way up.
Options 1 and 2 are the natural, pre-v3.5.0 architecture; those should definitely exist to avoid a feature regression (which happened in v3.5.0, where you lost the ability to query either etcd or your backend DB on its own).

The best version of the 3rd option would skip deduplication and also the forbidden fetch-all, as sketched below.

There are two kinds of workflow status:

- ongoing: pending and running
- static: successful, failed, error

In a deployment where the archiving process is robust, there is no reason to query static-state workflows against etcd. Similarly, there are no ongoing workflows in the backend database.

Based on this, option 3 can just break the query into two disjoint sets:

- ongoing workflows -> etcd
- static workflows -> backend database

When the UI needs to show a set that contains both kinds, exhaust the ongoing list first, then go to the static list.

The only loss of ergonomics for the user is the case where static workflows are younger than ongoing ones, as those will be placed without honoring creation-timestamp order. But this situation is understandable, and the overall balance is positive.
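A rough Go sketch of that routing, with a page exhausting the ongoing (etcd) source before falling through to the static (database) source; every name here is invented, and the offset bookkeeping is simplified by assuming the ongoing total is known:

```go
package list

// fetcher abstracts one backend: etcd for ongoing workflows,
// the archive database for static (completed) ones.
type fetcher func(offset, limit int) []string

// listPage serves a page from the ongoing source first and only consults
// the static source once the ongoing list is exhausted.
func listPage(ongoing, static fetcher, ongoingTotal, offset, limit int) []string {
	var page []string
	if offset < ongoingTotal {
		page = ongoing(offset, limit)
	}
	if remaining := limit - len(page); remaining > 0 {
		// Translate the global offset into an offset within the static set.
		staticOffset := offset + len(page) - ongoingTotal
		if staticOffset < 0 {
			staticOffset = 0
		}
		page = append(page, static(staticOffset, remaining)...)
	}
	return page
}
```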
> For example, say the `Archived` list is 3 items (a, b, c) and the `Workflows` list is 5 items (d, e, a, b, c), and one page requests 5 items.
>
> Then the first page will show the 3 `Archived` items (a, b, c) and 2 `Workflows` (d, e). And when you click to the next page, you will see the three `Workflows` (a, b, c). As a result, a, b, and c are duplicated.
Rather than paginating using an offset, can we use a cursor instead?

E.g. when sorting by `startedat` ascending, with `Archived` workflows:

`((a, '2023-01-05'), (b, '2023-01-06'), (c, '2023-01-07'))`

and `Workflows` as:

`((d, '2023-01-01'), (e, '2023-01-02'), (a, '2023-01-05'), (b, '2023-01-06'), (c, '2023-01-07'))`

we can set the cursor to the last value in the list, i.e. 2023-01-07. Then when querying for the next page, we can query the database with `startedat > '2023-01-07' limit n`, and the `Workflows` can be sorted by `startedat`, filtered to values > 2023-01-07, and then `n` items taken from the top of the list.

We should probably include a second column in the sort/cursor (probably `name` or another unique identifier if there is one) so that when sorting on a column with lots of duplicate values (e.g. `phase`) we can be more deterministic about the order.
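On the archive side, such a compound cursor is straightforward keyset pagination; a hedged sketch with `database/sql` (the table and column names are illustrative, not a claim about the actual archive schema):

```go
package archive

import (
	"database/sql"
	"fmt"
	"time"
)

// nextPage fetches one page using a compound keyset cursor (startedat, name),
// so rows sharing a timestamp still have a deterministic order and the cursor
// never skips or repeats rows. PostgreSQL row-value comparison does the work.
func nextPage(db *sql.DB, afterTime time.Time, afterName string, limit int) error {
	rows, err := db.Query(`
		SELECT name, startedat
		FROM argo_archived_workflows      -- illustrative table name
		WHERE (startedat, name) > ($1, $2)
		ORDER BY startedat, name
		LIMIT $3`, afterTime, afterName, limit)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var (
			name    string
			started time.Time
		)
		if err := rows.Scan(&name, &started); err != nil {
			return err
		}
		fmt.Println(name, started) // the last row scanned becomes the next cursor
	}
	return rows.Err()
}
```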
> If this is the primary option we are considering, I would ask if I could take some time to look through it in-depth myself
Of course! If you do, I would be very grateful.
I heard that `Archived` and `Workflows` were separate pages before and were merged recently, because users should not need to know the difference between the two types of workflows. So I'm also considering the third option, showing `Archived` and `Workflows` together as much as possible.
> we can set the cursor to the last value in the list, i.e. 2023-01-07. Then when querying for the next page, we can query the database with `startedat > '2023-01-07' limit n`, and the `Workflows` can be sorted by `startedat`, filtered to values > 2023-01-07, and then `n` items taken from the top of the list.
Thanks for your idea and the good example. However, I think this way also creates a situation where duplicate values cannot be removed, as follows, unless sorting is performed after fetching everything.
First page (`Archived`):

`((a, '2023-01-04'), (b, '2023-01-05'), (c, '2023-01-06'), (d, '2023-01-07'), (e, '2023-01-08'))`

Second page (`Archived`):

`((f, '2023-01-01'), (g, '2023-01-02'), (h, '2023-01-03'))`

Second page (`Workflows`), values > 2023-01-03:

`((a, '2023-01-04'), (b, '2023-01-05'))`