argo-workflows
3.5 ListWorkflows causes server to hang when there are lots of archived workflows
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with `:latest`
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
We had more than 200,000 rows in the workflow archive table. When we try to view the new combined workflow/archived workflow list page in the UI, the server times out.
Scanning the code, it looks like the LoadWorkflows code loads all rows from the archive table, combines them with the k8s results, and then applies sorting and limiting.
As a workaround, we've reduced the archive TTL from 14 days to 1 day. The endpoint now responds before timing out, but it is still pretty slow.
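To illustrate why the pattern described above scales with the size of the archive table, here is a minimal sketch in Go. It is not the actual argo-workflows code: the `wf` type and the functions `mergeAllThenLimit` and `mergeTopN` are hypothetical names made up for this example, and it ignores de-duplication of workflows that exist both live and in the archive. The first function loads everything, sorts, then truncates. The second assumes each source can already return its rows newest first (for example via `ORDER BY ... LIMIT` on the database side), so only `limit` rows from each side ever need to be looked at.

```go
package main

import (
	"fmt"
	"sort"
)

// wf is a hypothetical, stripped-down workflow record for this sketch.
type wf struct {
	Name    string
	Started int64 // start time as Unix seconds
}

// mergeAllThenLimit mirrors the pattern described in the report:
// take every archived row, append the live ones, sort, then truncate.
// Cost grows with the total number of archived rows.
func mergeAllThenLimit(live, archived []wf, limit int) []wf {
	if limit < 0 {
		limit = 0
	}
	all := make([]wf, 0, len(live)+len(archived))
	all = append(all, live...)
	all = append(all, archived...)
	// Newest first.
	sort.Slice(all, func(i, j int) bool { return all[i].Started > all[j].Started })
	if len(all) > limit {
		all = all[:limit]
	}
	return all
}

// mergeTopN assumes both inputs are ALREADY sorted newest first
// (an assumption of this sketch). It walks both lists and stops
// as soon as limit items have been collected.
func mergeTopN(live, archived []wf, limit int) []wf {
	if limit < 0 {
		limit = 0
	}
	out := make([]wf, 0, limit)
	i, j := 0, 0
	for len(out) < limit && (i < len(live) || j < len(archived)) {
		takeLive := j >= len(archived) ||
			(i < len(live) && live[i].Started >= archived[j].Started)
		if takeLive {
			out = append(out, live[i])
			i++
		} else {
			out = append(out, archived[j])
			j++
		}
	}
	return out
}

func main() {
	// 200,000 synthetic archived rows, newest first, with distinct timestamps.
	archived := make([]wf, 200000)
	for k := range archived {
		archived[k] = wf{Name: fmt.Sprintf("archived-%d", k), Started: int64(1000000 - 2*k)}
	}
	// A few live workflows, also newest first.
	live := []wf{{"live-a", 1000001}, {"live-b", 999999}, {"live-c", 999001}}

	slow := mergeAllThenLimit(live, archived, 20)
	fast := mergeTopN(live, archived, 20)

	same := len(slow) == len(fast)
	for k := 0; same && k < len(slow); k++ {
		same = slow[k] == fast[k]
	}
	fmt.Println("page size:", len(fast), "results match:", same)
}
```

Both functions return the same page when timestamps are distinct, but the second one only touches about `limit` rows per source. In a real server the saving would come mostly from not fetching the untouched rows from the database in the first place.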
Version
v3.5.0
--- edits below by agilgur5 to add updates since this is a (very) popular issue ---
Updates
- Most of the performance regression part of this issue should have been solved by https://github.com/argoproj/argo-workflows/pull/12068, which was released in v3.5.1 (that PR did re-instate a different bug: https://github.com/argoproj/argo-workflows/issues/11715)
- Another performance regression was fixed in #12912, which was released in v3.5.6
- Discussion continues below on other regressions and thoughts on the general merge of the Archived + Live UI in 3.5
- Please help test the new in-memory SQLite DB from #12736 and report your results/feedback here!