3.5 ListWorkflows causes server to hang when there are lots of archived workflows

Open sjhewitt opened this issue 8 months ago • 111 comments

Pre-requisites

  • [X] I have double-checked my configuration
  • [X] I can confirm the issue exists when I tested with :latest
  • [ ] I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We had >200,000 rows in the workflow archive table. When we try to view the new combined workflow/archived workflow list page in the UI, the server times out.

Scanning the code, it looks like the LoadWorkflows code loads all rows from the archive table, combines them with the k8s results, and only then applies sorting and limiting.
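
To illustrate the pattern described above, here is a minimal Go sketch. It is not the actual Argo Workflows code: the `Workflow` type, its fields, and the functions `mergeThenLimit`, `topN`, and `limitThenMerge` are all hypothetical names made up for this example. It contrasts merging everything before sorting and limiting with limiting each source first (standing in for an `ORDER BY ... LIMIT` pushed down to the database) and merging only the small results.

```go
package main

import (
	"fmt"
	"sort"
)

// Workflow is a hypothetical, simplified stand-in for a workflow record.
type Workflow struct {
	Name     string
	Finished int64 // hypothetical sort key, e.g. Unix seconds
}

// sortAndLimit sorts newest-first in place and returns at most limit items.
func sortAndLimit(wfs []Workflow, limit int) []Workflow {
	if limit < 0 {
		limit = 0
	}
	sort.SliceStable(wfs, func(i, j int) bool { return wfs[i].Finished > wfs[j].Finished })
	if len(wfs) > limit {
		wfs = wfs[:limit]
	}
	return wfs
}

// mergeThenLimit mirrors the pattern described in the issue: every archived
// row and every live workflow is held in memory before sorting and limiting.
func mergeThenLimit(archived, live []Workflow, limit int) []Workflow {
	all := make([]Workflow, 0, len(archived)+len(live))
	all = append(all, archived...)
	all = append(all, live...)
	return sortAndLimit(all, limit)
}

// topN returns the newest limit items of one source without modifying it.
// In a real system this step would be done by the data store, not in memory.
func topN(src []Workflow, limit int) []Workflow {
	return sortAndLimit(append([]Workflow(nil), src...), limit)
}

// limitThenMerge asks each source for at most limit rows and merges only those,
// so the merge step handles at most 2*limit items regardless of archive size.
func limitThenMerge(archived, live []Workflow, limit int) []Workflow {
	return mergeThenLimit(topN(archived, limit), topN(live, limit), limit)
}

func main() {
	// Build a large fake archive with distinct timestamps.
	archived := make([]Workflow, 200000)
	for i := range archived {
		archived[i] = Workflow{Name: fmt.Sprintf("archived-%d", i), Finished: int64(i)}
	}
	live := []Workflow{{Name: "live-a", Finished: 300000}, {Name: "live-b", Finished: 150000}}

	for _, wf := range limitThenMerge(archived, live, 5) {
		fmt.Println(wf.Name, wf.Finished)
	}
}
```

With distinct sort keys, both functions return the same page; the difference is how many rows have to be loaded and sorted to produce it. This sketch only covers the first page, and offsets or cursors would need more care.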

As a workaround, we've reduced the archive TTL from 14 days to 1 day. The endpoint now responds before timing out, but it is still pretty slow.

Version

v3.5.0

--- edits below by agilgur5 to add updates since this is a (very) popular issue ---

Updates

  • Most of the performance regression part of this issue should have been solved by https://github.com/argoproj/argo-workflows/pull/12068, which was released in v3.5.1 (that PR did re-instate a different bug: https://github.com/argoproj/argo-workflows/issues/11715)
  • Another performance regression was fixed in #12912, which was released in v3.5.6
  • Discussion continues below on other regressions and thoughts on the general merge of the Archived + Live UI in 3.5
    • Please help test the new in-memory SQLite DB from #12736 and report your results/feedback here!

sjhewitt, Oct 17 '23 23:10