Nomad UI painfully slow when job count goes from hundreds to thousands

Open djenriquez opened this issue 2 years ago • 4 comments

Nomad version

Output from nomad version: Nomad v1.3.3 (428b2cd8014c48ee9eae23f02712b7219da16d30)

Operating system and Environment details

Amazon Linux release 2 (Karoo)

Issue

We have a particular use case where Nomad is used to orchestrate full sandboxes for our developers in our development environment. These sandboxes represent our complete stack of services, which means ~100 jobs, including periodic batch jobs.

The higher the total number of jobs, the slower the Nomad UI becomes. Initially, we thought the Nomad servers might be struggling with the sheer amount of work, but that is not the case. At one point, Nomad's core handled over 10,000 jobs with roughly 50,000 allocations just fine. RPC calls through its API were responsive, and the metrics we track showed no struggle whatsoever.

However, the UI was a different story: it would sit on the Nomad loader graphic for a period of time that seemed to grow linearly with the number of jobs being run. Interestingly, the API requests the UI made to the Nomad servers were responsive according to Chrome DevTools, which supports the conclusion that the backend is not the issue.

Also, when looking at the waterfall chart from Chrome DevTools, we see a call to /v1/namespaces?index=1 that is eventually canceled by the browser. Not sure if this request is misleading, but the page renders once that request pops up in the network analyzer, so it seems there is some blocking call at that point in the flow.

[Screenshot: Screen Shot 2022-10-03 at 1 34 34 PM]

Reproduction steps

Spin up at least 1000 jobs with ~3000 allocations, then navigate to the UI.

Expected Result

UI load time grows proportionately with the API response time for requests made to the Nomad server.

Actual Result

UI load time degrades as more jobs and allocations are running on the Nomad cluster while the API responds performantly.

We're open to scheduling a remote session if that makes it easier to see the issue.

djenriquez avatar Oct 03 '22 20:10 djenriquez

Hi @djenriquez, thanks for raising this — we'll take a look and update this once we have more info.

philrenaud avatar Oct 03 '22 21:10 philrenaud

Hey @djenriquez! Nice to meet you. We're super grateful that you raised this issue and it looks like the Nomad Community at large is also noticing this problem.

We're noticing that the issue may be the result of JavaScript Promises on the /jobs and /jobs/:jobId views starving the event loop. We investigated the issue along with possible solutions, and we have 2 commits that you can pull down:

For the /jobs/:jobId (The Job Detail Overview page) we're very confident that this commit will resolve that problem.

But for the /jobs (The Main Jobs List page) we tried to implement our pagination logic. There will be some regressions because we're mixing server and client-side filtering and sorting now. You can try out this commit.
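
For readers following along, here is a minimal sketch of one general way to keep a long loop from starving the event loop: process the items in fixed-size batches and yield to a macrotask between batches so rendering and input handling can run. This is not the code in the commits above. The names `chunk`, `yieldToEventLoop`, and `processInBatches`, and the batch sizes, are hypothetical and chosen only for illustration.

```typescript
// Hypothetical helper: split a list into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) {
    throw new Error("size must be positive");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Resolve on the next macrotask, which gives the event loop a chance
// to handle rendering and user input before the next batch starts.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

// Hypothetical helper: apply `work` to every item, pausing between batches.
async function processInBatches<T, R>(
  items: T[],
  size: number,
  work: (item: T) => R
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, size)) {
    for (const item of batch) {
      results.push(work(item));
    }
    await yieldToEventLoop();
  }
  return results;
}
```

With thousands of job records, a call such as `processInBatches(jobs, 200, normalize)` (where `normalize` stands in for whatever per-record work the view does) would trade a little total time for a page that stays responsive while loading.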

We're very excited to work with you to find the right solution and we welcome any and all feedback about how you're searching and filtering for jobs (along with any feedback about the Nomad UI). We're in the process of planning a lot of great new features for the UI and we're eager to solve any big challenges or even small "papercuts" that you're experiencing.

I'll be heading out on vacation soon, but I'll try my best to be responsive today and tomorrow on this issue and revisit this when I return. Looking forward to hearing from you!

Life is so rich, Jai

ChaiWithJai avatar Oct 27 '22 18:10 ChaiWithJai

Hi @ChaiWithJai, thanks so much for providing these commits. I'll go ahead and see how I might be able to plug this into our current system and verify its results. It will likely be next week before I can provide results, however.

djenriquez avatar Oct 27 '22 21:10 djenriquez

Hey @djenriquez! I'm back in the office and wanted to circle back up with you. Were you able to try these commits out?

ChaiWithJai avatar Nov 11 '22 15:11 ChaiWithJai

Hi @ChaiWithJai I realize I dropped the ball on checking back on this issue. Are we able to reconvene?

djenriquez avatar May 03 '23 18:05 djenriquez

Greetings! Is there any update on the fix? The UI slows to a halt whenever there are more than a thousand jobs (including dead jobs) in the cluster.

jhyx2022 avatar May 22 '23 22:05 jhyx2022

Looks like there's a PR: https://github.com/hashicorp/nomad/pull/14989. I'm looking to test this out against 1.5.3 and just need quick confirmation on compatibility; see https://github.com/hashicorp/nomad/pull/14989#issuecomment-1558185422.

djenriquez avatar May 22 '23 23:05 djenriquez

Dropping a note to say that this is something we intend to prioritize soon; see https://github.com/hashicorp/nomad/pull/14989#issuecomment-1563008793 for a little more context.

philrenaud avatar May 25 '23 14:05 philrenaud

Hi there, is there an update on the fix yet or expected version for the fix? Thanks!

jhyx2022 avatar Jan 24 '24 04:01 jhyx2022

@jhyx2022 Serendipitous timing! We've been developing a new endpoint to complement /jobs that should make things a lot snappier. You can follow along with a few of the issues:

  • https://github.com/hashicorp/nomad/issues/19339
  • https://github.com/hashicorp/nomad/issues/19806

These should have the effect of a more limited initial pull of jobs on the main index in the UI. There'll still be the ability to paginate, search, and filter your list down, but those functions will no longer be front-end dependent.
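
To illustrate the idea of moving pagination and filtering off the front end, here is a small in-memory sketch of token-based paging. It only simulates what a server might do and is not the new endpoint or Nomad's actual API. The `Job` and `Page` shapes, the `listJobsPage` function, and its parameters are all hypothetical.

```typescript
// Hypothetical shapes, for illustration only.
interface Job {
  id: string;
  status: string;
}

interface Page {
  jobs: Job[];
  nextToken: string | null; // id of the first job on the next page, or null
}

// Simulates a server handler: filter, sort by id, then return one page
// starting at `nextToken` along with the token for the following page.
function listJobsPage(
  allJobs: Job[],
  perPage: number,
  nextToken: string | null = null,
  status?: string
): Page {
  if (perPage <= 0) {
    throw new Error("perPage must be positive");
  }
  // filter() returns a new array, so sort() does not mutate the input.
  const sorted = allJobs
    .filter((j) => status === undefined || j.status === status)
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));

  // Find the first job whose id is at or after the token.
  let start = 0;
  if (nextToken !== null) {
    start = sorted.length;
    for (let i = 0; i < sorted.length; i++) {
      if (sorted[i].id >= nextToken) {
        start = i;
        break;
      }
    }
  }

  const end = start + perPage;
  return {
    jobs: sorted.slice(start, end),
    nextToken: end < sorted.length ? sorted[end].id : null,
  };
}
```

A client would request the first page with no token, render it, and pass the returned `nextToken` back only when the user asks for more, so it never holds the full job list in memory.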

philrenaud avatar Jan 24 '24 14:01 philrenaud

Great news, appreciate the update!

jhyx2022 avatar Jan 24 '24 23:01 jhyx2022

Thanks to everyone for your patience on this issue. Pleased to say that https://github.com/hashicorp/nomad/pull/20452 is now merged and will be releasing in the upcoming Nomad 1.8. Among other things, it handles pagination for the jobs index and doesn't overload itself with child jobs that eat up memory at index level. I hope that this makes the overall experience of using the web UI much smoother!

philrenaud avatar May 09 '24 14:05 philrenaud