drt: consider having long-running sessions with some simple workload

Open yuzefovich opened this issue 1 year ago • 9 comments

In order for us to reproduce and catch some bugs (e.g. #121844), it might require to have sessions that are running for weeks and only close when the nodes restart. We cannot really replicate such a scenario in either CI or roachtests, and the DRT cluster seems like a perfect fit. We should consider introducing a simple workload that would run continuously or periodically on such long-running sessions.

Jira issue: CRDB-37733

yuzefovich avatar Apr 10 '24 20:04 yuzefovich

[triage] Does the tpcc workload already do this for us? Do we run it long enough on the DRT cluster? Should the DRT team look into this overall?

[michae2] Concerned that spot instances, which go down frequently, would kill long-lived sessions and so prevent us from implementing a workload like the one this issue describes.

cc @BabuSrithar @srosenberg

yuzefovich avatar Jul 16 '24 18:07 yuzefovich

We observed on one of the CC clusters that a session that issued 3M txns has "memory usage" reported as around 400MiB.

For the regression test, would it suffice to assert on the memory usage of each long-lived session? Using crdb_internal.node_memory_monitors for observability?
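A minimal sketch of what such a per-session assertion could look like, assuming the per-session byte counts have already been collected (for example from crdb_internal.node_memory_monitors, whose schema is not shown in this thread). The function name, the session IDs, and the 64 MiB budget are hypothetical and only illustrate the check:

```python
MIB = 1024 * 1024

def sessions_over_budget(usage_bytes, budget_bytes):
    """Return the IDs of sessions whose reported memory usage exceeds the
    budget, ordered from the largest usage to the smallest."""
    offenders = [(sid, used) for sid, used in usage_bytes.items()
                 if used > budget_bytes]
    offenders.sort(key=lambda pair: pair[1], reverse=True)
    return [sid for sid, _ in offenders]

# Hypothetical sample: session "a" resembles the ~400 MiB session
# mentioned above; the budget is an assumed value, not one from the thread.
sample = {"a": 400 * MIB, "b": 12 * MIB, "c": 3 * MIB}
leaky = sessions_over_budget(sample, budget_bytes=64 * MIB)
assert leaky == ["a"], leaky
```

A regression test could fail whenever this list is non-empty; choosing a sensible budget is left open here.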

srosenberg avatar Jul 16 '24 20:07 srosenberg

For regression test for that particular bug, yes, we could do that. This issue is more general - about having very long-lived sessions in our tests somewhere since some of our customers never close connections (unless the nodes are restarted).

yuzefovich avatar Jul 17 '24 19:07 yuzefovich

> For regression test for that particular bug, yes, we could do that. This issue is more general - about having very long-lived sessions in our tests somewhere since some of our customers never close connections (unless the nodes are restarted).

Yep, it makes sense! I was just confirming the general test strategy. This is a great candidate for long-running clusters. What about perturbations? A long-running cluster will inevitably experience external (and internal) failures; this would make it trickier to keep persistent sessions. (I assume a disconnect would be a deal breaker; i.e., would it resolve the memory leak in this case?)

srosenberg avatar Jul 18 '24 01:07 srosenberg

Totally, I don't expect any kind of guarantee on the lifetime of these sessions, rather it'd be a good addition to our test suite to have sessions that are as long-lived as possible, on a best-effort basis. Any node restart or network hiccups should be ignored. My hope is that even best-effort with no guarantees would be sufficient to tickle some existing bugs / prevent new regressions.
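The best-effort behaviour described here can be sketched roughly as follows. This is an illustration only, not an existing workload: `connect` and `run_query` are hypothetical callables supplied by the caller, and the built-in ConnectionError stands in for whatever error a real SQL driver raises on a node restart or network hiccup.

```python
import time

def run_best_effort_session(connect, run_query, iterations,
                            pause_s=0.0, sleep=time.sleep):
    """Keep one session open for as long as possible and reconnect only
    after a failure.

    connect:   hypothetical callable that returns a new session object
    run_query: hypothetical callable that runs one simple statement
    Returns how many queries each successive session served.
    """
    lifetimes = []   # queries completed per session, in order of creation
    session = None
    for _ in range(iterations):
        if session is None:
            try:
                session = connect()
            except ConnectionError:
                sleep(pause_s)   # node may be restarting; retry next round
                continue
            lifetimes.append(0)
        try:
            run_query(session)
            lifetimes[-1] += 1
        except ConnectionError:
            session = None       # session lost: ignore it and reconnect
        sleep(pause_s)
    return lifetimes
```

The session is never closed deliberately, so its lifetime is bounded only by failures, which matches the "as long-lived as possible, no guarantees" goal.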

yuzefovich avatar Jul 18 '24 01:07 yuzefovich

Not sure why this was added back to SQL Queries. Removing it from our board since it looks like the DRP team is working on it.

rytaft avatar Aug 20 '24 17:08 rytaft

cockroach workload has a --max-conn-lifetime flag (e.g. --max-conn-lifetime=1h) which can be used to increase the connection pool active time; by default it is 5m. This flag governs how long a session stays active. We can adjust this time for all workloads running on drt-large.

@rytaft What would be the ideal session lifetime? Should we increase it to something like 12h, or do you want more than that, like 24h or 72h?

csgourav avatar Aug 28 '24 06:08 csgourav

Hi @csgourav -- the issue description says "it might require to have sessions that are running for weeks and only close when the nodes restart." So ideally we'd want the sessions to be running as long as possible. Thank you!

rytaft avatar Aug 28 '24 15:08 rytaft

cc @cockroachdb/test-eng

blathers-crl[bot] avatar Aug 29 '24 02:08 blathers-crl[bot]

@rytaft , we can change the workload configuration with --max-conn-lifetime=168h, which is 1 week.
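For reference, the arithmetic behind 168h (7 days x 24 hours) and its multi-week equivalents can be sketched as below. The helper name is hypothetical, and the sketch assumes the flag takes an hours-based duration string as in the values quoted in this thread; it does not verify that the workload accepts such large values.

```python
def max_conn_lifetime_flag(weeks):
    """Build a --max-conn-lifetime flag value for a whole number of weeks,
    expressed in hours as in the examples quoted in this thread."""
    hours = weeks * 7 * 24
    return f"--max-conn-lifetime={hours}h"

print(max_conn_lifetime_flag(1))  # --max-conn-lifetime=168h
print(max_conn_lifetime_flag(4))  # --max-conn-lifetime=672h
```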

nameisbhaskar avatar May 20 '25 08:05 nameisbhaskar

> @rytaft , we can change the workload configuration with --max-conn-lifetime=168h, which is 1 week.

@nameisbhaskar that sounds good, although as noted in the issue description, testing for connections lasting multiple weeks would be ideal. But one week is a great start!

rytaft avatar May 23 '25 14:05 rytaft

cc @cockroachdb/test-eng

blathers-crl[bot] avatar Sep 16 '25 02:09 blathers-crl[bot]