cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

roachtest: add roachtest.Operation and run-operation command

Open itsbilal opened this issue 1 year ago • 8 comments

This change adds the ability to write roachtest Operations that, unlike a full-on roachtest.Test, do not require an ephemeral cluster to be spun up or down. These operations are expected to have as few logical side effects as possible and can be run on a cluster with running workloads. Running an Operation using roachtest run-operation also guarantees that the Cockroach/Workload binaries on that node will not be swapped with local ones, and that the cluster won't be wiped unintentionally at the end or in case of error.

This change also adds add-index and add-column as two example operations that operate in SQL land and demonstrate the purpose of an Operation.

Epic: none.

Release note: None.

itsbilal avatar Jan 26 '24 20:01 itsbilal

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Jan 26 '24 20:01 blathers-crl[bot]

This change is Reviewable

cockroach-teamcity avatar Jan 26 '24 20:01 cockroach-teamcity

@herkolategan / whoever from test-eng reviews this, happy to first take pointers on whether the general approach here is sound. Over the next bit I'll be cleaning up some of this code so there might be some rough edges if you do a full review right now, but I definitely want to first know if y'all are on board with this approach. Also, it works!

itsbilal avatar Jan 26 '24 20:01 itsbilal

I'm missing a bit of context, I believe, being outside DRT: what is the envisioned way these operations would be called? Directly by name? A random sample? Right now they run concurrently -- what if they are incompatible?

If we are just not handling these situations for the time being, that's fine too, but it's not clear what's in and out of scope.

I haven't looked at the code in detail, but enough to get the general picture. A couple of initial/high level comments:

  • I like the notion of "operation", and it's something we have discussed in various forms and in different times/contexts too. I also like that this has a "cleanup" counterpart.

  • It would be a very nice feature if operations could be trivially called from within a roachtest. Added bonus if there's an easy way to pick a random operation as well.

  • Do we need operation to be a completely different concept in roachtest? From my understanding so far, an operation is a "test" that runs on a cluster that has been previously provisioned and where cockroach is running. Everything else seems mostly the same. Given the very large amount of code being duplicated in the current implementation, I'm very worried about maintenance cost.

Have you thought about or explored an implementation where an operation is a regular "test", appropriately marked maybe with something like TestSpec.Operation, and adapting the existing test runner to skip the cluster provisioning part when running operations?

renatolabs avatar Jan 30 '24 21:01 renatolabs

Thanks for the high-level review Renato, that's exactly what I needed!

It would be a very nice feature if operations could be trivially called from within a roachtest. Added bonus if there's an easy way to pick a random operation as well.

Yes, those are all great points - I do want Operation to be abstract enough to be easily callable from a Roachtest to reduce code duplication down the line. I do foresee that as one of the three ways to call an Operation, the other two being roachtest run-operation (implemented here), and the third being called through the yet-to-be-built workload-manager or drt-run that will randomly run multiple operations on an existing cluster that has some long running workloads running on it. I expect run-operation to mostly be used just to test out an operation. The workload-manager pathway is the one that DRT will use the most extensively.

Do we need operation to be a completely different concept in roachtest? From my understanding so far, an operation is a "test" that runs on a cluster that has been previously provisioned and where cockroach is running. Everything else seems mostly the same. Given the very large amount of code being duplicated in the current implementation, I'm very worried about maintenance cost.

This is a good point, and my thought process on it was that Operations will (over time) diverge greatly from roachtests, in their spec and in the ways they're called. The test-runner stuff is the bulk of duplicated code at the moment and I'm not even sure if it'll get much use outside of the aforementioned run-operation command. Some examples of ways in which Operation specs will diverge from Test specs:

  1. Operations could be passed things like the cockroach start commands to use, DBs already initialized (eg. if they do something specific to a db eg. drop and reimport TPCC), etc.
  2. Operations could specify dependencies - they need a node to have a live cockroach process running on it, all the way to having a db populated with data.
  3. Operations can be interleaved or run concurrently on the same cluster - and by interleaved I mean that we can add a column, add an index, stall a node, and then run the cleanups of them in some order, resulting in a greater explosion in combinatorial state.

Does this make sense? Let me know if you want me to simplify the code and reduce duplication; y'all are code owners of this area so I'm happy to follow your lead on this. I can explore having Operation specs that remain separate but reusing more of the test runner code than we currently do, to try to reduce code duplication there.

itsbilal avatar Jan 31 '24 15:01 itsbilal

the third being called through the yet-to-be-built workload-manager or drt-run

I expect run-operation to mostly be used just to test out an operation

This brings back the first question I asked (what is the envisioned way these operations would be called?); it's still not clear to me, and the sentences above made me more confused. I initially thought runOperations would have that logic, as it runs a set of operations currently. But it seems there would be another component also scheduling ops.

If there's notes about the intended design or vision for how every piece would work, I think that would really help understanding the bigger picture.

Some examples of ways in which Operation specs will diverge from Test specs:

It's fine for operations to be different from regular tests in those ways, but IMO that does not justify an implementation where we introduce so much duplication of non-trivial code (~1,000 lines of code that is mostly the same). As it is, it's hard to tell (unless you're very familiar with roachtest) what's needed for an operation to run, vs. what was just copied from the regular test runner. By skimming the code, I can see lots of logic that is unnecessary in the context of operations. I don't see a sustainable way to keep these implementations "synchronized".

If this implementation was merged mostly as-is, what are we supposed to do as we make improvements to the test runner or test interfaces? A: try to make the equivalent change in the operations counterpart? or B: leave the operations counterpart alone as "it's mostly used to test an operation"? I don't think either of these options would be acceptable: A involves a lot of manual work from the owners (I assume Test Eng would eventually inherit ownership), whereas B would lead to frustration as teams would notice that tests and operations look the same, but don't behave the same.

My point boils down to a wish that the "lines added" part of this diff were all dealing with what makes an operation different from a test (how it can take parameters, how we need to evaluate dependencies, etc), and not incidental complexity of duplicating large interfaces.

renatolabs avatar Jan 31 '24 19:01 renatolabs

This brings back the first question I asked (what is the envisioned way these operations would be called?); it's still not clear to me, and the sentences above made me more confused. I initially thought runOperations would have that logic, as it runs a set of operations currently. But it seems there would be another component also scheduling ops.

That's a good point and something that is not obvious given it was really only discussed within the DRT team; that's an oversight on my part and I should have clarified it here. The idea is that run-operation is just to test out an operation like I said, and most scheduling/running of operations will happen in a different binary (drt-run likely) that will also run long-running workloads in the same place. OperationSpec should already be designed with that latter use-case in mind, and you make a good point about the bulk of testRunner changes being orthogonal to it and are really only relevant for run-operation.

I've significantly slimmed this PR and we now convert (in a best-effort way) OperationSpec to a TestSpec for use with the test runner. I expect the drt-run command to not rely on this conversion and to directly schedule Operations, and any Operation-specific complexity that isn't related to Tests will end up staying in OperationSpec and will only be used by drt-run. My main worry continues to be that at some point this conversion will become too cumbersome and/or result in very different behaviour when an operation is run through the test runner vs. through drt-run, but that's a worry I can defer for now to get this going and to get more operations on the way. One good thing about this new approach is that running operations as sub-tests within a roachtest is a little easier.

This should be in line with your expectations and should be much easier for TestEng to own. My understanding is that, as far as ownership goes, the long-term DRT team will own drt-run, individual operations will be owned by the relevant team writing these operations, and Test Eng will likely own just the run-operation command.

itsbilal avatar Feb 07 '24 00:02 itsbilal

@renatolabs / test-eng: friendly ping for a review, thanks! no rush but would be good to get other teams to start adding operations soon so the OperationSpec is probably the most important thing to look at.

itsbilal avatar Feb 13 '24 19:02 itsbilal

The sprawling changes to test_runner (and other roachtest infrastructure) introduced by this PR are a bit too sprawling for our comfort :) We'd prefer a cleaner approach, one that ideally doesn't require (non-trivial) changes to the test_runner. I suggest we backtrack a little and try to nail down the simplest API for implementing and executing operations.

srosenberg avatar Mar 05 '24 04:03 srosenberg

Thanks for the reviews everyone! I had a meeting with Renato today and based on our conversation + comments in this PR, I've greatly simplified this and split it into 3 PRs, #119796, #120023 and #120024, each of which builds on top of the one before. I'll close this PR for now as it's superseded by those in combination. Thanks a ton again; I know this has been a lot of effort for everyone involved and I really appreciate it all.

itsbilal avatar Mar 06 '24 23:03 itsbilal