aptos-core
aptos-core copied to clipboard
[api] Add REST API to modify failpoints during execution
Description
Adding failpoints REST API (/failpoints?name=str&actions=str), to allow modifying failpoints during execution. This will allow more targeted/surgical testing - both within smoke tests as well as forge/cluster tests
Test Plan
Used in new smoke tests in the stacked diffs.
As far as I know, @banool is working on generating OpenAPI spec from code. This might get such endpoints populated in the spec and confuse developers.
Do we have other ways to config this on the fly besides relying on the Rest API?
I think having this as part of public api is confusing too
I feel like it makes more sense to configure this via runtime arguments (the config / CLI). Do you expressly need to modify failpoints during a single process lifetime?
I am fine to change this to a different alternative.
But I do need to change it multiple times, dynamically, during process execution. Looking at smoke test setup I only have REST API as the client to interact with it :) so open to alternatives, if anyone has a suggestion?
I can also make this fully behind failpoints flag, so with that - it shouldn't affect generating public docs?
With the API changes I'm making we can conditionally include certain groups of endpoints in the API. This should definitely be an opt in feature, we don't want it to be possible for random people to make a node's endpoints start failing. But I'm not quite ready with that stuff yet. How urgently do you need this feature?
@banool - how about this ? if you don't compile with failpoints feature, no new endpoints will be added?
We don't really configure much via features today for the API, all configuration is done via a config file. You can imagine that having different permutations of features for our node would result in an explosion in the number of docker images we build (though we could of course just not do that). Either way, I think having another flag in our config file would be more standard.
So to confirm, you are fine with conditionally including this into REST API?
failpoints library requires features enabled, so from that perspective - even if there was config, if it wasn't compiled with appropriate feature - the rest of the code wouldn't work. So I would leave it as is, and if me move from failpoints to something custom, change from feature to config to enable it. sounds good?
Yeah my stance is:
- Configuring via the API is okay given you need to be able to change these failpoints within a single process lifetime.
- The endpoints should only be included if an opt-in flag is set in the config.
This second point is possible with the new API I'm making, not sure about the previous one (but probably also).
Agreed with Igor. The API crate already supports the failpoints feature, we need to include the endpoint only when failpoints feature is enabled. Once this is in place, the opt-in flag is not needed for the endpoint, I guess.
Okay fair enough, we can do that.
I'm with @banool, I think features are great until you realize you accidentally deployed a binary with a feature enabled and the only way to disable that feature is by providing a new binary. I'm good with this PR if we can make exposing this configurable via a config file and not just a compilation flag.
Wow a lot of discussion. Yes, this has to be configured by config and not API, otherwise you could have some unfortunate mistakes without a way to quickly fix it
That's not quite what the latest discussion has been about, we've been discussing whether to enable the failpoints API at all via either a feature or a config flag.
I think we agree that dynamically setting failpoints with the API is reasonable.
Added node_config.api.failpoints_enabled that is false by default, that needs to be enabled to allow setting failpoints through that endpoint. Let me know if you have any more concerns.
:white_check_mark: Forge test success on b108e7421ec67cb38d4011640497c757fde2570f
all up : 4844 TPS, 2282 ms latency, 3000 ms p99 latency,no expired txns
- Grafana dashboard
- Validator 0 logs
- Humio Logs
- Test runner output
- Test run 1 is land-blocking
Heads up: I'll add this to the new API too. Should be transparent from your side.