system-tests
system-tests copied to clipboard
Fix interaction between interface timeout and RC POLL INTERVAL + Partial revert #2805
Motivation
Being as close as possible from real conditions is a golden rules in functional testing.
The time gained by the change on DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS in #2805 wasn't computed.
Changes
- [x] Revert #2805 on the change on
DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS. - [x] fix the bug on remote config race test
- [x] Remove unecessary library timeout
Workflow
- ⚠️ Create your PR as draft ⚠️
- Work on you PR until the CI passes (if something not related to your task is failing, you can ignore it)
- Mark it as ready for review
- Test logic is modified? -> Get a review from RFC owner. We're working on refining the
codeownersfile quickly. - Framework is modified, or non obvious usage of it -> get a review from R&P team
- Test logic is modified? -> Get a review from RFC owner. We're working on refining the
:rocket: Once your PR is reviewed, you can merge it!
Reviewer checklist
- [ ] If PR title starts with
[<language>], double-check that only<language>is impacted by the change - [ ] No system-tests internal is modified. Otherwise, I have the approval from R&P team
- [ ] CI is green, or failing jobs are not related to this change (and you are 100% sure about this statement)
- [ ] A docker base image is modified?
- [ ] the relevant
build-XXX-imagelabel is present
- [ ] the relevant
- [ ] A scenario is added (or removed)?
- [ ] Get a review from R&P team
Unfortunatly, partial revert is not possible, as DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS and library_interface_timeout seems to be interacting with each other, where they shouldn't.
We need to understand why the interact, and fix that. And if it's legit, choose values based on what why gain, to get the best compromise between optimization and being as close as possible from the real conditions.
We need to understand why the interact, and fix that.
This is out of scope for the values themselves. If there is a correlation, it already existed before.
The time gained by the change on DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS in https://github.com/DataDog/system-tests/pull/2805 wasn't computed.
We know what we gain as the timeout time is absolute, so it goes from 100 seconds to 10 seconds. This is significant time that is otherwise wasted for no reason. The gain is literally the timeout difference.
being as close as possible from the real conditions
This is real conditions, the requests are just made faster because we don't risk overloading the backend or the agent in tests.
I really don't see any actual compelling argument for reverting this. It will only slow down running the tests for theoretical academic reasons with no basis in the actual testing. The pragmatic approach is to make this fast if we can.
A flakyness oberved here https://github.com/DataDog/system-tests-dashboard/actions/runs/10260556706/job/28404990643#step:20:125
I don't remember seeing this issue previously. But can't affirm it's caused by #2805, but it's definitly on the short list.
Local tests on REMOTE_CONFIG_MOCKED_BACKEND_ASM_FEATURES / apache-mod-8.0 / dev shows the issue with or without #2805.
CI is ok (failures are not related)
The bug fixed in this PR caused a race condition that was very unlikely to happen with the previous values of DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS and library_interface_timeout. With values introduced in #2805 , the probability of flakiness in slightly bigger (how much?), without being a problem so far. I need to perform few more tests to have full visibility.
Though, if there is any doubt on issue related to remote config, I recommend to merge this PR before anything else, to remove a source of doubt.
Wait for #2894 for java failures