system-tests icon indicating copy to clipboard operation
system-tests copied to clipboard

Fix interaction between interface timeout and RC POLL INTERVAL + Partial revert #2805

Open cbeauchesne opened this issue 1 year ago • 6 comments

Motivation

Being as close as possible from real conditions is a golden rules in functional testing.

The time gained by the change on DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS in #2805 wasn't computed.

Changes

  • [x] Revert #2805 on the change on DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS.
  • [x] fix the bug on remote config race test
  • [x] Remove unecessary library timeout

Workflow

  1. ⚠️ Create your PR as draft ⚠️
  2. Work on you PR until the CI passes (if something not related to your task is failing, you can ignore it)
  3. Mark it as ready for review
    • Test logic is modified? -> Get a review from RFC owner. We're working on refining the codeowners file quickly.
    • Framework is modified, or non obvious usage of it -> get a review from R&P team

:rocket: Once your PR is reviewed, you can merge it!

🛟 #apm-shared-testing 🛟

Reviewer checklist

  • [ ] If PR title starts with [<language>], double-check that only <language> is impacted by the change
  • [ ] No system-tests internal is modified. Otherwise, I have the approval from R&P team
  • [ ] CI is green, or failing jobs are not related to this change (and you are 100% sure about this statement)
  • [ ] A docker base image is modified?
    • [ ] the relevant build-XXX-image label is present
  • [ ] A scenario is added (or removed)?

cbeauchesne avatar Aug 05 '24 10:08 cbeauchesne

Unfortunatly, partial revert is not possible, as DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS and library_interface_timeout seems to be interacting with each other, where they shouldn't.

We need to understand why the interact, and fix that. And if it's legit, choose values based on what why gain, to get the best compromise between optimization and being as close as possible from the real conditions.

cbeauchesne avatar Aug 05 '24 13:08 cbeauchesne

We need to understand why the interact, and fix that.

This is out of scope for the values themselves. If there is a correlation, it already existed before.

The time gained by the change on DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS in https://github.com/DataDog/system-tests/pull/2805 wasn't computed.

We know what we gain as the timeout time is absolute, so it goes from 100 seconds to 10 seconds. This is significant time that is otherwise wasted for no reason. The gain is literally the timeout difference.

being as close as possible from the real conditions

This is real conditions, the requests are just made faster because we don't risk overloading the backend or the agent in tests.

I really don't see any actual compelling argument for reverting this. It will only slow down running the tests for theoretical academic reasons with no basis in the actual testing. The pragmatic approach is to make this fast if we can.

rochdev avatar Aug 05 '24 14:08 rochdev

A flakyness oberved here https://github.com/DataDog/system-tests-dashboard/actions/runs/10260556706/job/28404990643#step:20:125

I don't remember seeing this issue previously. But can't affirm it's caused by #2805, but it's definitly on the short list.

cbeauchesne avatar Aug 06 '24 14:08 cbeauchesne

Local tests on REMOTE_CONFIG_MOCKED_BACKEND_ASM_FEATURES / apache-mod-8.0 / dev shows the issue with or without #2805.

cbeauchesne avatar Aug 06 '24 15:08 cbeauchesne

CI is ok (failures are not related)

cbeauchesne avatar Aug 07 '24 14:08 cbeauchesne

The bug fixed in this PR caused a race condition that was very unlikely to happen with the previous values of DD_REMOTE_CONFIG_POLL_INTERVAL_SECONDS and library_interface_timeout. With values introduced in #2805 , the probability of flakiness in slightly bigger (how much?), without being a problem so far. I need to perform few more tests to have full visibility.

Though, if there is any doubt on issue related to remote config, I recommend to merge this PR before anything else, to remove a source of doubt.

cbeauchesne avatar Aug 08 '24 07:08 cbeauchesne

Wait for #2894 for java failures

cbeauchesne avatar Aug 19 '24 17:08 cbeauchesne