Address feature flag timeout issues

Open sgagniere opened this issue 2 weeks ago • 2 comments

Release Notes

Breaking Changes

  • PLACEHOLDER

New Features

  • PLACEHOLDER

Bug Fixes

  • Resolve an issue where on-premises users without access to confluent.cloud could experience significant delays when running commands

Checklist

  • [x] I have successfully built and used a custom CLI binary, without linter issues from this PR.
  • [x] I have clearly specified in the What section below whether this PR applies to Confluent Cloud, Confluent Platform, or both.
  • [x] I have verified this PR in Confluent Cloud pre-prod or production environment, if applicable.
  • [ ] I have verified this PR in Confluent Platform on-premises environment, if applicable.
  • [x] I have attached manual CLI verification results or screenshots in the Test & Review section below.
  • [ ] I have added appropriate CLI integration or unit tests for any new or updated commands and functionality.
  • [x] I confirm that this PR introduces no breaking changes or backward compatibility issues.
  • [x] I have indicated the potential customer impact if something goes wrong in the Blast Radius section below.
  • [x] I have put checkmarks below confirming that the feature associated with this PR is enabled in:
    • [ ] Confluent Cloud prod
    • [ ] Confluent Cloud stag
    • [ ] Confluent Platform
    • [ ] Check this box if the feature is enabled for certain organizations only

What

When users cannot connect to confluent.cloud to retrieve feature flags, the CLI spends 30 seconds attempting to retrieve each flag. Feature flags can be disabled through confluent configuration update, but even the command that disables them incurs this delay before the flags are finally disabled.

To resolve this poor UX, this PR does the following:

  • Only retrieve feature flags when the user is logged into Cloud (since our flags are all related to Cloud anyway)
  • Reduce the timeout from 30 seconds to 5 seconds
  • Stop attempting to retrieve flags for the rest of the command once one feature flag request times out

Additionally, this PR replaces a previous band-aid solution for feature flag warnings that appeared while running the linter or generating the docs. Previously, we simply suppressed the feature flag warnings generated by the docs or lint code. The new approach is to spin up the test server to handle feature flag requests locally (and block the CLI from writing the test values to the config).
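
For readers unfamiliar with the pattern, serving flag requests from a local server looks roughly like the sketch below, written with Go's standard net/http/httptest package. The names newFlagTestServer and fetchFlags, the /flags path, and the JSON shape are assumptions for illustration only. They do not describe the CLI's actual test server or the real feature flag API, and the config-write blocking mentioned above is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newFlagTestServer starts a local server that answers every request
// with a fixed set of flag values (hypothetical response format).
func newFlagTestServer(flags map[string]bool) *httptest.Server {
	handler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(flags)
	})
	return httptest.NewServer(handler)
}

// fetchFlags retrieves flag values from baseURL instead of a remote host.
func fetchFlags(baseURL string) (map[string]bool, error) {
	resp, err := http.Get(baseURL + "/flags") // hypothetical path
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	flags := map[string]bool{}
	if err := json.NewDecoder(resp.Body).Decode(&flags); err != nil {
		return nil, err
	}
	return flags, nil
}

func main() {
	server := newFlagTestServer(map[string]bool{"example.flag": true})
	defer server.Close()
	flags, err := fetchFlags(server.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println(flags["example.flag"])
}
```

Because the request never leaves the machine, it cannot hit the timeout path, which is why no warning is produced that would need suppressing.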

Blast Radius

Some commands, like confluent iam ip-filter create or confluent iam pool create, add command flags based on feature flag values, and some commands, like confluent iam rbac role describe, accept additional inputs based on feature flags.

Significant latency combined with a shorter timeout may occasionally cause these options to be unavailable.

References

Test & Review

Test doc: https://docs.google.com/document/d/1lA7ZYgKOOaSSqfcp8G7aDlovixH8U0tPZu-Q9dAXo4M/edit?usp=sharing

sgagniere, Nov 11 '25 22:11

:tada: All Contributor License Agreements have been signed. Ready to merge.
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

How do we verify the two scenarios below, which are described in the testing document? It seems we had a different default timeout before this change.

  • Not logged in, and confluent.cloud is unreachable
  • Feature flags aren't retrievable after logging in

channingdong, Nov 17 '25 19:11

How do we verify the 2 scenarios below described in the testing document? It seems like we have different default timeout before this change.

I updated the testing doc with an explanation of how to reproduce the testing conditions.

sgagniere, Nov 17 '25 19:11

Note on code coverage:

Coverage of pkg/featureflags/feature_flags.go is 87.5%, and that's the only file out of the 3 changed that's actually part of the CLI itself.

sgagniere, Nov 17 '25 23:11

Quality Gate failed

Failed conditions
50.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube