rust-server-sdk

High volume of "Failed to send events" errors

Open samscott89 opened this issue 1 year ago • 7 comments

Describe the bug

We're seeing a high volume of:

Failed to send events. Some events were dropped: hyper::Error(Http2, Error { kind: GoAway(b"", NO_ERROR, Remote) })

errors logged in production

To reproduce

Not sure beyond "run the SDK for a while"?

Expected behavior

If these errors are benign then I wouldn't expect them to log an ERROR event. Otherwise we'll need to silence all logging from the launchdarkly sdk which would be a shame.

If these errors aren't benign, then I guess I would expect maybe a retry?
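For context on the retry idea: in HTTP/2, a GOAWAY frame carrying NO_ERROR is the peer's graceful-shutdown signal, so a send that fails this way is a plausible candidate for one more attempt on a fresh connection. Below is a minimal sketch of that idea using only the standard library. `SendError`, `send_with_retry`, and the closure standing in for the HTTP call are hypothetical names for illustration, not the SDK's actual API or behavior.

```rust
use std::fmt;

/// Hypothetical error type standing in for whatever the HTTP layer returns.
#[derive(Debug, Clone, PartialEq)]
enum SendError {
    /// The peer sent GOAWAY with NO_ERROR (graceful connection shutdown).
    GracefulGoAway,
    /// Anything else; treated as non-retryable in this sketch.
    Other(String),
}

impl fmt::Display for SendError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SendError::GracefulGoAway => write!(f, "GOAWAY (NO_ERROR) from remote"),
            SendError::Other(msg) => write!(f, "{msg}"),
        }
    }
}

/// Calls `send` at least once and at most `max_attempts` times, retrying only
/// on `GracefulGoAway`. Returns the number of attempts used on success, or
/// the last error otherwise.
fn send_with_retry<F>(mut send: F, max_attempts: u32) -> Result<u32, SendError>
where
    F: FnMut() -> Result<(), SendError>,
{
    let mut attempt = 0;
    loop {
        attempt += 1;
        match send() {
            Ok(()) => return Ok(attempt),
            // Graceful shutdown and attempts remain: try again.
            Err(SendError::GracefulGoAway) if attempt < max_attempts => continue,
            // Non-retryable error, or attempts exhausted.
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulated transport: the first call fails with a graceful GOAWAY,
    // later calls succeed.
    let mut calls = 0;
    let result = send_with_retry(
        || {
            calls += 1;
            if calls == 1 {
                Err(SendError::GracefulGoAway)
            } else {
                Ok(())
            }
        },
        3,
    );
    match result {
        Ok(attempts) => println!("events sent after {attempts} attempt(s)"),
        Err(e) => eprintln!("Failed to send events. Some events were dropped: {e}"),
    }
}
```

A real implementation would likely also add a delay between attempts and cap the total time spent; both are left out here to keep the sketch short.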

Logs

That's all I have I'm afraid

log.file          /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/launchdarkly-server-sdk-2.1.0/src/events/sender.rs
log.line          140
log.module_path   launchdarkly_server_sdk::events::sender
log.target        launchdarkly_server_sdk::events::sender

SDK version

First happened on 1.1; we upgraded and it is still happening on 2.1

Language version, developer tools

Rust 1.78.0

OS/platform

Amazon Linux 6.1.87-99.174.amzn2023.x86_64

samscott89 avatar May 07 '24 13:05 samscott89

Thank you for bringing this to our attention. We will investigate and let you know once we have a resolution!

keelerm84 avatar May 07 '24 17:05 keelerm84

Hey @keelerm84 , any update on this? We're on 2.1.0 but seeing a similar problem:

error on event stream: Eof

image

Rust Tracing Fields
log.file          /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/launchdarkly-server-sdk-2.1.0/src/data_source.rs
log.line          151
log.module_path   launchdarkly_server_sdk::data_source
log.target        launchdarkly_server_sdk::data_source

samscott89 avatar Jul 26 '24 14:07 samscott89

@samscott89 a couple of questions for you.

  1. The initial error was about Failed to send events. Some events were dropped [...]. Are you still experiencing that problem?
  2. For the error on event stream: Eof error, does the included graph represent ONLY occurrences of that specific error?
  3. Can you describe your setup? Are you connecting directly to LaunchDarkly APIs or are you using the relay proxy?

keelerm84 avatar Jul 30 '24 15:07 keelerm84

The initial error was about Failed to send events. Some events were dropped [...]. Are you still experiencing that problem?

Ah good catch, it seems like those stopped around May 7th. Happy to reopen as a different issue or change title if that would be helpful.

And to reiterate a point from the first message: we're not observing any erroneous behaviour, but the current impact is that we're silencing all errors from launchdarkly since these seem non-actionable.

For the error on event stream: Eof error, does the included graph represent ONLY occurrences of that specific error?

Yes that's correct. These errors started on April 30th. We've seen 14k events since then. Seems to be paired with this log event in case that's helpful:

image

Can you describe your setup? Are you connecting directly to LaunchDarkly APIs or are you using the relay proxy?

Pretty vanilla setup I think, connecting directly:

        // By fiat, tests will not be allowed to hit LaunchDarkly.
        let offline_mode = cfg!(test) || matches!(mode, FlagClientMode::Offline);

        let client = {
            let config = ConfigBuilder::new(sdk_key).offline(offline_mode);

            tracing::info!("Starting the LaunchDarkly client in {mode:?} mode");
            Client::build(config.build().expect("valid config")).expect("build launchdarkly client")
        };

        client.start_with_default_executor();

        let start = std::time::Instant::now();

        // Wait to ensure the client has fully initialized.
        // Offline mode clients will always be immediately initialized, so this
        // will be a no-op for them.
        let init_ld_span = tracing::info_span!("init_ld");
        let initialized = client
            // NOTE(Sam): max observed time in production for the last two months
            // is 25s
            //
            // If this is failing and we're unable to connect to launchdarkly in 2 minutes
            // consider deploying with flags in offline mode
            .wait_for_initialization(std::time::Duration::from_secs(120))
            .instrument(init_ld_span)
            .await;
        if initialized != Some(true) {
            panic!("Couldn't start the LaunchDarkly client");
        }
        tracing::info!("LaunchDarkly client startup took {:?}", start.elapsed());

samscott89 avatar Aug 02 '24 14:08 samscott89

It looks like we are simply printing that log message when we don't need to. I am making a change to suppress that message when it's an EOF response since that's an expected condition and we handle it fine, as you noted.
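The idea of staying quiet on an expected end-of-stream while still reporting other failures can be sketched as follows. This is an illustration only, not the SDK's actual change: `StreamError` and `error_log_line` are hypothetical names, and the sketch uses only the standard library rather than a real logging crate.

```rust
/// Hypothetical stand-in for the errors a streaming connection can surface.
#[derive(Debug, PartialEq)]
enum StreamError {
    /// The server closed the stream; assumed here to be handled by reconnecting.
    Eof,
    /// Any other failure, carried as a message in this sketch.
    Other(String),
}

/// Returns the line to log for `err`, or `None` when the condition is
/// expected and should not be reported as an error.
fn error_log_line(err: &StreamError) -> Option<String> {
    match err {
        StreamError::Eof => None,
        StreamError::Other(msg) => Some(format!("error on event stream: {msg}")),
    }
}

fn main() {
    let errors = vec![
        StreamError::Eof,
        StreamError::Other("connection reset".to_string()),
    ];
    for err in &errors {
        // Only unexpected errors produce output; Eof is skipped silently.
        if let Some(line) = error_log_line(err) {
            eprintln!("{line}");
        }
    }
}
```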

I will let you know once a release with the fix has been cut.

Thank you for your help and patience with this.

keelerm84 avatar Aug 08 '24 19:08 keelerm84

v2.2.1 has been released which I believe should quiet down that error for you. Please let us know!

keelerm84 avatar Aug 08 '24 20:08 keelerm84

Perfect, thank you! Will give it a try

samscott89 avatar Aug 13 '24 16:08 samscott89

This issue is marked as stale because it has been open for 30 days without activity. Remove the stale label or comment, or this will be closed in 7 days.

github-actions[bot] avatar Sep 13 '24 01:09 github-actions[bot]