pekko-http icon indicating copy to clipboard operation
pekko-http copied to clipboard

Random occasional `SubscriptionWithCancelException$NoMoreElementsNeeded` after migrating from Akka to Pekko

Open Tvaroh opened this issue 1 year ago • 15 comments

Hey, I notice that when I migrate to Pekko (form latest "free" Akka), I start getting occasional org.apache.pekko.stream.SubscriptionWithCancelException$NoMoreElementsNeeded$ (No error message supplied) from web clients.

I've increased stream-cancellation-delay to 1s but still it fails sometimes. It does really look like something changed from Akka to Pekko.

I don't really have any code to reproduce as it's not consistent, maybe once in couple of days. But reverting back to Akka does fix it.

Tvaroh avatar Jan 16 '24 15:01 Tvaroh

Do you mean it works after you revert back to akka 2.6.20?

He-Pin avatar Jan 16 '24 15:01 He-Pin

@He-Pin yep.

Tvaroh avatar Jan 16 '24 15:01 Tvaroh

Could you please share a simple reproducer, thanks @Tvaroh

He-Pin avatar Jan 16 '24 15:01 He-Pin

As I said, it's inconsistent and happens sporadically. So I don't have a reproducer, but in the end it calls Http().singleRequest(request) which fails.

Tvaroh avatar Jan 17 '24 12:01 Tvaroh

I don't know about any change which could have caused this issue. Can you still validate that the old version based on Akka is still fine (to rule out a coincidence where something else in the wider environment might have changed in the meantime)?

Your best bet is trying to collect debug log message from org.apache.pekko.http.impl.engine.client.pool.NewHostConnectionPool to isolate the issue.

Since the issue indeed looks like it could be caused by the infamous stream cancellation race condition, you could try setting stream-cancellation-delay to much higher values to rule out any race-conditions that might still occur in a 1 second time frame caused by GC or other interruptions.

What kind of requests run into that problem? Usually, idempotent requests are retried a few times, which would make it even more unlikely to happen. So, this might also be an issue of unfortunate timing, where an non-idempotent request (e.g. POST) is run at exactly the same time as the server decides to close the connection (however, that scenario is reported in an unclear way caused by stream cancellation). In that case, playing around with idle-timeout and host-connection-pool.keep-alive-timeout (see also its documentation in the reference.conf) might help mitigate the issue.

jrudolph avatar Jan 22 '24 12:01 jrudolph

@jrudolph yeah it only happens for requests to a particular internal service written in (God forgive) C#, so this can be some weird corner case, but it def doesn't happen on Akka. I'll try your suggestions and will keep an eye on it. For now I've added some retries, and yeah - I've seen it fail on GET and POST at least. Thanks for the reply, I'll apply some config changes and will post back.

Tvaroh avatar Jan 22 '24 13:01 Tvaroh

@Tvaroh have you found the root cause or a permanent fix? I have a service going through the same issue after a Pekko migration (Scala 2.12, pekko-cluster-sharding, pekko-http-circe).

djjorjinho avatar Aug 12 '24 16:08 djjorjinho

@djjorjinho it's mysterious but after adding retries those exceptions eventually went away by themselves, perhaps something changed externally, so I'm afraid I have no answer to that.

Tvaroh avatar Aug 13 '24 09:08 Tvaroh

@Tvaroh no worries, thanks for the reply.

Can you share if you made any changes to any config inside pekko.http.host-connection-pool or pekko.http.client, like stream-cancellation-delay, idle-timeout, keep-alive-timeout, max-connection-lifetime, etc.?

djjorjinho avatar Aug 13 '24 09:08 djjorjinho

@djjorjinho good point, sure, here what we ended up with:

pekko {
  loggers = ["org.apache.pekko.event.slf4j.Slf4jLogger"]
  loglevel = "DEBUG"

  http {
    client {
      stream-cancellation-delay = 2s

      parsing {
        max-content-length = 128m
      }
    }

    server {
      stream-cancellation-delay = 2s
      request-timeout = 60s
    }

    parsing {
      max-uri-length = 8k
      max-header-value-length = 8k
      max-to-strict-bytes = 64m
    }

    routing {
      decode-max-size = 16m
    }

    host-connection-pool {
      max-open-requests = 64
      response-entity-subscription-timeout = 10.seconds
    }
  }
}

Tvaroh avatar Aug 13 '24 09:08 Tvaroh

Just to add an extra note - this was a problem in Akka as well: https://github.com/akka/akka-http/issues/3201

And, looking at the issue, I think it was never fixed. The only PR referencing that issue was by @mdedetrich, which was not merged (due to conflicts with maintaining Pekko) and, by the title, sounds more like it was a workaround.

JD557 avatar Sep 04 '24 12:09 JD557

So I can still finish off the PR in Pekko as it appears that I basically didn't finish off the PR (changing the value caused some tests to fail and I needed to do adjustments to fix those tests)

mdedetrich avatar Sep 04 '24 12:09 mdedetrich

PR made at https://github.com/apache/pekko-http/pull/590, still need to fix the tests still they don't expect such a long stream-cancellation-delay

mdedetrich avatar Sep 04 '24 12:09 mdedetrich

With Pekko being forked off of Akka 2.6.x it seems that we can now solve this properly i.e.

IIRC, the whole reason to introduce the delay was to deal with the fact that stream cancellation could not be well-configured in Akka 2.5. With 2.6, it's not 100% clear if the delay is even still needed or whether it could be solved in a better way (i.e. by resolving the cancellation race more intelligently, taking the cancellation reason into account at the right/more places) to avoid symptoms as shown in the ticket.

Originally posted by @jrudolph in https://github.com/apache/pekko-http/pull/590#discussion_r1747075727

@pjfanning Is it worth it to try and solve this properly before the 1.1.0 release?

mdedetrich avatar Sep 06 '24 13:09 mdedetrich

#590 has been reverted (#597). We can revisit this for v1.1.1.

pjfanning avatar Sep 21 '24 17:09 pjfanning