nodejs-pubsub
Publishing messages burst timeout
Environment details
- OS: Google Kubernetes Container
- Node.js version: 12.15.0
- npm version: -
- @google-cloud/pubsub version: 2.18.3
Steps to reproduce
- ?
- ?
We're seeing an issue in our production environment. It happens pretty inconsistently, so I'm not sure exactly how to reproduce it.
This service publishes messages to a couple of topics consistently, and the publishing volume is around 1 MiB per second. The errors come in bursts rather than steadily, and each burst comes from a single pod at a time (we run about 150 pods on this service). For example, we'll see a burst of ~5k errors across all of the topics coming from pod A, and the next day we'll see the same from pod B. This recurs every several hours or days. Rolling out the deployment or killing the offending pod resolves the errors for at least a few hours. The errors don't resolve on their own quickly, at least not within 20 minutes.
BTW, the pubsub instance is created once and reused for subsequent publishes.
The error message and stack:
Error: Total timeout of API google.pubsub.v1.Publisher exceeded 600000 milliseconds before any response was received.
at repeat (/deploy/my-project/node_modules/google-gax/build/src/normalCalls/retries.js:66:31)
at Timeout._onTimeout (/deploy/my-project/node_modules/google-gax/build/src/normalCalls/retries.js:101:25)
at listOnTimeout (internal/timers.js:531:17)
at processTimers (internal/timers.js:475:7)
Thanks! Please let me know what other information would be helpful.
This looks like a client-side issue. All pods but one are able to send requests to Pub/Sub and get a response back. Removing the bad pod is a good temporary fix. To fix this for good, we need to know what the bad pod is doing with those failed requests. Was it unable to send the requests in the first place, or was it unable to receive the responses back? Chances are the client has exhausted its network connections, or the connection to the Pub/Sub endpoint might have been disconnected, but the client is unaware of it and is still using the broken connection without re-establishing a new one. How resilient is your pod in handling error conditions? Can you share a code snippet showing how messages are published in the pod and how it handles error conditions?
@githubwua Thanks. I think it is more likely the latter: "the connection to the Pub/Sub endpoint might have been disconnected, but the client is unaware of it and is still using the broken connection without re-establishing a new one." During these incidents the other pods on the same Kubernetes node are healthy, and the affected pod's other network connections are normal too.
Could you suggest how to re-establish a new connection in this case?
The code is something like:
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

const topic_map = {
  topic_1: pubsub.topic('topic_1'),
  topic_2: pubsub.topic('topic_2'),
  topic_3: pubsub.topic('topic_3'),
  topic_4: pubsub.topic('topic_4'),
  topic_5: pubsub.topic('topic_5'),
};

async function publish(opts) {
  const { topic_name, attributes, data } = opts;
  const topic = topic_map[topic_name];
  try {
    await topic.publish(data, attributes);
  } catch (e) {
    console.error(e);
  }
}
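For reference, a minimal sketch of what retrying at the call site could look like on top of the snippet above; the publishWithRetry name, attempt count, and backoff delays are illustrative, not an official recommendation:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Illustrative wrapper around the publish() helper above: retries a failed
// publish a few times with exponential backoff before giving up.
async function publishWithRetry(opts, maxAttempts = 3) {
  const { topic_name, attributes, data } = opts;
  const topic = topic_map[topic_name]; // topic_map from the snippet above
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await topic.publish(data, attributes);
    } catch (e) {
      if (attempt === maxAttempts) {
        console.error(`publish to ${topic_name} failed after ${maxAttempts} attempts`, e);
        throw e;
      }
      await sleep(1000 * 2 ** (attempt - 1)); // wait 1s, 2s, 4s, ...
    }
  }
}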
To be clear, is the issue that the publishes themselves aren't going through, or that the messages aren't being received in a subscriber? If it's the latter, it might be another instance of this: https://github.com/googleapis/nodejs-pubsub/issues/1135
If there's a network cause behind that issue, this might be it finally manifesting for publish as well. :|
@feywind It's the former. I think it relates to the handling of the pubsub network connection.
We are also seeing this error:
Unable to publish for key "....". Reason: Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.
And this error:
Total timeout of API google.pubsub.v1.Publisher exceeded 60000 milliseconds before any response was received.
Nothing seems to have changed in our code/usage to cause the error, but we see these bursts every few weeks. The most recent incident was 206 events between 2022-04-07T11:56:29.275Z and 2022-04-07T12:15:49.489Z. After this time, everything seems to have returned to normal. This occurred across 4 different servers in approximately the same time range in the EU region, but nothing shows up on the Google Cloud console's status page. Given the independent, simultaneous failures on multiple servers, I think this must have been a service disruption on the Google Pub/Sub side.
Should the library automatically handle this? Should I be adding my own retry logic?
Update:
This seems to be ongoing, with a steadily growing number of events.
Looking at https://github.com/googleapis/nodejs-pubsub/blob/v2.19.0/src/v1/publisher_client_config.json, all the methods specify a retry timeout of 60_000, which is equivalent to the initial_rpc_timeout_millis value of 60_000. Assuming the timeout_millis value on each method overrides the total_timeout_millis value, this means each request will always hit the total timeout before it has a chance to retry even once.
It looks like this may have actually been fixed in 8c1afeee26fbff84448b1a9576e4c4ef5c67f9cc (v2.18.0), then broken again in 75d7335e763dad1801409c242d9cfe5d73b8a773, made even worse in 1e110016ea04fc41d2fa80d64a54a11fd4078c60 (v2.18.3), and then improved to "bad, but not quite as bad" in 34a4d4aa50c6f07adcec1dcba1ee0eaf941f71c0 (v2.18.5).
Can we get any official answers on what a good default for these timeouts would be?
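For illustration, this is the general shape of a gax retry configuration; the values below are examples only, not the library's actual defaults (see publisher_client_config.json for those). The point above is that when the per-attempt deadline equals the total budget, a single slow attempt exhausts the budget before any retry can happen:

// Illustrative gax-style retry parameters (example values, not defaults).
const exampleRetryParams = {
  initial_retry_delay_millis: 100,    // delay before the first retry
  retry_delay_multiplier: 1.3,        // growth factor for retry delays
  max_retry_delay_millis: 60000,      // cap on the delay between retries
  initial_rpc_timeout_millis: 60000,  // deadline for the first RPC attempt
  rpc_timeout_multiplier: 1.0,        // growth factor for per-attempt deadlines
  max_rpc_timeout_millis: 60000,      // cap on the per-attempt deadline
  total_timeout_millis: 600000,       // overall budget across all attempts
};

// If total_timeout_millis were also 60000 (matching the per-attempt
// deadline), the first attempt could consume the entire budget and the call
// would fail with "Total timeout ... exceeded" without ever retrying.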
Seeing very similar behavior in GKE deployments as well. For us, the application never recovers, restarting the pods is the only way to get the client functional again. This is happening for us at much lower throughput than the OP -- some of the pods experiencing this are probably averaging less than one message per second and in the ballpark of 10KiB per message or less. Bursts of many of these messages definitely happen however. It is unclear if our failures correlate with the bursts.
We are experiencing the same problem. From time to time a service (for us it's either GKE or Cloud Run) just won't recover from a timeout error as described above. Only restarting (or in case of Cloud Run re-deploying) the service solves the problem. We have a fairly low publish rate (0.1 to 1 per second) and no bursts at all. It all started around April 7th.
I also opened a ticket with Google Cloud Support.
Motivated by the above comment, I checked my team's logs, and we too see a sudden uptick in this error, though for us it started on April 5th.
We're seeing the same thing. It started in late March for us.
We see the same thing, it happened to three or four different pods this past weekend, and is happening on a weekly basis.
Despite our pods not reaching their CPU limits, it seems we solved the issue by raising the requested CPU on our pods.
Thanks for the patience. I posted a snippet over here that should extend the publisher RPC deadlines. I think that's papering over the real issue, which isn't known yet, but maybe it helps get everyone moving again.
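That snippet isn't reproduced here, but for readers following along, one way to extend the publish deadline is to pass gaxOpts when creating the topic. This is only a sketch, assuming the installed version accepts a gaxOpts.timeout publish option; the linked snippet remains the authoritative version:

const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();

// Sketch: request a longer total deadline on publish RPCs for this topic.
// The exact option shape can differ between library versions.
const topic = pubsub.topic('topic_1', {
  gaxOpts: {
    timeout: 10 * 60 * 1000, // 10 minutes, in milliseconds
  },
});

async function example() {
  await topic.publish(Buffer.from('hello'), { source: 'example' });
}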
We've just seen a massive uptick in these errors. Approximately 1 in 6 publish attempts are currently failing and restarting the servers does not resolve the issue.
We see "unavailable" and "cancelled" errors from Pub/Sub when checking the Pub/Sub topics - Publish Requests in monitoring:
These errors line up exactly with the timeout errors we see from the client.
Linked to the meta-issue about transport problems: b/242894947
@feywind Where can we view that linked issue?
@ForbesLindesay That's unfortunately an internal bug reference. We're trying to build some momentum for cross-functional debugging on some of these issues, because there's a belief that it involves several components (GKE/GCF/etc). The current belief is that this isn't in the Pub/Sub client library itself, so we might shift this to a different project (or the public bug tracker).
We are also having this problem on 3.3.0 of the pubsub library. We also notice a spike in CPU & memory when this happens.
We have not been able to find a general way to approach debugging these problems, but instead have required customer-specific investigations around their setup. Please enter a support case with all of the details of your clients and the environments in which they run in order to continue the investigation. Thanks!
I'm also seeing this issue where a pod/node hits a timeout and then all attempts to enqueue a message fail after that. Maybe it would be better if we didn't reuse the PubSub() object and constructed a new one for each enqueue?
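Creating a new PubSub() for every publish would give up batching and connection reuse, so a middle ground might be to rebuild the client only after repeated deadline errors. This is only a sketch under that assumption; publishWithRecovery, the threshold, and the error matching are illustrative and not a confirmed fix for the underlying transport problem:

const { PubSub } = require('@google-cloud/pubsub');

let pubsub = new PubSub();
let topics = new Map();
let consecutiveTimeouts = 0;
const TIMEOUT_THRESHOLD = 5; // illustrative threshold

function getTopic(name) {
  if (!topics.has(name)) topics.set(name, pubsub.topic(name));
  return topics.get(name);
}

function rebuildClient() {
  // Drop the old client (and its channels) and start fresh.
  pubsub.close().catch(() => {});
  pubsub = new PubSub();
  topics = new Map();
  consecutiveTimeouts = 0;
}

async function publishWithRecovery(topicName, data, attributes) {
  try {
    const messageId = await getTopic(topicName).publish(data, attributes);
    consecutiveTimeouts = 0;
    return messageId;
  } catch (e) {
    if (/Total timeout of API .* exceeded/.test(String(e.message))) {
      consecutiveTimeouts += 1;
      if (consecutiveTimeouts >= TIMEOUT_THRESHOLD) rebuildClient();
    }
    throw e;
  }
}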