nodejs-pubsub
Google Cloud Pub/Sub shows high latency at low message throughput
I'm running a project which publishes messages to a Pub/Sub topic and triggers a background Cloud Function.
I read that Pub/Sub performs well with high volumes of messages, but for smaller amounts, like hundreds or even tens of messages per second, it may yield high latencies.
Code example to publish a message:
const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();

async function publishMessage() {
  const topicName = 'my-topic';
  const data = JSON.stringify({foo: 'bar'}); // example payload
  const dataBuffer = Buffer.from(data);
  const messageId = await pubSubClient.topic(topicName).publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
}

publishMessage().catch(console.error);
Code example of a function triggered by Pub/Sub:
exports.subscribe = async (message) => {
  const name = message.data
    ? Buffer.from(message.data, 'base64').toString()
    : 'World';
  console.log(`Hello, ${name}!`);
};
Environment details
- OS: Windows 10
- Node.js version: 8
- `@google-cloud/pubsub` version: 1.6
The problem is that when using Pub/Sub with a low throughput of messages (for example, 1 request per second), it sometimes struggles and shows incredibly high latency (up to 7-9 s or more).
Is there a way or workaround to make Pub/Sub perform well every time (50 ms or less delay), even with a small number of incoming messages?
Thanks!
@andryichenko Sorry for the late reply on this, but could it be that the publisher piece you showed above is queueing the messages for batch send? You might try decreasing the max batching timeout:
pubSubClient.topic(topicName, {
  batching: { maxMilliseconds: 10 }
})
The default is already low (100), but it's possible something is getting set incorrectly. It's also possible that it's not properly obeying that number in your case, in which case I need to do some more debugging.
The other possibility is that it's getting hung up in the machinery somewhere between your client and your GCF function executing. I can pass it off to the right people in that case, but we should eliminate the client library first.
@feywind Thanks for the reply!
I already tried decreasing the max batching timeout, but it didn't really help. (It seemed to me that messages were going faster, but it still struggles.)
Also, as can be found in this guide, Pub/Sub tries to gather multiple messages before delivering them. In other words, it tries to deliver many messages at once. In this specific case, to achieve a more realistic real-time scenario, a batch size of 1 should be specified, which would cause Pub/Sub to deliver every message separately.
So I also added a `maxMessages` property to the `batching` object, like so:
pubSubClient.topic(topicName, {
  batching: {
    maxMessages: 1,
    maxMilliseconds: 10
  }
})
But that also didn't seem to help much. After some research I came to the conclusion that it may be the Cloud Functions causing delays on cold start. So as a workaround I made a simple Node.js server that pings the Cloud Functions via Pub/Sub so they remain warm, and it shows much better results now. I'm still trying to find a good solution for this, though.
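For reference, a minimal sketch of that keep-warm workaround, assuming the same `my-topic` name as above and an arbitrary 5-minute interval (both are illustrative, not part of the original setup):

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();
const topic = pubSubClient.topic('my-topic');

// Periodically publish a tiny marker message so the subscriber function stays warm.
setInterval(async () => {
  try {
    const messageId = await topic.publish(Buffer.from(JSON.stringify({ping: true})));
    console.log(`Keep-warm ping ${messageId} published.`);
  } catch (err) {
    console.error('Keep-warm ping failed:', err);
  }
}, 5 * 60 * 1000);

The subscriber would of course need to recognize and skip these ping messages.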
We also face the same problem: when we send 1-2 messages per minute, the latency is around 1,200 ms, but once we push more messages it becomes faster and faster (around 10 ms). It looks like there is some caching mechanism for pushing the messages.
I'm not as knowledgeable about what happens after it leaves the Pub/Sub library, so it's possible something down in gax/gRPC is doing its own buffering. I can ask the maintainers of those libraries if there's anything possibly there. But as for the Pub/Sub library itself, it should be obeying that batching configuration and not pausing. I will take a look and see if there's possibly a bug there where it's not using the right values.
@feywind did you find any solution?
I missed that there were two issues potentially related to this: https://github.com/googleapis/nodejs-pubsub/pull/1087
Do you know if 2.5.0 still has the issue?
Seems like this has been merged and hasn't solved the problem of latency.
@dozie Thanks for the update!
@alexander-fenster Sorry to ping you again :D but could there be something introducing latency and doing its own batching of calls at the gax level? If setting the batch size to 1 for the Pub/Sub library doesn't reduce the latency, it seems like it might be something at a lower level.
@feywind I can confirm that the batch size setting is set to 1 message, but the latency didn't decrease.
We are experiencing a similar issue: high latency (5-10 seconds, inconsistently) at low message throughput. I don't think it's related to the message producer, but maybe to the Pub/Sub server itself, and here is why:
We keep tracking the `publish_time` on the message, and from the official Google doc we know that this `publish_time` is the time the Pub/Sub server receives the message, so it doesn't matter what you set up on the message producer side. The time between the `publish_time` and the time our client/message consumer receives the message can be 5-10 seconds when the traffic throughput is low, especially in the scenario where a message is sent to Pub/Sub after a long period of no messages being sent.
I have been looking for some SLA for Pub/Sub that would guarantee that "queue time", but could not find one.
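For anyone wanting to reproduce that measurement, here is a minimal sketch that logs the gap between the server-side publish time and delivery, assuming a pull subscription named `my-subscription` (the name is illustrative):

const {PubSub} = require('@google-cloud/pubsub');

const subscription = new PubSub().subscription('my-subscription');

subscription.on('message', (message) => {
  // publishTime is set by the Pub/Sub server when it accepts the message.
  const deliveryLatencyMs = Date.now() - message.publishTime.getTime();
  console.log(`Message ${message.id} delivered after ${deliveryLatencyMs} ms`);
  message.ack();
});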
@kamalaboulhosn Since there's a question here about the server side... Is it possible that the service is doing some sort of batching or delaying to queue up more messages?
@kamalaboulhosn Gentle ping on this since it seems like it might be a service question at this point.
Same issue here, but the latency is around 5 minutes. We have two topics between a GAE app and a Cloud Function, in a request-response cycle. The latency shows up only on the GCF -> GAE topic. The GAE app runs the Python client; the GCF is on `"@google-cloud/pubsub": "^2.10.0"`.
The function is almost identical to what @andryichenko has posted. Not sure how to debug the latency on the way back to GAE other than looking at HTTP log timestamps.
Things I've tried:
- Created a parallel pull subscription to check timestamps on the messages. Same 5 minutes there.
- Changed batching to `maxMessages: 1`, same behavior.

Things I've considered but didn't have time to test:
- Forcing `flush()` on the cloud function before it ends. This is based on the assumption that GCF blocks the CPU for any background work that hasn't terminated after the function is done.
- Recreating the topic/subscriptions.
I've enabled some additional logging, and it looks like the function exits almost instantly, leaving the actual (async) code to run in the background for some time:
The last line comes from this code:
const publish = async (event) => {
  // config, logger, topicName and metadata are defined elsewhere in the app
  const client = new PubSub({ apiEndpoint: config.PUB_SUB_API_ENDPOINT });
  const data = Buffer.from(JSON.stringify(event));
  const topic = client.topic(topicName, { batching: { maxMessages: 1 } });
  const messageId = await topic.publish(data, metadata);
  logger.log('Message published', messageId);
  await topic.flush();
  logger.log('Flushed', messageId);
  return messageId;
};
Some observations:
- The latency up until (and including) the line `2021-03-12 12:45:49.244` is expected. But the 2-minute gap after that line isn't. I attribute the gap to the code above. In subsequent runs the latency on this step fluctuates around the 1-2 minute mark.
- Interestingly, the call to `.flush()` fails with `3 INVALID_ARGUMENT: The value for message_count is too small. You passed 0 in the request, but the minimum value is 1.`
- Looks like the awaits are not respected by the Cloud Function runner.
OK, mystery solved!
The main cloud function handler was wrapped in `Sentry.GCPFunction.wrapEventFunction`, which somehow loses track of the returned promises. In turn, that led to the function being throttled and the latency skyrocketing.
I've temporarily removed the wrapper, and the whole thing now takes 7 seconds to run. And that's including multiple external HTTP requests 🎉
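For context, the usual way to avoid that class of problem is to keep all asynchronous work behind the promise the handler itself returns, rather than leaving it to run after the handler resolves. A rough sketch, where `publish` stands for the helper from the snippet above and the handler name is illustrative:

exports.handler = async (message, context) => {
  // Everything is awaited here, so the function does not return (and GCF does
  // not throttle the CPU) until the publish has actually completed.
  const messageId = await publish({received: message.data, eventId: context.eventId});
  console.log(`Follow-up message ${messageId} published`);
};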
@killthekitten Ah, yeah. I've seen a few issues in the past with GCF losing track of async happening in the background. I'm glad you found something to help there.
I'm not sure how many of the comments above were using GCF, but it looks like the original poster was doing so. Is anyone else having a non-GCF issue with this?
I am experiencing the same: a delay of 15 s or more, a few times. Most of the time everything works OK. If I don't use `orderingKey`, there seems to be no delay after some quick testing (we rely on `orderingKey`, so I did not investigate further). I re-created all subscriptions and topics and everything seems to be fine now. I started seeing these problems from 2022-02-09. Perhaps something changed in the Pub/Sub system?
@flunderpero There is currently a known issue in the Pub/Sub service that is causing the 15s delay. The change that caused the issue is being rolled back right now. This issue would only have existed between 2/7/2022 and now.
@kamalaboulhosn Thanks for letting me know. This actually coincides with when the problems started for us.
> @flunderpero There is currently a known issue in the Pub/Sub service that is causing the 15s delay. The change that caused the issue is being rolled back right now. This issue would only have existed between 2/7/2022 and now.
Is there any way that we can track that issue as we're seeing these same 15s delays? From what we can see there hasn't been a change to the service since December according to the Pub/Sub release notes: https://cloud.google.com/pubsub/docs/release-notes
Or are you talking about something else, @kamalaboulhosn?
@ChrisWestcottUK The release notes for the server only track significant API/region availability/feature availability changes, not every server change that is made. There is no public tracking for all such changes. For tracking issues that affect your projects, it is best to put in a support request so support engineers can follow up. This particular incident should now be resolved.
One thing that appeared to have also occurred at this time was messages being published multiple times (we're using an `orderingKey`):
- Without errors, we were getting back a message ID when publishing. So the message publish appeared fine.
- Comparing those message IDs with the subscriber message IDs shows a mismatch (because they are message IDs of the hidden failures).
- This means we have subscribers receiving multiple messages for the same source messages but with different message IDs.
- According to documentation around idempotency with Cloud Functions, it's suggested to use `eventId` as an idempotency key: https://cloud.google.com/blog/products/serverless/cloud-functions-pro-tips-retries-and-idempotency-in-action
- The `eventId` is the message ID: https://cloud.google.com/functions/docs/calling/pubsub#event_structure

Doesn't that show that it's important to choose good idempotency keys and not follow the suggestions in the documentation? Or is there something I'm missing?
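For illustration only, the documented suggestion boils down to something like the sketch below, where `alreadyProcessed` and `markProcessed` are hypothetical helpers backed by your own datastore; the concern above is that a republished message arrives with a new `eventId`, so this key no longer deduplicates it:

exports.subscribe = async (message, context) => {
  const eventId = context.eventId; // for Pub/Sub-triggered functions this is the message ID

  // Hypothetical dedupe helpers; the duplicate publishes described above would
  // arrive with *different* eventIds and slip past this check.
  if (await alreadyProcessed(eventId)) {
    console.log(`Skipping already-processed event ${eventId}`);
    return;
  }
  // ... actual work ...
  await markProcessed(eventId);
};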
Linked to the meta-issue about transport problems: b/242894947
I'm seeing this issue as well. I also faced it with Cloud Functions, but I took my Cloud Function code and modified it a bit so I could easily trigger it locally with `node testProblem.js`. So it is just like your typical Node command-line app. This still causes a 2-5 second latency for publishing a message to a topic. At the same time, a different app running in GKE uses the Java version of the Pub/Sub library (via Clojure), and that works lightning fast, with no delay whatsoever. The same lightning-fast performance is retained when running that app locally, so it is not related to my personal internet connectivity or similar wonky stuff.
To me this seems like disabling batching doesn't work properly. Batching configurations are respected: if publishing fewer messages than `maxMessages`, then the latency is very close to `maxMilliseconds`, as one would expect. If, however, `maxMessages` is set to 1, then this weird latency occurs. If sending two messages with `maxMessages` set to 2, they both finish in exactly the same amount of time, which again is to be expected since they are batched. If sending two messages with `maxMessages` set to 1, they both finish about twice as fast! This does NOT make any sense.
Since the Java Pub/Sub library doesn't suffer from problems like this at all, I'm leaning towards the whole batch scheduling logic in this library being somehow broken. I'll look into this for a while still, but if I can't find the issue really soon, we will have to rewrite our Cloud Functions in a different language to get acceptable performance, since this issue causes a user-facing performance problem.
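A quick way to see the effect described above is to time a single `publish()` call with batching effectively disabled; the topic name and settings here are only illustrative:

const {PubSub} = require('@google-cloud/pubsub');

async function timeOnePublish() {
  const topic = new PubSub().topic('my-topic', {
    batching: {maxMessages: 1, maxMilliseconds: 10},
  });
  const start = Date.now();
  // With maxMessages: 1 this should resolve as soon as the single message is sent.
  const messageId = await topic.publish(Buffer.from('ping'));
  console.log(`publish() resolved in ${Date.now() - start} ms (id ${messageId})`);
}

timeOnePublish().catch(console.error);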
Going back to the original issue, I think the problem might be in the way the topic is being instantiated. The topic is being recreated for every publish, which means the topic is not being reused. It also means that the connection to the server has to be renegotiated every time. It would be better to do this:
const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();
const topicName = 'my-topic';
const topic = pubSubClient.topic(topicName);

async function publishMessage() {
  const data = JSON.stringify({foo: 'bar'}); // example payload
  const dataBuffer = Buffer.from(data);
  const messageId = await topic.publish(dataBuffer);
  console.log(`Message ${messageId} published.`);
}

publishMessage().catch(console.error);
If the topic name is not known before the publish and can change, then you should keep a map from topic name to `Topic` object and reuse it.
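A minimal sketch of that caching approach (function names here are illustrative):

const {PubSub} = require('@google-cloud/pubsub');

const pubSubClient = new PubSub();
const topicCache = new Map();

// Reuse one Topic object per topic name so the underlying connection is reused.
function getTopic(topicName) {
  if (!topicCache.has(topicName)) {
    topicCache.set(topicName, pubSubClient.topic(topicName));
  }
  return topicCache.get(topicName);
}

async function publishTo(topicName, data) {
  const messageId = await getTopic(topicName).publish(Buffer.from(data));
  console.log(`Message ${messageId} published to ${topicName}.`);
  return messageId;
}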
I'm seeing a similar issue over here. My setup is similar to the OP's: a Firebase Cloud Function triggers individual Pub/Sub topic messages. I'm seeing latency in minutes rather than seconds.
Interestingly, the `topic().publish()` method seems to be taking a long time based on the logs.
I've tried setting max messages, but that didn't seem to help:
batching: {
  maxMessages: 1
}
Anything else I can try or information to share that would help debug this?
> I'm seeing latency in minutes rather than seconds.

Have you tried debugging it the same way I did in one of the comments above? It could be caused by throttling on a function because of an async wrapper, or any async code that you aren't awaiting properly.
The key is to log when the function returns: does it return sooner than the `publish()` call completes?
@killthekitten Great tip, thanks! Turned out that was the problem in my case: I was not waiting for the `topic().publish()` call to complete before terminating the cloud function. The fix was to add an `await` on the publish `Promise`, and now the messages seem to be delivered much faster 👍
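For anyone hitting the same thing, the fix amounts to something like this sketch (the topic name is illustrative): the handler must not resolve until the publish promise has resolved.

const {PubSub} = require('@google-cloud/pubsub');
const pubSubClient = new PubSub();

exports.subscribe = async (message, context) => {
  const data = Buffer.from(JSON.stringify({sourceEventId: context.eventId}));
  // Awaiting here keeps the function alive until the publish has completed,
  // instead of letting it terminate with the message still buffered client-side.
  const messageId = await pubSubClient.topic('my-topic').publish(data);
  console.log(`Forwarded as ${messageId}`);
};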
Hi all, I just made some updates to the batch flushing logic for publishing, and that's in 3.4.0. I'd be interested to hear if it helps this.