nats-server
JetStream Batch Publish support
Feature Request
Hi. I know this is not the first issue requesting batch publishing. I want to suggest an easy-to-implement API call that retains all the benefits of batch mode and transactions. @derekcollison
Use Case:
The client wants to publish a batch of messages in a single request over the network, utilizing a single connection (no multiple async requests). It is OK if the server cannot accept all the messages from the batch; it just returns the number of messages accepted successfully (without blocking until space is exhausted, or whatever).
If the server returns 0 and no error, no messages from the batch were accepted, and the client may back off.
If the server returns 42 out of 100 messages, the client knows the first 42 messages have been accepted by the server; the client may retry the remaining 58 messages, and so on.
Even when a non-nil error is returned, the client knows exactly which specific messages have been accepted.
Proposed Change:
Implement the following API call:
```go
PublishMsgBatch(mb []*Msg, opts ...PubOpt) (ackCount uint32, err error)
```
Who Benefits From The Change(s)?
Enables much higher throughput for small messages.
Alternative Approaches
The only safe alternative today is to publish messages one by one.
Try PublishAsync, which allows for higher throughput rates.
Hi @derekcollison . Thanks for the response. I have seen your suggestion:
```go
for i := 1; i <= 1_000_000; i++ {
    js.PublishAsync(fmt.Sprintf("orders.%d", i), []byte("OK"))
}
<-js.PublishAsyncComplete()
```
However, there are several drawbacks to using the PublishAsync call for batch mode:
- It does not preserve message order. Consider message #42 being rejected by the server while #43 is accepted.
- It still causes multiple requests to be sent by the client, which creates unnecessary overhead: multiple client-side connection checks, lock acquiring and releasing, buffer flushing, etc.
Hence, PublishAsync cannot be considered a fair solution for batch publishing.
You say yourself: "It's ok if a server can not accept all the messages from the batch - just return the number of messages accepted successfully (w/o blocking, until space is exhausted, whatever)" and that's exactly the behavior you get with PublishAsync.
All those async requests are sent on the currently established connection to the server; there is no need for things like connection checks, lock acquiring, etc. on the client side. There is no protocol difference between a synchronous and an asynchronous JS publication: one waits for the ack message from the JS server synchronously while the other doesn't. That's the only difference.
Hi @jnmoyne There are the following important differences, actually:
- When using PublishAsync, messages are accepted in random order.
- There are checks and locks in the NATS client code per PublishAsync call: https://github.com/nats-io/nats.go/blob/main/nats.go#L3658 Hence, these checks and locks are repeated for every message in a batch.
I think there may be even more overhead from making multiple requests instead of a single one.
The nc.mu.Lock() in the publish func at https://github.com/nats-io/nats.go/blob/main/nats.go#L3665 makes all PublishAsync calls effectively sequential, so it's even less fair.
@akurilov
ad 1. Messages are not accepted in random order. What makes you think that is the case?
ad 2. Most of those checks need to be done anyway, and the client is rather IO-bound, so I really doubt you could gain much in that regard.
It's also worth noting that while it looks simple to add the API call, it's probably going to require a protocol change and a whole lot of new work.
I agree a proper batch publish API would be great, but this is a major undertaking that involves server, protocol, JetStream and client work.
I will try to pitch in (and hope this is the correct one of the batch-publish issues to add this to).
When using JetStream for delta events, the publishing order is of course as important as the consuming order (which is now easier to scale with subject-mapping partitions).
I do not understand how to use PublishAsync with a guarantee that the order is preserved (at least within a subject, after mapping). That is also how I read this proposal: if only the first x messages could be written, we know they were persisted in the correct order and none of the following ones were. So order integrity is preserved and we can retry with the rest of them.
It would fit perfectly to do this now, when a new client interface is emerging.
/T
No denying the fact that it will require a decent chunk of work.
But it will certainly open up more opportunities where NATS can be leveraged as well. Currently, we are blocked on this feature to adopt NATS as an Event Store for Commanded: https://github.com/commanded/commanded/discussions/560
@munjalpatel how many events are you looking to batch? Can they fit into a single message? Meaning you could do what you need now? Or are they all on different subjects?
@derekcollison it really depends on the system's use cases. For my use case, we are talking thousands of different kinds of events.
So each event would be on a different subject correct? And how big is each event in your use case?
Yes, events will need different subjects. Each event will have 4 to 5 additional fields, so a couple of bytes.
The header "Nats-Expected-Last-Msg-Id" should make it possible for a SINGLE publisher to publish async at full speed and receive failures for all non-accepted messages. Since it is the client that assigns the IDs, this can be done without any round trip and without waiting to see what sequence number a message got. So if one message fails, all successive messages will also fail (and we will know which messages we have to re-publish).
However, with partitions it's useless, and we would either need to create a stream per partition (where was that template when we needed it) or we need a "Nats-Expected-Last-Subject-Msg-Id" instead. That would ensure that if one message dropped, only that subject would need to be re-published.
This is only theory-crafting; I have no idea if the "Expected" checks will slow things down even more. It should not prevent batching transmissions, though.
@derekcollison would this be a solution? Not for transactional safety as in possible rollback, but it would ensure that we can use async freely with IDs that are generated in the client.