slices: add Chunk function to divide []T into [][]T chunks
A problem I've run into a fair amount is dealing with APIs that only accept a maximum number of inputs at once, even though I may ultimately have more inputs than that to process.
For example, Amazon S3 can only delete up to 1000 objects in a single DeleteObjects
API request (https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html). I've run into similar issues in the past when injecting a large number of routes into the Linux kernel via route netlink, or when modifying a large number of WireGuard peers at once.
To deal with this situation in a generic way, I've come up with the following:
package slices
// Chunk batches []T into [][]T in groups of size. The final chunk of []T will be
// smaller than size if the input slice cannot be chunked evenly. It does not
// make any copies of slice elements.
//
// As an example, take a slice of 5 integers and create chunks of 2 integers
// each (the final value creates a short chunk):
// slices.Chunk([]int{1, 2, 3, 4, 5}, 2) = [][]int{{1, 2}, {3, 4}, {5}}
func Chunk[T any](slice []T, size int) [][]T {
	var chunks [][]T
	for i := 0; i < len(slice); {
		// Clamp the last chunk to the slice bound as necessary.
		end := size
		if l := len(slice[i:]); l < size {
			end = l
		}

		// Set the capacity of each chunk so that appending to a chunk does not
		// modify the original slice.
		chunks = append(chunks, slice[i:i+end:i+end])
		i += end
	}

	return chunks
}
Which is then used as follows:
objs, err := c.allObjects(ctx)
if err != nil {
	return nil, err
}

// Begin deleting the objects concurrently in batches of 1000 (the
// maximum limit size supported by S3).
const limit = 1000

eg, ctx := errgroup.WithContext(ctx)
eg.SetLimit(10)

for _, chunk := range slices.Chunk(objs, limit) {
	chunk := chunk
	eg.Go(func() error { /* deletion logic */ })
}
If this proposal is accepted, I'd be happy to send a CL for the above. Thanks for your time.
Edit 1: updated the second parameter name to size int, as inspired by Rust's chunks method: https://doc.rust-lang.org/std/primitive.slice.html#method.chunks.
Edit 2: set capacity for each chunk per next comment.
A small correction to your given implementation:
https://github.com/golang/go/wiki/SliceTricks#batching-with-minimal-allocation
it’s important to use the 3-arg slice form when chunking; otherwise appending to a chunk could unintentionally overwrite elements of the original slice
https://go.dev/play/p/Zp0Dh5cB1rB
https://go.dev/play/p/EP-wf5dmr3V
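To make the failure mode concrete (a standalone sketch along the lines of the playground links above): with a plain two-arg sub-slice, an append can write into the original backing array, while the 3-arg form clips the capacity so append must reallocate.

package main

import "fmt"

func main() {
	s := []int{1, 2, 3, 4, 5}
	a := s[0:2] // len 2, cap 5: shares spare capacity with s
	a = append(a, 99)
	fmt.Println(s) // [1 2 99 4 5] -- s[2] was silently overwritten

	t := []int{1, 2, 3, 4, 5}
	b := t[0:2:2] // len 2, cap 2: append must allocate a new array
	b = append(b, 99)
	fmt.Println(t, b) // [1 2 3 4 5] [1 2 99] -- t is untouched
}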
I just needed this functionality last week and almost opened an issue. I think it would be useful.
I think Batch is a better name.
I wonder if this might be better to wait on a general iterator API for. If one of the main intentions is to use it with range, this is a fairly inefficient way to do it, but its existence will encourage people to use it anyway. Map() and Filter() were dropped from slices for the same reason. This could fairly easily be done manually with a for loop in an efficient way, if there were some easier way to clamp that last index:
for start := 0; start < len(slice); start += chunkSize {
	end := min(start+chunkSize, len(slice))
	chunk := slice[start:end:end]
	// ...
}
Is there a common use for this that doesn't involve immediately proceeding to loop over it?
I've needed this function for 2 out of the last 3 projects I've worked on. An iterator-based approach would not have had any benefit in either of those use cases.
But would an iterator-based approach have been worse, @icholy? The inner slices don't need to be reallocated in either case, and the outer slice does need to be newly allocated in either case, so I don't think an iterator-based variant would essentially ever be worse. If there's some iters.ToSlice() function, then it would only take one extra function call to get a slice, if that would be easier to work with for a specific use case.
You probably want to be explicit about your panic condition and panic in one additional case, size == 0 (which would otherwise make the loop above spin forever without advancing):
if size < 1 {
	panic("chunk size cannot be less than 1")
}
Here's an alternate implementation that some may like more:
func Chunk[T any](slice []T, size int) (chunks [][]T) {
	if size < 1 {
		panic("chunk size cannot be less than 1")
	}
	// Guard the empty slice, which would otherwise produce one empty chunk.
	if len(slice) == 0 {
		return
	}
	for i := 0; ; i++ {
		next := i * size
		if len(slice[next:]) > size {
			end := next + size
			chunks = append(chunks, slice[next:end:end])
		} else {
			chunks = append(chunks, slice[next:])
			return
		}
	}
}
Ah, nice to see a proposal for this. I've just been using my personal implementation:
func Chunk[E any](values []E, size int) [][]E {
	if size <= 0 {
		panic("size must be > 0")
	}
	var chunks [][]E
	for remaining := len(values); remaining > 0; remaining = len(values) {
		if remaining < size {
			size = remaining
		}
		chunks = append(chunks, values[:size:size])
		values = values[size:]
	}
	return chunks
}
A group of Go developers at my company really needs this function.
I run into this frequently when calling web-based APIs, and the downside is that hand-rolled implementations are error-prone: out-of-range and off-by-one errors are easy to introduce.
My use case:
For example, the Azure Monitor API has a metricNames parameter, but it is limited to a maximum of 20 names per request (not yet documented).
Also, with the legacy (binary) APNs-style protocol, multiple request packets may be chunked and written to the network socket at once to improve performance.
If we were going to add this function, I think we'd want to use the new iterator functions and write it as
func Chunk[T any](slice []T, size int) iter.Seq[[]T]
(See #61897 for the definition of iter.Seq.)
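For concreteness, a minimal sketch of what that iterator-based version might look like (an illustration only, not the final implementation):

import "iter"

func Chunk[T any](slice []T, size int) iter.Seq[[]T] {
	return func(yield func([]T) bool) {
		for i := 0; i < len(slice); i += size {
			// Clamp the final chunk and clip capacity, as in the
			// slice-based version above.
			end := min(i+size, len(slice))
			if !yield(slice[i:end:end]) {
				return
			}
		}
	}
}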
@rsc thanks for chiming in on this old issue that existed before the new iterator proposal!
Your comment got me thinking: what about an additional chunk function that accepts a sequence of T?
func ChunkSeq[T any](seq iter.Seq[T], size int) iter.Seq[[]T]
Do you think this would be useful too? I think chunking on a stream has immediate uses, a batch handler for a message queue for instance.
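A rough sketch of how ChunkSeq could work (my own illustration; unlike the slice-based version it must buffer up to size elements, because the input is a stream):

import "iter"

func ChunkSeq[T any](seq iter.Seq[T], size int) iter.Seq[[]T] {
	return func(yield func([]T) bool) {
		batch := make([]T, 0, size)
		for v := range seq {
			batch = append(batch, v)
			if len(batch) == size {
				if !yield(batch) {
					return
				}
				// Allocate a fresh buffer: the consumer may retain
				// the slice we just yielded.
				batch = make([]T, 0, size)
			}
		}
		if len(batch) > 0 {
			yield(batch)
		}
	}
}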
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group
Finishing this proposal discussion is blocked on https://github.com/golang/go/issues/61405.
I often run into a very similar but slightly different use case -- the 2nd argument is "n" (for n slices of roughly the same size) instead of "size" (with possibly one odd slice at the end).
Something like this --
// Chunks breaks the given slice into n slices of (almost) the same size.
func Chunks[E any](s []E, n int, yield func(int, []E)) {
	if n <= 0 {
		return
	}
	var (
		size, remainder = len(s) / n, len(s) % n
		start, end      int
	)
	for i := 0; i < n; i++ {
		start, end = end, end+size
		if remainder > 0 {
			remainder--
			end++
		}
		yield(i, s[start:end])
	}
}
@avamsi that sounds more like a Partition function.
Chunk and Partition are both useful but in different scenarios. Chunk can work on an iterator, whereas Partition needs a slice. I think they both come up enough to be added.
I think this could be generalized further:
func Split[T any](seq iter.Seq[T], when func(len int, a, b T) bool) iter.Seq[iter.Seq[T]] {
	// ...
}
The when predicate returns true if it should start a new subsequence after a and before b. len is the length of the current subsequence, and len > 0, since a will have already been yielded (yelt?). If the original sequence yields fewer than two values, when is never called.
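One possible way to fill in that body (a sketch of my own using iter.Pull, assuming each inner sequence is fully consumed before the next one is requested):

import "iter"

func Split[T any](seq iter.Seq[T], when func(n int, a, b T) bool) iter.Seq[iter.Seq[T]] {
	return func(yield func(iter.Seq[T]) bool) {
		next, stop := iter.Pull(seq)
		defer stop()
		cur, ok := next()
		if !ok {
			return // empty input: no subsequences at all
		}
		done := false
		for !done {
			inner := func(yieldInner func(T) bool) {
				for n := 1; ; n++ {
					if !yieldInner(cur) {
						return
					}
					nxt, ok := next()
					if !ok {
						done = true
						return
					}
					split := when(n, cur, nxt)
					cur = nxt
					if split {
						return
					}
				}
			}
			if !yield(inner) {
				return
			}
		}
	}
}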
Then Chunk is just:
func Chunk[T any](maxLen int, seq iter.Seq[T]) iter.Seq[iter.Seq[T]] {
	return Split(seq, func(len int, _, _ T) bool {
		return len == maxLen
	})
}
It's also easy to write a function that chunks the input based on a cost function instead of length:
func SplitByCost[T any](maxCost uint, seq iter.Seq[T], costOf func(T) uint) iter.Seq[iter.Seq[T]] {
	var cost uint
	return Split(seq, func(len int, a, _ T) bool {
		if len == 1 {
			cost = 0 // reset at the start of each new subsequence
		}
		cost += costOf(a)
		return cost >= maxCost
	})
}
Or to split if some distance threshold is crossed:
func CloseTogether(threshold int, seq iter.Seq[int]) iter.Seq[iter.Seq[int]] {
	return Split(seq, func(_, a, b int) bool {
		return b-a > threshold
	})
}
Partition would be a bit more involved but still simple to fit into this scheme.
It might be worth having something similar for the special case of dividing a slice into subslices of the same backing store, as that would be much cheaper.
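For that slice special case, a rough sketch (hypothetical name SplitSlice, same when contract as above) that yields capacity-clipped subslices of the original backing array:

import "iter"

func SplitSlice[T any](x []T, when func(n int, a, b T) bool) iter.Seq[[]T] {
	return func(yield func([]T) bool) {
		start := 0
		for i := 0; i+1 < len(x); i++ {
			// i+1-start is the length of the current subslice so far.
			if when(i+1-start, x[i], x[i+1]) {
				if !yield(x[start : i+1 : i+1]) {
					return
				}
				start = i + 1
			}
		}
		if start < len(x) {
			yield(x[start:len(x):len(x)])
		}
	}
}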
Iterators are unblocked for Go 1.23 so let's figure out what to do with this. It seems like https://github.com/golang/go/issues/53987#issuecomment-1671799092 is the current winner: still a simple API, but not having to allocate the entire slice-of-slices. Do I have that right?
// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the last sub-slice will have size n.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]
More complex splitting logic is possible and sometimes useful, of course, but the max-n chunking is very common and seems like it can stand on its own.
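Hypothetical usage, assuming this signature lands in package slices and range-over-func is available:

for chunk := range slices.Chunk(objs, 1000) {
	// chunk is a sub-slice of objs with len(chunk) <= 1000.
}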
// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the first sub-slice will have size n.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]
Is that a typo for "All but the last sub-slice will have size n"?
Yes, fixed, thanks.
@rsc I do think accepting an iter.Seq[T] is important too; that way this Chunk function (and other iterator adapters) can be used as pipeline steps. Would that negatively affect the ergonomics of using Chunk? Admittedly, I stopped following this proposal closely a few months ago.
That might belong more in #61898. This is package slices.
Ah yes, it absolutely would. Thanks for pointing me towards that, I didn't know it existed!
Where the OP suggests:
// Set the capacity of each chunk so that appending to a chunk does not
// modify the original slice.
chunks = append(chunks, slice[i:i+end:i+end])
If this is the desired implementation, it seems worth mentioning in the doc string.
Yes, we should mention that the slices are sub-slices of the original and that they are clipped to remove excess capacity.
Have all remaining concerns about this proposal been addressed?
// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the last sub-slice will have size n.
// All sub-slices are clipped to have no capacity beyond the length.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]
Sorry for coming in late.
Is there a way to make it clear that the n int parameter is the max size of a chunk (besides reading the doc) and not the number of chunks?
Some time back, I created https://github.com/veggiemonk/batch. In that repository I explain why, for most use cases, one might want evenly distributed slices. In any case, the two approaches are not mutually exclusive.
Do you think I should add a proposal for Batch, or can I piggyback on this one? The naming might need some work and I need help with that, which is why I ask here: Batch <> Slices <> Chunks ...
// Batch returns an iterator over consecutive sub-slices of x, evenly divided into n batches.
// The size of each batch never deviates by more than 1 from the average batch size.
func Batch[T any](x []T, n int) iter.Seq[[]T]
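A hedged sketch of how such a Batch might be implemented (my illustration, not the code from the linked repository):

import "iter"

func Batch[T any](x []T, n int) iter.Seq[[]T] {
	return func(yield func([]T) bool) {
		if n <= 0 || len(x) == 0 {
			return
		}
		size, remainder := len(x)/n, len(x)%n
		start := 0
		// The first `remainder` batches receive one extra element,
		// so batch sizes differ by at most 1.
		for i := 0; i < n && start < len(x); i++ {
			end := start + size
			if i < remainder {
				end++
			}
			if !yield(x[start:end:end]) {
				return
			}
			start = end
		}
	}
}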
@veggiemonk, I think what you're suggesting is the same as https://github.com/golang/go/issues/53987#issuecomment-1700345945? @icholy (and @earthboundkid) suggested it be called Partition (does some other language / library use this terminology?); I renamed it to Shard in my kitchen sink library (https://github.com/avamsi/ergo/blob/bc8d6f210e71d95a8e63c513c49ed5f18d8e9710/slices/slices.go#L6).
@avamsi yes it is.
I'm not really familiar with how other languages do it, but here is lodash's: https://docs-lodash.com/v4/partition/. Indeed, partition is one name considered, but there a predicate/function is needed to define the split, which is very generic and goes beyond the simple use case we have.
So far, Batch is the only term that seems to fit, but my first language is not English, so I'll leave it to a native speaker to decide.