
slices: add Chunk function to divide []T into [][]T chunks

Open mdlayher opened this issue 2 years ago • 37 comments

A problem I've run into a fair amount is dealing with APIs that only accept a maximum number of inputs at once, when I may ultimately have more inputs than that limit to process.

For example, Amazon S3 can only delete up to 1000 objects in a single DeleteObjects API request (https://docs.aws.amazon.com/AmazonS3/latest/API/API_DeleteObjects.html). I've run into similar issues in the past when injecting a large number of routes into the Linux kernel via route netlink, or when modifying a large number of WireGuard peers at once.

To deal with this situation in a generic way, I've come up with the following:

package slices

// Chunk batches []T into [][]T in groups of size. The final chunk of []T will be
// smaller than size if the input slice cannot be chunked evenly. It does not
// make any copies of slice elements.
//
// As an example, take a slice of 5 integers and create chunks of 2 integers
// each (the final value creates a short chunk):
//   slices.Chunk([]int{1, 2, 3, 4, 5}, 2) = [][]int{{1, 2}, {3, 4}, {5}}
func Chunk[T any](slice []T, size int) [][]T {
	var chunks [][]T
	for i := 0; i < len(slice); {
		// Clamp the last chunk to the slice bound as necessary.
		end := size
		if l := len(slice[i:]); l < size {
			end = l
		}

		// Set the capacity of each chunk so that appending to a chunk does not
		// modify the original slice.
		chunks = append(chunks, slice[i:i+end:i+end])
		i += end
	}

	return chunks
}

Which is then used as follows:

	objs, err := c.allObjects(ctx)
	if err != nil {
		return nil, err
	}

	// Begin deleting the objects concurrently in batches of 1000 (the
	// maximum batch size supported by S3).
	const limit = 1000

	eg, ctx := errgroup.WithContext(ctx)
	eg.SetLimit(10)

	for _, chunk := range slices.Chunk(objs, limit) {
		chunk := chunk
		eg.Go(func() error { /* deletion logic */ })
	}

If this proposal is accepted, I'd be happy to send a CL for the above. Thanks for your time.

Edit 1: updated second parameter name to size int as inspired by Rust's chunks method: https://doc.rust-lang.org/std/primitive.slice.html#method.chunks.

Edit 2: set capacity for each chunk per next comment.

mdlayher avatar Jul 21 '22 17:07 mdlayher

A small correction to your given implementation:

https://github.com/golang/go/wiki/SliceTricks#batching-with-minimal-allocation

It's important to use the 3-arg slice form when chunking; otherwise appending to a chunk could unintentionally overwrite elements of the original slice.

https://go.dev/play/p/Zp0Dh5cB1rB
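
To see the hazard in a standalone example (arbitrary values, independent of the playground above):

package main

import "fmt"

func main() {
	s := []int{1, 2, 3, 4}

	a := s[0:2] // two-index subslice: capacity extends to the end of s
	a = append(a, 99)
	fmt.Println(s) // [1 2 99 4] -- append overwrote s[2]

	s = []int{1, 2, 3, 4}

	b := s[0:2:2] // three-index subslice: capacity clipped to the length
	b = append(b, 99)
	fmt.Println(s) // [1 2 3 4] -- append allocated a new backing array
}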

JeremyLoy avatar Jul 22 '22 01:07 JeremyLoy

https://go.dev/play/p/EP-wf5dmr3V

sten4eg avatar Jul 22 '22 14:07 sten4eg

I just needed this functionality last week and almost opened an issue. I think it would be useful.

earthboundkid avatar Jul 23 '22 15:07 earthboundkid

I think Batch is a better name.

icholy avatar Jul 25 '22 01:07 icholy

I wonder if it might be better to wait for a general iterator API before adding this. If one of the main intentions is to use it with range, this is a fairly inefficient way to do that, but its existence will encourage people to use it anyway. Map() and Filter() were dropped from slices for the same reason. This could fairly easily be done manually with a for loop in an efficient way, however, if there were some easier way to clamp that last index:

for start := 0; start < len(slice); start += chunkSize {
  end := min(start + chunkSize, len(slice))
  chunk := slice[start:end:end]

  // ...
}

Is there a common use for this that doesn't involve immediately proceeding to loop over it?

DeedleFake avatar Jul 25 '22 07:07 DeedleFake

I've needed this function in 2 of the last 3 projects I've worked on. An iterator-based approach would not have had any benefit in either of those use cases.

icholy avatar Jul 31 '22 01:07 icholy

But would an iterator-based approach have been worse, @icholy? The inner slices don't need to be reallocated in either case, and the outer slice does need to be newly allocated in either case, so I don't think that an iterator-based variant would essentially ever be worse. If there's some iters.ToSlice() function, then it would only take one extra function call to get a slice if that would be easier to work with for a specific use-case.
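
For example, with a hypothetical iters.ToSlice and an iterator-returning Chunk, it would just be:

	// iters.ToSlice is hypothetical; objs is as in the S3 example above.
	chunks := iters.ToSlice(slices.Chunk(objs, 1000)) // [][]T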

DeedleFake avatar Aug 01 '22 14:08 DeedleFake

You probably want to be explicit about your panic condition, and also panic in the additional case where size == 0:

if size < 1 {
	panic("chunk size cannot be less than 1")
}

Here's an alternate implementation that some may like more

func Chunk[T any](slice []T, size int) (chunks [][]T) {
	if size < 1 {
		panic("chunk size cannot be less than 1")
	}
	if len(slice) == 0 {
		// Avoid emitting a single empty chunk for empty input.
		return nil
	}
	for i := 0; ; i++ {
		next := i * size
		if len(slice[next:]) > size {
			end := next + size
			chunks = append(chunks, slice[next:end:end])
		} else {
			// Final (possibly short) chunk.
			chunks = append(chunks, slice[next:])
			return
		}
	}
}

kf6nux avatar Aug 23 '22 17:08 kf6nux

Ah, nice to see a proposal for this. I've just been using my personal implementation:

func Chunk[E any](values []E, size int) [][]E {
	if size <= 0 {
		panic("size must be > 0")
	}

	var chunks [][]E
	for remaining := len(values); remaining > 0; remaining = len(values) {
		if remaining < size {
			size = remaining
		}

		chunks = append(chunks, values[:size:size])
		values = values[size:]
	}

	return chunks
}

dcormier avatar Jan 06 '23 16:01 dcormier

A group of Go developers at my company very much needs this function.

I run into this frequently when calling web-based APIs, but the downside is that hand-rolling the chunking is error-prone, with out-of-range or off-by-one errors easy to introduce.

My use case:

For example, the Azure Monitor API has a metricNames parameter, but it is limited to a maximum of 20 names per request (not yet documented).

Also, with the legacy (binary) APNs-like protocol, multiple request packets may be chunked and written to the network socket at once to improve performance.

lufia avatar Jul 19 '23 07:07 lufia

If we were going to add this function, I think we'd want to use the new iterator functions and write it as

func Chunk[T any](slice []T, size int) iter.Seq[[]T]

(See #61897 for the definition of iter.Seq.)

rsc avatar Aug 09 '23 16:08 rsc

@rsc thanks for chiming in on this old issue that existed before the new iterator proposal!

Your comment got me thinking: what about an additional chunk function that accepts a sequence of T?

func ChunkSeq[T any](seq iter.Seq[T], size int) iter.Seq[[]T]

Do you think this would be useful too? I think chunking a stream has immediate uses, for instance a batch handler for a message queue.
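
A rough sketch of what ChunkSeq could look like (the buffering strategy and the panic condition here are only illustrative):

func ChunkSeq[T any](seq iter.Seq[T], size int) iter.Seq[[]T] {
	if size < 1 {
		panic("chunk size cannot be less than 1")
	}
	return func(yield func([]T) bool) {
		buf := make([]T, 0, size)
		for v := range seq {
			buf = append(buf, v)
			if len(buf) == size {
				if !yield(buf) {
					return
				}
				// Start a fresh buffer so previously yielded chunks stay intact.
				buf = make([]T, 0, size)
			}
		}
		if len(buf) > 0 {
			yield(buf)
		}
	}
}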

JeremyLoy avatar Aug 09 '23 18:08 JeremyLoy

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

rsc avatar Aug 09 '23 21:08 rsc

Finishing this proposal discussion is blocked on https://github.com/golang/go/issues/61405.

rsc avatar Aug 30 '23 17:08 rsc

I often run into a very similar but slightly different use case -- the 2nd argument is "n" (for n slices of roughly the same size) instead of "size" (with possibly one odd slice at the end).

Something like this --

// Chunks breaks the given slice into n slices of (almost) the same size.
func Chunks[E any](s []E, n int, yield func(int, []E)) {
	if n <= 0 {
		return
	}
	var (
		size, remainder = len(s) / n, len(s) % n
		start, end      int
	)
	for i := 0; i < n; i++ {
		start, end = end, end+size
		if remainder > 0 {
			remainder--
			end++
		}
		yield(i, s[start:end])
	}
}
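
For example (expected output shown in the comment):

	Chunks([]int{1, 2, 3, 4, 5, 6, 7}, 3, func(i int, chunk []int) {
		fmt.Println(i, chunk) // 0 [1 2 3], then 1 [4 5], then 2 [6 7]
	})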

avamsi avatar Aug 31 '23 04:08 avamsi

@avamsi that sounds more like a Partition function.

icholy avatar Aug 31 '23 13:08 icholy

Chunk and Partition are both useful but in different scenarios. Chunk can work on an iterator, whereas Partition needs a slice. I think they both come up enough to be added.

earthboundkid avatar Aug 31 '23 14:08 earthboundkid

I think this could be generalized further:

func Split[T any](seq iter.Seq[T], when func(len int, a, b T) bool) iter.Seq[iter.Seq[T]] {
  //...
}

The when predicate returns true if it should start a new subsequence after a and before b. len is the length of the current subsequence, and len > 0 since a will have already been yielded (yelt?). If the original sequence yields fewer than two values, when is never called.

Then Chunk is just:

func Chunk[T any](maxLen int, seq iter.Seq[T]) iter.Seq[iter.Seq[T]] {
  return Split(seq, func(len int, _, _ T) bool {
    return len == maxLen
  })
}

It's also easy to write a function that chunks the input based on a cost function instead of length:

func SplitByCost[T any](maxCost uint, seq iter.Seq[T], costOf func(T) uint) iter.Seq[iter.Seq[T]] {
  var cost uint
  return Split(seq, func(len int, a, _ T) bool {
    if len == 1 {
      cost = 0
    }
    cost += costOf(a)
    return cost >= maxCost
  })
}

Or to split if some distance threshold is crossed:

func CloseTogether(threshold int, seq iter.Seq[int]) iter.Seq[iter.Seq[int]] {
  return Split(seq, func(_, a, b int) bool {
    return b - a > threshold
  })
}

Partition would be a bit more involved but still simple to fit into this scheme.

It might be worth it to have something similar for the special case of dividing a slice into subslices of the same backing store, as that would be much cheaper.

jimmyfrasche avatar Nov 20 '23 23:11 jimmyfrasche

Iterators are unblocked for Go 1.23 so let's figure out what to do with this. It seems like https://github.com/golang/go/issues/53987#issuecomment-1671799092 is the current winner: still a simple API, but not having to allocate the entire slice-of-slices. Do I have that right?

// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the last sub-slice will have size n.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]

More complex splitting logic is possible and sometimes useful, of course, but the max-n chunking is very common and seems like it can stand on its own.

rsc avatar Jan 24 '24 18:01 rsc

// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the first sub-slice will have size n.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]

Is that a typo for "All but the last sub-slice will have size n"?

earthboundkid avatar Jan 24 '24 18:01 earthboundkid

Yes, fixed, thanks.

rsc avatar Jan 24 '24 18:01 rsc

@rsc I do think accepting an iter.Seq[T] is important too, so that this Chunk function (and other iterator adapters) can be used as pipeline steps. Would that negatively affect the ergonomics of using Chunk? Admittedly, I stopped following this proposal closely a few months ago.

JeremyLoy avatar Jan 24 '24 18:01 JeremyLoy

That might belong more in #61898. This is package slices.

rsc avatar Jan 24 '24 18:01 rsc

Ah yes, it absolutely would. Thanks for pointing me towards that, I didn't know it existed!

JeremyLoy avatar Jan 24 '24 18:01 JeremyLoy

Where the OP suggests:

		// Set the capacity of each chunk so that appending to a chunk does not
		// modify the original slice.
		chunks = append(chunks, slice[i:i+end:i+end])

If this is the desired implementation, it seems worth mentioning in the doc string.

AndrewHarrisSPU avatar Jan 24 '24 22:01 AndrewHarrisSPU

Yes, we should mention that the slices are sub-slices of the original and that they are clipped to remove excess capacity.

rsc avatar Jan 31 '24 18:01 rsc

Have all remaining concerns about this proposal been addressed?

// Chunk returns an iterator over consecutive sub-slices of up to n elements of x.
// All but the last sub-slice will have size n.
// All sub-slices are clipped to have no capacity beyond the length.
// If x is empty, the sequence is empty: there is no empty slice in the sequence.
func Chunk[T any](x []T, n int) iter.Seq[[]T]
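
For concreteness, one way that could be implemented (a sketch only, reusing the clipped three-index sub-slices discussed above):

func Chunk[T any](x []T, n int) iter.Seq[[]T] {
	if n < 1 {
		panic("chunk size cannot be less than 1")
	}
	return func(yield func([]T) bool) {
		for i := 0; i < len(x); i += n {
			// Clamp the final chunk to the end of x.
			end := min(i+n, len(x))
			// Clip the capacity so appending to a chunk cannot modify x.
			if !yield(x[i:end:end]) {
				return
			}
		}
	}
}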

rsc avatar Feb 02 '24 18:02 rsc

Sorry for coming in late. Is there a way to make it clear that the n int parameter is the maximum size of a chunk (besides reading the doc) and not the number of chunks?

Some time back, I created https://github.com/veggiemonk/batch. In that repository I explained why, for most use cases, one might want evenly distributed slices. In any case, the two are not mutually exclusive.

Do you think I should add a proposal for Batch, or can I piggyback on this one? Naming might need some work; I need help on that, which is why I'm asking here. Batch <> Slices <> Chunks ...

// Batch returns an iterator over consecutive sub-slices of x, divided evenly into n batches.
// The size of each batch never deviates by more than 1 from the average batch size.
func Batch[T any](x []T, n int) iter.Seq[[]T]

veggiemonk avatar Feb 05 '24 10:02 veggiemonk

@veggiemonk, I think what you're suggesting is the same as https://github.com/golang/go/issues/53987#issuecomment-1700345945? @icholy (and @earthboundkid) suggested it be called Partition (does some other language / library use this terminology?); I renamed it to Shard in my kitchen-sink library (https://github.com/avamsi/ergo/blob/bc8d6f210e71d95a8e63c513c49ed5f18d8e9710/slices/slices.go#L6).

avamsi avatar Feb 05 '24 10:02 avamsi

@avamsi yes it is.

I'm not really familiar with how other languages do it, but here is Lodash's: https://docs-lodash.com/v4/partition/. Indeed, partition is one name that was considered, but there it seems more like a predicate/function is needed to define the split. That is very generic and goes beyond the simple use case we have.

So far, Batch is the only term that seems to fit, but my first language is not English, so I'll leave it to a native speaker to decide.

veggiemonk avatar Feb 05 '24 10:02 veggiemonk