
Add `batch_by`, to allow grouping elements together by their size

Open JosephLenton opened this issue 1 year ago • 2 comments

Hey, I would like to propose adding a method `batch_by` (I don't know if this is the best name, but I'll use it here).

  • `batch_by` would allow you to group elements into batches by their size.
  • You provide a maximum size, and each batch aims to stay under that size.
  • If a single element is too big, it is placed into a batch on its own. I'm not 100% sure this should be the correct behaviour, but it is what I have used in the past.
  • The batches preserve the order of the iterator; it builds a batch as it goes and does not look ahead for items that would fit into earlier batches.
  • I would also propose `try_batch_by` for dealing with iterators of `Result`.

The function signature would be something like:

fn batch_by<N, F>(self, max_batch_size: N, batch_fun: F) -> BatchBy<Self, F, N>
where
    Self: Sized,
    F: Fn(&Self::Item) -> N,
    N: Num + PartialOrd + Copy,

(I'm not 100% sure the `Num` bound is needed. I used it in an implementation I wrote, but I believe it could probably be dropped.)
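To make the intended semantics concrete, here is a hedged, eager sketch of the behaviour described above (collecting into `Vec`s rather than returning a lazy adapter; the function name and bounds are illustrative, not the proposed itertools API — note it gets by with `Add` instead of `Num`):

```rust
// Illustrative eager sketch of the proposed batch_by semantics.
// An oversized element ends up in a batch of its own.
fn batch_by<T, N, F>(
    iter: impl IntoIterator<Item = T>,
    max_batch_size: N,
    mut size_of: F,
) -> Vec<Vec<T>>
where
    N: std::ops::Add<Output = N> + PartialOrd + Copy + Default,
    F: FnMut(&T) -> N,
{
    let mut batches: Vec<Vec<T>> = Vec::new();
    let mut current: Vec<T> = Vec::new();
    let mut current_size = N::default();
    for item in iter {
        let size = size_of(&item);
        // Flush the current batch if adding this item would exceed the limit.
        // Order is preserved; there is no look-ahead into earlier batches.
        if !current.is_empty() && current_size + size > max_batch_size {
            batches.push(std::mem::take(&mut current));
            current_size = N::default();
        }
        current_size = current_size + size;
        current.push(item);
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

For example, with a limit of 4 and string lengths as sizes, `["aa", "bbb", "c"]` splits into `[["aa"], ["bbb", "c"]]`: adding `"bbb"` to the first batch would exceed 4, so a new batch starts.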

Pseudo Example Code

const MAX_BATCH_SEND_SIZE: usize = 100;

// Data we are sending
let data_to_send = vec![
    "short-data",
    "very-very-very-...-very-long-data",
    "medium-length-data",
    "short-again",
    "medium-length-data-again",

    // ... imagine lots more data ...
];

// This is the function in use
let batches = data_to_send
    .into_iter()
    .batch_by(MAX_BATCH_SEND_SIZE, |data| data.len());

// Send batches of data
for batch in batches {
    let batch_to_send = batch.collect::<Vec<&'static str>>();
    send(batch_to_send).await?;
}

Motivations

I have personally needed this on several projects when grouping things to be sent on to an external service. For example, on a real-world project I needed to batch data into 10 MB groups to send to Elasticsearch.

Comments

AFAIK there is nothing like this in Itertools. There are ways to group by key, or to find the largest or smallest, but no way to say 'put these into groups of 10 MB or less'.

I have written this before, so if there is interest I would be more than happy to write a PR for Itertools.

JosephLenton avatar Sep 18 '24 00:09 JosephLenton

Hi, thanks for the idea.

I wonder if chunk_by, batching or (a generalized) coalesce might do the job, too. Did you try these and see how far you get?

Related: https://github.com/rust-itertools/itertools/pull/436
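For reference, `batching` hands a closure mutable access to the underlying iterator so it can pull as many items as it likes per output element. That shape can be sketched with only the standard library (`Peekable` plus `iter::from_fn`; the function name here is illustrative):

```rust
// Sketch of a batching-style approach using only std:
// peek at the next item and stop before the limit is exceeded.
fn batches_via_peekable<T, F>(
    iter: impl IntoIterator<Item = T>,
    max_batch_size: usize,
    mut size_of: F,
) -> impl Iterator<Item = Vec<T>>
where
    F: FnMut(&T) -> usize,
{
    let mut iter = iter.into_iter().peekable();
    std::iter::from_fn(move || {
        let mut batch = Vec::new();
        let mut total = 0;
        while let Some(item) = iter.peek() {
            let size = size_of(item);
            // Stop before exceeding the limit, unless the batch is empty
            // (an oversized item then forms a batch of one).
            if !batch.is_empty() && total + size > max_batch_size {
                break;
            }
            total += size;
            batch.push(iter.next().unwrap());
        }
        if batch.is_empty() { None } else { Some(batch) }
    })
}
```

Because peeking lets us inspect an item's size before consuming it, the closure never has to "un-take" an element that doesn't fit, which is the awkward part of writing this with `batching` directly.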

phimuemue avatar Sep 18 '24 09:09 phimuemue

One can achieve this using chunk_by by counting generations:

let mut current_batch_size = 0;
let mut generation = 0;
let batches = data_to_send
    .into_iter()
    .chunk_by(move |data| {
        let size = data.len();

        if current_batch_size + size > MAX_BATCH_SEND_SIZE {
            current_batch_size = 0;
            generation += 1;
        }

        current_batch_size += size;

        generation
    });

I actually ended up writing this twice because I got the logic wrong the first time. In practice I would put this into a helper function, so I can wrap it with tests.
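One way to make the generation-counting state machine testable on its own is to return the key closure from a helper, then hand it to `chunk_by` (the helper name is hypothetical):

```rust
// Hypothetical helper: builds the stateful key function used above,
// so the tricky generation-counting logic can be unit-tested directly.
fn batch_key_fn<T, F>(max_batch_size: usize, mut size_of: F) -> impl FnMut(&T) -> usize
where
    F: FnMut(&T) -> usize,
{
    let mut current_batch_size = 0;
    let mut generation = 0;
    move |item| {
        let size = size_of(item);
        if current_batch_size + size > max_batch_size {
            current_batch_size = 0;
            generation += 1;
        }
        current_batch_size += size;
        generation
    }
}
```

Usage would then be `data_to_send.into_iter().chunk_by(batch_key_fn(MAX_BATCH_SEND_SIZE, |data| data.len()))`, and the tests can assert directly on the sequence of generation numbers the key function produces.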

JosephLenton avatar Sep 19 '24 23:09 JosephLenton