
More ways to handle non-divisible batch sizes

oxinabox opened this issue 7 years ago

After #9, I was thinking about the ways one can handle non-divisible batch sizes. I think MLDataPattern could do with one or two more, so I will enumerate them here for consideration.

Truncate (current default).

Cut batches of the given size from the full set; discard any remainder.

Round size down (current upto/max size)

Decrease the size until it reaches a divisor. Worst case terminates when size == 1, i.e. online.

Useful for enforcing a maximum on the amount of memory used per batch.

This is equivalent to rounding the count up. It sets a minimum number of batches.
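
To make this concrete, a minimal sketch (the helper name is made up for illustration):

function round_size_down(n_obs, size)
    # shrink the requested size until it evenly divides the observations;
    # worst case this stops at size == 1
    while n_obs % size != 0
        size -= 1
    end
    size
end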

Round size up

Increase the size until it reaches a divisor. Worst case terminates when size == n_obs, i.e. full-batch.

Useful for ensuring a minimum number of observations per batch.

This is equivalent to rounding the count down. It sets a maximum number of batches.
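
Again a minimal sketch (hypothetical helper name, assuming 1 <= size <= n_obs):

function round_size_up(n_obs, size)
    # grow the requested size until it evenly divides the observations;
    # worst case this stops at size == n_obs
    while n_obs % size != 0
        size += 1
    end
    size
end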

Round size nearest

Alternately consider increasing and decreasing the size until it reaches a divisor.

Worst case terminates when the size reaches the nearer of 1 or n_obs.

This gives batches of the closest possible even size to the one requested by the user.
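
A minimal sketch of one way to do this (hypothetical helper name; ties go to the smaller candidate):

function round_size_nearest(n_obs, size)
    # search outwards from the requested size for the nearest divisor of n_obs
    for offset in 0:max(size - 1, n_obs - size)
        for candidate in (size - offset, size + offset)
            if 1 <= candidate <= n_obs && n_obs % candidate == 0
                return candidate
            end
        end
    end
end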

Remainder Batch

Take batches of the full size, then append an extra undersized batch at the end, containing the remainder.
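
A minimal sketch of the index ranges this produces (hypothetical helper name):

function remainder_batch_ranges(n_obs, size)
    # full-size ranges, with the last one truncated to hold the remainder
    [i:min(i + size - 1, n_obs) for i in 1:size:n_obs]
end

For example, remainder_batch_ranges(10, 4) gives [1:4, 5:8, 9:10].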

Uneven Batch Sizes

Increase the size of some of the batches to absorb the remainder of the division.

Assuming one's algorithm can handle batches of varying size, this is probably the ideal.

To sketch the calculations out:

function uneven_sizes(data, size)
    n_observations = nobs(data)
    n_batches = n_observations ÷ size
    remainder = n_observations % size
    # one entry per batch, starting at the requested size
    batch_sizes = fill(size, n_batches)
    @assert remainder < size
    # spread the remainder as evenly as possible across the batches
    everywhere_extra = remainder ÷ n_batches
    extra_extra = remainder % n_batches
    @assert extra_extra < n_batches
    batch_sizes .+= everywhere_extra
    batch_sizes[1:extra_extra] .+= 1
    batch_sizes
end

One could instead directly calculate index positions, and even do so lazily, using CatViews of UnitRanges.
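
For instance, a minimal sketch building on uneven_sizes above (names are illustrative; a CatView would only be needed to lazily concatenate the ranges into one index vector):

function uneven_ranges(data, size)
    sizes = uneven_sizes(data, size)
    stops = cumsum(sizes)
    starts = stops .- sizes .+ 1
    # one index range per batch; each can back a lazy datasubset/view
    [a:b for (a, b) in zip(starts, stops)]
end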


So that is 6 options, of which two are already implemented. I'm not sure that all are required, though.

There are a couple more, e.g.:

  • Uneven Batch Sizes, downwards: as above, except at the start you increase the number of batches by 1 and then shrink the batch sizes.
  • Extra Batch, oversized: as Remainder Batch, except instead of having an undersized batch at the end, you make the final batch oversized by appending the remainder (both variants are sketched below).
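
A rough sketch of those two variants, under my reading of them (names and exact rounding choices are illustrative only):

# "Uneven Batch Sizes, downwards": one extra batch, sizes shrunk to cover everything
function uneven_sizes_down(n_obs, size)
    n_batches = (n_obs ÷ size) + 1
    base = n_obs ÷ n_batches
    sizes = fill(base, n_batches)
    sizes[1:(n_obs % n_batches)] .+= 1   # distribute what base * n_batches misses
    sizes
end

# "Extra Batch, oversized": full-size batches, remainder appended to the last one
function extra_batch_oversized(n_obs, size)
    n_batches = max(n_obs ÷ size, 1)
    sizes = fill(min(size, n_obs), n_batches)
    sizes[end] += n_obs - sum(sizes)
    sizes
end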

I'm pretty sure these are all sanely implementable inside the datasubset paradigm, which is a positive sign for the overall architecture.

oxinabox, Jun 06 '17 03:06

Thanks for the detailed proposal. I agree that more options would be a good thing. I am pretty sure this would need a refactor of _compute_batch_settings to implement this cleanly.

Evizero, Jun 06 '17 08:06

I agree. Possibly it would be good to make BatchingMode a type, and then dispatch to a distinct _compute_batch_settings method for each.
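
Something like this, as a rough sketch (all names here are hypothetical, not the existing internals; assumes size <= n_obs):

abstract type BatchingMode end
struct TruncateMode  <: BatchingMode end
struct RemainderMode <: BatchingMode end
struct UnevenMode    <: BatchingMode end

# each mode computes its own per-batch sizes for n_obs observations
batch_sizes(::TruncateMode, n_obs, size) = fill(size, n_obs ÷ size)

function batch_sizes(::RemainderMode, n_obs, size)
    sizes = fill(size, n_obs ÷ size)
    n_obs % size == 0 ? sizes : push!(sizes, n_obs % size)
end

function batch_sizes(::UnevenMode, n_obs, size)
    n_batches = n_obs ÷ size
    sizes = fill(size, n_batches)
    sizes .+= (n_obs % size) ÷ n_batches
    sizes[1:(n_obs % size) % n_batches] .+= 1
    sizes
end

# the user-facing functions could then accept something like
# eachbatch(data, size = 32, mode = RemainderMode())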

oxinabox, Jun 06 '17 08:06

I was entertaining the same idea: internally use some BatchingMode, and for the user interface expose convenient keyword arguments.

Evizero, Jun 06 '17 08:06

Statistically speaking, I think that if one is randomly shuffling the data before batching each epoch (which one really should do), it doesn't really matter how this case is handled. Everything will be seen enough times.

oxinabox, Jan 23 '20 12:01