MLDataPattern.jl
More ways to handle non-divisible batch sizes
After #9, I was thinking about the ways one can handle a non-divisible batch size. I think MLDataPattern could do with one or two more. I will enumerate them here for consideration.
Truncate (current default)
Cut batches of the given size from the full set. Discard any remainder.
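A minimal sketch of what that comes down to, assuming MLDataPattern's nobs and a hypothetical helper name:

using MLDataPattern   # for nobs

function truncated_ranges(data, batch_size)
    n_batchs = nobs(data) ÷ batch_size   # full batches only
    # remainder observations past the last full batch are simply dropped
    [(i - 1) * batch_size + 1 : i * batch_size for i in 1:n_batchs]
end

e.g. with 10 observations and a batch size of 3 this gives [1:3, 4:6, 7:9], dropping the 10th observation.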
Round size down (current upto/max size)
Decrease the size until it reaches a divisor. Worst case this terminates when size == 1, i.e. online. Useful for enforcing a maximum amount of memory used per batch. This is equivalent to rounding the count up. It sets a minimum number of batches.
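A rough sketch, purely on the integer sizes (a hypothetical helper, not package code):

# decrease the requested size until it evenly divides the number of observations;
# worst case it reaches 1, i.e. online mode
function round_size_down(n_observations, batch_size)
    while n_observations % batch_size != 0
        batch_size -= 1
    end
    batch_size
end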
Round size up
Increase the size until it reaches a divisor. Worst case this terminates when size == n_obs, i.e. full-batch. Useful for ensuring a minimum number of observations per batch. This is equivalent to rounding the count down. It sets a maximum number of batches.
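And its mirror image (again just a sketch):

# increase the requested size until it evenly divides the number of observations;
# worst case it reaches n_observations, i.e. full-batch mode
function round_size_up(n_observations, batch_size)
    while n_observations % batch_size != 0
        batch_size += 1
    end
    batch_size
end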
Round size nearest
Alternately consider increasing and decreasing the size, until it reaches a divisor. Worst case this terminates when size reaches the nearer of 1 or n_obs. It gives batches of the closest possible even size to that requested by the user.
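Reusing the two sketches above, this could look roughly like the following (ties here go down, which is a choice of this sketch, not something prescribed):

function round_size_nearest(n_observations, batch_size)
    down = round_size_down(n_observations, batch_size)
    up   = round_size_up(n_observations, batch_size)
    batch_size - down <= up - batch_size ? down : up
end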
Remainder Batch
Take batches of full size, then add an extra, undersized batch at the end, containing the remainder.
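A sketch of the corresponding size calculation (hypothetical helper name):

function remainder_batch_sizes(n_observations, batch_size)
    sizes = fill(batch_size, n_observations ÷ batch_size)   # full-size batches
    leftover = n_observations % batch_size
    leftover > 0 && push!(sizes, leftover)   # one undersized batch at the end
    sizes
end

e.g. remainder_batch_sizes(10, 3) == [3, 3, 3, 1].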
Uneven Batch Sizes
Increase the size of some of the batches, to absorb the remainder of the division. Assuming one's algorithm can handle varying batch sizes, this is probably the ideal.
To sketch the calculations out:
using MLDataPattern   # for nobs

function uneven_sizes(data, size)
    n_observations = nobs(data)
    n_batchs = n_observations ÷ size
    remainder = n_observations % size
    @assert remainder < size
    # start from n_batchs batches of the requested size
    batch_sizes = fill(size, n_batchs)
    # spread the remainder as evenly as possible across the batches
    everywhere_extra = remainder ÷ n_batchs
    extra_extra = remainder % n_batchs
    @assert extra_extra < n_batchs
    batch_sizes .+= everywhere_extra
    # the first extra_extra batches absorb one more observation each
    batch_sizes[1:extra_extra] .+= 1
    batch_sizes
end
One could also directly calculate the index positions, and even do so lazily, using CatViews of UnitRanges.
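For instance, a sketch of just the index calculation (the lazy concatenation via CatViews.jl would then wrap the resulting ranges):

# turn a vector of batch sizes (e.g. from uneven_sizes above) into per-batch index ranges
function batch_index_ranges(batch_sizes)
    stops = cumsum(batch_sizes)
    starts = [1; stops[1:end-1] .+ 1]
    [start:stop for (start, stop) in zip(starts, stops)]
end

e.g. batch_index_ranges([4, 3, 3]) == [1:4, 5:7, 8:10].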
So that is 6 options, of which two are already implemented. I'm not sure that all are required, though.
There are a couple more besides, e.g. (the second is sketched below):
- Uneven Batches, down: as above, except at the start you increase the number of batches by 1, then shrink the batch_size.
- Extra Batch, down: except instead of having an undersized batch at the end, you make the final batch oversized by appending the remainder.
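The second of those could be computed roughly like this (only a sketch, with made-up names):

function oversized_final_batch_sizes(n_observations, batch_size)
    n_full = n_observations ÷ batch_size
    sizes = fill(batch_size, n_full)
    leftover = n_observations % batch_size
    if isempty(sizes)
        push!(sizes, leftover)    # fewer observations than one batch: single small batch
    else
        sizes[end] += leftover    # fold the remainder into the last batch
    end
    sizes
end

e.g. oversized_final_batch_sizes(10, 3) == [3, 3, 4].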
I'm pretty sure these are all sanely implementable inside the datasubset paradigm, which is a positive sign for the overall architecture.
Thanks for the detailed proposal. I agree that more options would be a good thing. I am pretty sure this would need a refactor of _compute_batch_settings to implement this cleanly.
I agree. Possibly it would be good to make BatchingMode a type, and then dispatch to a distinct _compute_batch_settings method for each mode.
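Roughly, that dispatch could look like this (names and signatures are illustrative only, not the package's current internals; nobs is MLDataPattern's observation count):

abstract type BatchingMode end
struct Truncate       <: BatchingMode end
struct RemainderBatch <: BatchingMode end

# one method per mode; here the settings are just (batch size, batch count)
_compute_batch_settings(data, size, ::Truncate) =
    (size, nobs(data) ÷ size)

_compute_batch_settings(data, size, ::RemainderBatch) =
    (size, cld(nobs(data), size))   # count rounded up: the last batch is undersized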
I was entertaining the same idea: internally use some BatchingMode type, and expose convenient keyword arguments for the user interface.
Statistically speaking, I think that if one is randomly shuffling the data before batching each epoch (which one really should do), it doesn't really matter how this case is handled. Everything will be seen enough times.