keras-data-generator

A problem when the sample_size is not divisible by the batch_size

Open MounirB opened this issue 4 years ago • 5 comments

I adapted DataGenerator to my Deep Learning pipeline. When the sample size is not divisible by the batch_size, the DataGenerator seems to return to the first batch without taking into account the last (smaller) batch.

Example: let A be an array of training samples and batch_size = 4, with A = [4,7,8,7,9,78,8,4,78,51,6,5,1,0]. Here A.size = 14, which is not divisible by batch_size.

The batches the DataGenerator yields during the training process are the following :

  • Batch_0 = [4,7,8,7],
  • Batch_1 = [9,78,8,4]
  • Batch_2 = [78,51,6,5]
  • Batch_3 = [4,7,8,7] (this is where the problem lies: instead of yielding Batch_3 = [1,0], the generator goes back to the first batch)

Here is a case where another generator behaves correctly when the sample size is not divisible by the batch_size: https://stackoverflow.com/questions/54159034/what-if-the-sample-size-is-not-divisible-by-batch-size-in-keras-model

For your information, I kept the following instruction as is: int(np.floor(len(self.list_IDs) / self.batch_size)). If I change np.floor to np.ceil, it seems to misbehave during the training/validation phases.
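The batches listed above can be reproduced without Keras. The following is a minimal sketch in plain Python, under the assumption that the training loop requests indices 0 to num_batches - 1 and then starts over at index 0; get_batch is a hypothetical stand-in for the generator's __getitem__, not the repository's code.

```python
import math

A = [4, 7, 8, 7, 9, 78, 8, 4, 78, 51, 6, 5, 1, 0]
batch_size = 4

# Same formula as in the issue: only full batches are counted.
num_batches = math.floor(len(A) / batch_size)


def get_batch(index):
    """Hypothetical stand-in for __getitem__: slice one batch out of A."""
    return A[index * batch_size:(index + 1) * batch_size]


# Assumption: after num_batches requests, the consumer wraps around to index 0.
batches = [get_batch(step % num_batches) for step in range(4)]
for step, batch in enumerate(batches):
    print(f"Batch_{step} = {batch}")
```

Under that assumption the fourth request repeats the first batch, and the two trailing samples are never served.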

MounirB, Sep 10 '20 10:09

As far as I know, the problem lies in __data_generation: it creates X and y with self.batch_size rows. This isn't a problem when you use np.floor, because then all batches are the same size, but you lose the last batch when the size of your data is not divisible by the batch_size. If it is not divisible and you use np.ceil, X and y get "filled" with as much data as the last batch provides, but the rest stays empty. This empty rest is the problem, because the model can't fit this uninitialized (arbitrary) data. To prevent this from happening when using np.ceil, use e.g.

true_size = len(list_IDs_temp)
X = np.empty((true_size, *self.dim, self.channels))
y = np.empty((true_size), dtype=int)

in https://github.com/afshinea/keras-data-generator/blob/866cce89f7737866acea567b34bc0997f1bc1531/my_classes.py#L42-L46 to always get an np.empty array of the matching size.
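As a rough illustration of this suggestion, the sketch below allocates X and y from the number of IDs actually in the batch. The function name data_generation, its parameters, and the np.full placeholder used in place of loading a sample from disk are assumptions for this example, not the repository's code.

```python
import numpy as np


def data_generation(list_ids_temp, labels, dim=(2, 2), channels=1):
    """Build one batch whose first axis matches the number of IDs given."""
    true_size = len(list_ids_temp)
    X = np.empty((true_size, *dim, channels))
    y = np.empty((true_size,), dtype=int)
    for i, sample_id in enumerate(list_ids_temp):
        # Placeholder (assumption): a constant array stands in for the loaded sample.
        X[i] = np.full((*dim, channels), float(sample_id))
        y[i] = labels[sample_id]
    return X, y


# A final batch of two samples gets arrays with two rows, so no row is left uninitialized.
X_last, y_last = data_generation([12, 13], {12: 1, 13: 0})
print(X_last.shape, y_last)
```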

gwirn, Apr 14 '22 08:04

FWIW, I think you have 2 options:

  1. use math.floor(len(...) / batch_size) — this keeps all batch sizes the same, but you skip over the data past the last full-size batch (if you shuffle the indices between every epoch, you'll probably cover them eventually)
  2. use math.ceil(len(...) / batch_size) — if so, you want the last batch to be smaller, rather than index past the end of the array and wrap around to the beginning of the array (or cause out-of-bounds errors)
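To make the trade-off between the two options concrete, here is a small sketch using the numbers from the original example (14 samples, batch size 4). The variable names are chosen for illustration only.

```python
import math

n_samples, batch_size = 14, 4

# Option 1: floor keeps every batch full but drops the remainder.
floor_batches = math.floor(n_samples / batch_size)
skipped_samples = n_samples - floor_batches * batch_size

# Option 2: ceil covers every sample, so the last batch may be smaller.
ceil_batches = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (ceil_batches - 1) * batch_size

print(f"floor: {floor_batches} batches, {skipped_samples} samples skipped per epoch")
print(f"ceil:  {ceil_batches} batches, last batch has {last_batch_size} samples")
```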

If you use approach (2), then you have to cap the upper bound of the batch, i.e., instead of:

https://github.com/afshinea/keras-data-generator/blob/866cce89f7737866acea567b34bc0997f1bc1531/my_classes.py#L25-L26

which can be rewritten as:

low = index * self.batch_size
high = (index+1) * self.batch_size
indexes = self.indexes[low:high]

you may want to use something like:

low = index * self.batch_size
high = min((index + 1) * self.batch_size, len(self.list_IDs))
indexes = self.indexes[low:high]

to cap at the length of the array; the last batch may be smaller if the total number of items is not a multiple of batch size.

And if you want to micro-optimize, you can replace a multiplication with an addition:

low = index * self.batch_size
high = min(low + self.batch_size, len(self.list_IDs))
indexes = self.indexes[low:high]
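Combining math.ceil with the capped upper bound could look roughly like the sketch below. CappedBatches is a hypothetical plain-Python class that stands in for a Sequence subclass and shows only the index logic; it is not the helper code referred to in this thread, and it returns the selected items directly instead of loading data.

```python
import math


class CappedBatches:
    """Hypothetical stand-in for a Sequence subclass (index logic only)."""

    def __init__(self, list_ids, batch_size):
        self.list_ids = list_ids
        self.batch_size = batch_size
        self.indexes = list(range(len(list_ids)))

    def __len__(self):
        # ceil so that the final, smaller batch is counted too
        return math.ceil(len(self.list_ids) / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        low = index * self.batch_size
        # cap the upper bound at the number of items
        high = min(low + self.batch_size, len(self.list_ids))
        return [self.list_ids[k] for k in self.indexes[low:high]]


A = [4, 7, 8, 7, 9, 78, 8, 4, 78, 51, 6, 5, 1, 0]
batches = CappedBatches(A, batch_size=4)
all_batches = [batches[i] for i in range(len(batches))]
print(all_batches)
```

With the data from the original example, the last batch holds the two remaining samples instead of wrapping around to the first batch.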

I wrote some helper classes (with tests) which can be dropped in to provide the data slicing, or you can incorporate them into your Sequence subclasses directly.

mbrukman, Dec 06 '22 03:12

@mbrukman Thank you for the nice indexing suggestion! I have two questions:

  • Is the calculation of high really necessary as slicing over the end of an array/list like
a = np.array([1, 2, 3, 4])
print(a[2:6])

would still result in [2 3]

  • Aren't you still creating an X and y that are too big in https://github.com/afshinea/keras-data-generator/blob/866cce89f7737866acea567b34bc0997f1bc1531/my_classes.py#L45-L46 when your last batch is smaller than self.batch_size, since high and low don't change the batch size? When your last batch is smaller than self.batch_size, X and y will still contain (some) entries from the initial np.empty, which is a problem as pointed out here: numpy empty

gwirn, Dec 06 '22 07:12

@gwirn wrote:

  • Is the calculation of high really necessary as slicing over the end of an array/list like
a = np.array([1, 2, 3, 4])
print(a[2:6])

would still result in [2 3]

I think it returns [3, 4], but your point is valid: slicing past the end of a native Python list or a NumPy array is allowed. I'm used to computing my array bounds precisely because I use a variety of languages, and in some of them accessing elements outside the array bounds is invalid (it either throws an exception, crashes, or causes undefined behavior), so I aim to be precise everywhere, for consistency (and peace of mind).
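The difference between slicing and indexing past the end can be checked with a plain Python list; NumPy arrays behave the same way in both cases. This is only a small illustration, with variable names chosen for the example.

```python
a = [1, 2, 3, 4]

# A slice whose end lies past the last element is clamped to the list length.
tail = a[2:6]
print(tail)  # [3, 4]

# Indexing a single element past the end is an error.
try:
    a[5]
    raised = False
except IndexError:
    raised = True
print(raised)  # True
```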

  • Aren't you still creating an X and y that are too big in https://github.com/afshinea/keras-data-generator/blob/866cce89f7737866acea567b34bc0997f1bc1531/my_classes.py#L45-L46 when your last batch is smaller than self.batch_size, since high and low don't change the batch size? When your last batch is smaller than self.batch_size, X and y will still contain (some) entries from the initial np.empty, which is a problem as pointed out here: numpy empty

Just to clarify, I am not creating an X or y that is too big, because this is not my repo and not my code. But you're correct: if we're going to make the last batch smaller than the rest, then we also have to fix the initialization to create an array of the correct size.

My implementation of Sequence creates a smaller last batch and doesn't initialize X and y to be of size batch_size; please see the code references I linked to in my earlier response.

I think some of the sample code here and in the blog post needs to be updated to handle these cases; I opened a separate issue, https://github.com/afshinea/keras-data-generator/issues/7, to ask about the license for this repo (since there isn't one now) so that we can contribute some fixes for the code.

mbrukman, Jan 04 '23 23:01

@mbrukman Yes, you are right: [3, 4], not [2, 3]. That's a typo. Thank you for your detailed answer!

gwirn, Jan 05 '23 06:01