
Iterator does not lock access to index_array when using __getitem__ in a multiprocess environment

Open oak-tree opened this issue 5 years ago • 6 comments

Background

The Keras iterator implements __getitem__ as follows: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image/iterator.py#L53

```python
def __getitem__(self, idx):
    if idx >= len(self):
        raise ValueError('Asked to retrieve element {idx}, '
                         'but the Sequence '
                         'has length {length}'.format(idx=idx,
                                                      length=len(self)))
    if self.seed is not None:
        np.random.seed(self.seed + self.total_batches_seen)
    self.total_batches_seen += 1
    if self.index_array is None:
        self._set_index_array()
    index_array = self.index_array[self.batch_size * idx:
                                   self.batch_size * (idx + 1)]
    return self._get_batches_of_transformed_samples(index_array)
```

This method is used for asynchronous access when fit_generator runs with multiprocess workers. Under asynchronous access, each worker sees its own version of self.index_array and self.total_batches_seen.

Regular use as generator

Note that this issue does not occur when the iterator is consumed as a regular generator, because the next() implementation does provide a lock to serialize multi-threaded access while computing the next batch indices.
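That locking behavior can be illustrated with a toy model (a simplified sketch, not the actual Keras implementation; the class and helper names here are invented for illustration):

```python
import threading

class LockedIterator:
    """Toy model of the next() code path: a lock guards the shared
    index generator, so concurrent threads never draw the same batch."""

    def __init__(self, n, batch_size):
        self.lock = threading.Lock()
        self.index_generator = self._flow(n, batch_size)

    def _flow(self, n, batch_size):
        indices = list(range(n))
        for i in range(0, n, batch_size):
            yield indices[i:i + batch_size]

    def next(self):
        with self.lock:  # only the index bookkeeping is serialized
            index_array = next(self.index_generator)
        # the (potentially expensive) batch transformation would run
        # outside the lock, on this thread's private index_array
        return index_array

it = LockedIterator(n=8, batch_size=2)
batches = []
threads = [threading.Thread(target=lambda: batches.append(it.next()))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(i for b in batches for i in b))  # every index appears exactly once
```

Because the lock only protects the index bookkeeping, threads never see a torn or duplicated slice of the index generator, while the heavy per-batch work still runs in parallel.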

oak-tree avatar Jan 06 '20 10:01 oak-tree

self.total_batches_seen is process-independent, isn't it? If process 1 changes the value, it wouldn't affect process 2.

When using threads, I agree that a race condition may occur, but it is rare for users to use threads.

`if self.index_array is None`

this statement should never be true when used from fit_generator.

Dref360 avatar Jan 06 '20 14:01 Dref360

@Dref360 Hey, thanks for answering.

total_batches_seen

What would happen if seed is not None? Then every worker will get a different seed. Moreover, total_batches_seen is meaningless, isn't it? Every async run of __getitem__ sees the same version of total_batches_seen, i.e. it will never get above 1.
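The total_batches_seen point can be checked with a small simulation (a hypothetical stand-in class, not Keras code): because each async call effectively operates on a fresh copy of the iterator, the counter never advances in the parent, every call reseeds with seed + 0, and the resulting shuffles come out identical.

```python
import copy
import numpy as np

class TinyIterator:
    """Minimal stand-in for the iterator's seed bookkeeping."""
    def __init__(self, seed):
        self.seed = seed
        self.total_batches_seen = 0

    def getitem(self, idx):
        if self.seed is not None:
            np.random.seed(self.seed + self.total_batches_seen)
        self.total_batches_seen += 1  # lost when the copy is discarded
        return np.random.permutation(4)

parent = TinyIterator(seed=42)
# simulate one worker call per batch: each call sees a fresh copy,
# so total_batches_seen is always 0 when the seed is computed
perms = [copy.deepcopy(parent).getitem(i) for i in range(3)]
print(perms)                       # all three permutations are identical
print(parent.total_batches_seen)   # 0: the parent never advanced
```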

index_array

I agree, my mistake; the workers never get the chance to update/set index_array, because index_array is never None at that point.

EDIT

So I think my question has changed to: why is it necessary to update the NumPy seed from __getitem__?

oak-tree avatar Jan 06 '20 15:01 oak-tree

What would happen if seed is not None? Then every worker will get a different seed.

Yeah, we should probably use a multiprocessing.Value there instead.

version of total_batches_seen, i.e. it will never get above 1.

Not really. In keras.utils.OrderedEnqueuer we spawn worker processes using a pool. Each worker has a queue of tasks, and the pool is reset every epoch. So for each worker, total_batches_seen would be the number of tasks done by that worker during the epoch.

Hope this helps!

Dref360 avatar Jan 06 '20 15:01 Dref360

@Dref360

Not really. In keras.utils.OrderedEnqueuer we spawn worker processes using a pool. Each worker has a queue of tasks, and the pool is reset every epoch. So for each worker, total_batches_seen would be the number of tasks done by that worker during the epoch.

Are you sure that the attributes are being changed? Because keras.utils.OrderedEnqueuer uses async calls on a pool, I think that changes to the instance attributes are not kept for the next iteration; i.e., every async call transfers the state of the iterator instance to the worker, so changes made during that call do not affect the original instance.

oak-tree avatar Jan 07 '20 16:01 oak-tree

Sorry, I wasn't clear enough. The original instance doesn't change, but the instance per worker will. As I said, probably not the right behavior.

Dref360 avatar Jan 07 '20 16:01 Dref360

Hmm... it is important to note that each worker gets a new copy of the instance on every async call, so it won't keep the changed attributes.

oak-tree avatar Jan 07 '20 17:01 oak-tree