keras-preprocessing
Iterator does not lock access to index_array while using __getitem__ in a multiprocess env
Background
The Keras Iterator implements `__getitem__` as follows: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image/iterator.py#L53
```python
def __getitem__(self, idx):
    if idx >= len(self):
        raise ValueError('Asked to retrieve element {idx}, '
                         'but the Sequence '
                         'has length {length}'.format(idx=idx,
                                                      length=len(self)))
    if self.seed is not None:
        np.random.seed(self.seed + self.total_batches_seen)
    self.total_batches_seen += 1
    if self.index_array is None:
        self._set_index_array()
    index_array = self.index_array[self.batch_size * idx:
                                   self.batch_size * (idx + 1)]
    return self._get_batches_of_transformed_samples(index_array)
```
This method is used for asynchronous access when `fit_generator` runs with multiprocess workers. With asynchronous access, `self.index_array` and `self.total_batches_seen` are seen differently by each worker.
Regular use as generator
Note that this issue does not happen in that case, as the `next` implementation does provide a lock mechanism for multi-threaded access when calculating the next batch indices.
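For reference, the locked path looks roughly like this (paraphrased from the same keras_preprocessing/image/iterator.py; see the file for the exact code):

```python
def next(self):
    # Only the index computation is serialized by the lock; the image
    # transformation runs outside it so it can be done in parallel.
    with self.lock:
        index_array = next(self.index_generator)
    return self._get_batches_of_transformed_samples(index_array)
```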
`self.total_batches_seen` is process-independent, isn't it? If process 1 changes the value, it wouldn't affect process 2. When using threads, I agree that a race condition may occur, but it is rare that users use threads.

> if self.index_array is None

This statement should never be true when used in `fit_generator`.
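A minimal sketch of what I mean (toy code, not the Keras iterator; the names are made up): attribute updates made inside pool workers never reach the parent process.

```python
import multiprocessing as mp


class FakeIterator:
    """Toy stand-in for the Keras Iterator, tracking a per-process counter."""

    def __init__(self):
        self.total_batches_seen = 0

    def __getitem__(self, idx):
        # Each worker process mutates ITS OWN copy of this instance.
        self.total_batches_seen += 1
        return self.total_batches_seen


iterator = FakeIterator()


def get_item(idx):
    return iterator[idx]


if __name__ == '__main__':
    with mp.Pool(processes=2) as pool:
        worker_counts = pool.map(get_item, range(8))
    print(worker_counts)                # counts as seen inside the workers
    print(iterator.total_batches_seen)  # still 0 in the parent process
```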
@Dref360 Hey, thanks for answering.

`total_batches_seen`
What would happen if `seed` is not None? Then every worker will get a different kind of seed. Moreover, `total_batches_seen` is meaningless, isn't it? As every async run of `__getitem__` will see the same version of `total_batches_seen`, i.e. it will never get above 1.

`index_array`
I agree, my mistake, the workers should never get the chance to update/set `index_array`, because `index_array` is never None.

EDIT
So I think that my question has changed to: why is it necessary to update the np.seed from `__getitem__`?
> What would happen if seed is not None? Then every worker will get a different kind of seed.

Yeah, we probably should use a `multiprocessing.Value` there instead.
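Something roughly like this (just a sketch, not a proposed patch; the names are made up): the counter lives in shared memory, so increments made in the workers are visible to every process, including the parent.

```python
import multiprocessing as mp


def init_worker(counter):
    # Give every worker a handle to the shared counter.
    global shared_batches_seen
    shared_batches_seen = counter


def get_item(idx):
    with shared_batches_seen.get_lock():
        shared_batches_seen.value += 1
        return shared_batches_seen.value


if __name__ == '__main__':
    counter = mp.Value('i', 0)  # shared int backed by shared memory
    with mp.Pool(processes=2, initializer=init_worker, initargs=(counter,)) as pool:
        print(pool.map(get_item, range(8)))  # increments accumulate across workers
    print(counter.value)                     # the parent sees the total: 8
```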
> version of total_batches_seen, i.e. it will never get above 1.

Not really. In `keras.utils.OrderedEnqueuer` we spawn worker processes using a pool. Each worker has a queue of tasks and the pool resets every epoch. So for each worker, `total_batches_seen` would be the number of tasks done by this worker for this epoch.

Hope this helps!
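A toy sketch of that per-worker counting (not OrderedEnqueuer itself; the names are made up):

```python
import multiprocessing as mp
import os

tasks_done = 0  # one independent copy of this counter per worker process


def get_item(idx):
    global tasks_done
    tasks_done += 1
    return os.getpid(), tasks_done


if __name__ == '__main__':
    with mp.Pool(processes=2) as pool:
        for pid, seen in pool.map(get_item, range(8)):
            # Each pid only counts the tasks that this particular worker ran.
            print(pid, seen)
```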
@Dref360
> Not really. In keras.utils.OrderedEnqueuer we spawn worker processes using a pool. Each worker has a queue of tasks and the pool resets every epoch. So for each worker, total_batches_seen would be the number of tasks done by this worker for this epoch.

Are you sure that the attributes are being changed? Because `keras.utils.OrderedEnqueuer` uses async calls from the pool, I think that any change to the instance attributes is not kept for the next iteration. I.e., every async call transfers the state of the iterator instance to the worker asynchronously, so changes made during that call do not affect the original instance.
Sorry I wasn't clear enough. The original instance doesn't change, but the instance per worker will. As I said, probably not the right behavior.
Hmm... it is important to note that each worker gets a new copy of the instance on every async call, so it won't save the changed attributes.