keras-preprocessing
iterator does not lock access to index_array while using __getitem__ in a multiprocess env
Background
The Keras iterator implements `__getitem__` as follows: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image/iterator.py#L53
```python
def __getitem__(self, idx):
    if idx >= len(self):
        raise ValueError('Asked to retrieve element {idx}, '
                         'but the Sequence '
                         'has length {length}'.format(idx=idx,
                                                      length=len(self)))
    if self.seed is not None:
        np.random.seed(self.seed + self.total_batches_seen)
    self.total_batches_seen += 1
    if self.index_array is None:
        self._set_index_array()
    index_array = self.index_array[self.batch_size * idx:
                                   self.batch_size * (idx + 1)]
    return self._get_batches_of_transformed_samples(index_array)
```
This method is used for asynchronous access when fit_generator runs with multi-process workers. Under async access, `self.index_array` and `self.total_batches_seen` will be seen differently by each worker.
Regular use as a generator
Note that this issue does not occur in that case, since the `next` implementation provides a lock mechanism for multi-threaded access when computing the next batch indices.
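The locking pattern referred to here can be illustrated with a minimal sketch (simplified for illustration, not the actual keras-preprocessing source): only the index computation is guarded, so the expensive batch loading can still run in parallel outside the lock.

```python
import threading

class LockedIterator:
    """Sketch: serialize access to a shared index generator across threads."""
    def __init__(self, n):
        self.lock = threading.Lock()
        self.index_generator = iter(range(n))

    def next(self):
        # The lock ensures each index is handed out to exactly one thread;
        # batch loading would happen after the lock is released.
        with self.lock:
            return next(self.index_generator)

it = LockedIterator(1000)
results = []
results_lock = threading.Lock()

def worker():
    while True:
        try:
            i = it.next()
        except StopIteration:
            return
        with results_lock:
            results.append(i)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, no index is duplicated or lost across the four threads.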
self.total_batches_seen is process-independent, isn't it? If process 1 changes the value, it wouldn't affect process 2.
When using threads, I agree that a race condition may occur, but it is rare for users to use threads.
`if self.index_array is None`
This statement should never be true when used in fit_generator.
@Dref360 Hey, thanks for answering.
total_batches_seen
What would happen if seed is not None? Then every worker would get a different seed. Moreover, total_batches_seen is meaningless, isn't it? Every async run of `__getitem__` sees the same version of total_batches_seen, i.e. it will never get above 1.
index_array
I agree, my mistake: the workers never get the chance to update/set index_array, because index_array is never None at that point.
EDIT
So I think my question has changed to: why is it necessary to update the NumPy seed from `__getitem__`?
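One plausible motivation for re-seeding per batch is reproducibility: seeding with `seed + total_batches_seen` makes the random state for a given batch a pure function of the counter, so the same batch indices come out regardless of which worker computes them. A small illustrative sketch (the function name and parameters here are hypothetical, not keras API):

```python
import numpy as np

def batch_indices(seed, total_batches_seen, n, batch_size, idx):
    # Re-seeding with seed + counter makes the shuffle deterministic:
    # any worker with the same counter state reproduces the same order.
    np.random.seed(seed + total_batches_seen)
    index_array = np.random.permutation(n)
    return index_array[batch_size * idx: batch_size * (idx + 1)]

# Two "workers" with the same counter state produce the same batch:
a = batch_indices(seed=42, total_batches_seen=3, n=10, batch_size=2, idx=1)
b = batch_indices(seed=42, total_batches_seen=3, n=10, batch_size=2, idx=1)

# A different counter value (e.g. a later epoch) reshuffles:
c = batch_indices(seed=42, total_batches_seen=4, n=10, batch_size=2, idx=1)
```

Of course, this only helps if total_batches_seen is actually advancing per worker, which is exactly what the discussion below questions.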
What would happen if seed is not None? Then every worker would get a different seed.
Yeah, we probably should use a multiprocessing.Value there instead.
version of total_batches_seen, i.e. it will never get above 1.
Not really. In keras.utils.OrderedEnqueuer we spawn worker processes using a pool.
Each worker has a queue of tasks, and the pool resets every epoch.
So for each worker, total_batches_seen would be the number of tasks done by that worker for the epoch.
Hope this helps!
@Dref360
Not really. In keras.utils.OrderedEnqueuer we spawn worker processes using a pool. Each worker has a queue of tasks, and the pool resets every epoch. So for each worker, total_batches_seen would be the number of tasks done by that worker for the epoch.
Are you sure that the attributes are being changed? Because keras.utils.OrderedEnqueuer uses async calls from a pool, I think any change to the instance attributes is not kept for the next iteration. That is, every async call transfers the state of the iterator instance to the worker, so changes made during that call do not affect the original instance.
Sorry, I wasn't clear enough. The original instance doesn't change, but the instance per worker will. As I said, probably not the right behavior.
Hmm... it is important to note that each worker gets a new copy of the instance on every async call, so it won't keep the changed attributes.