tensorflow dataset iteration is slow when IO-bound in a single process

Open jpata opened this issue 2 years ago • 0 comments

Evaluating the number of steps in the dataset is currently slow in tensorflow when the loop is IO-bound in a single process. The dataset is built with tf.data.Dataset.from_generator, which runs the generator in Python and doesn't release the GIL, so the reads happen one after another.
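To illustrate the kind of loop involved, here is a minimal sketch that uses only the Python standard library. It is not the code in BaseDatasetFactory.py and it does not use tf.data. `FakeSource`, `count_steps_serial` and `count_steps_threaded` are hypothetical names, and `time.sleep` stands in for a blocking record read. The sketch assumes the source supports random access by index, so that reads can overlap once they are no longer funneled through a single generator.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeSource:
    """Hypothetical random-access source; time.sleep simulates a blocking read."""

    def __init__(self, num_records, delay=0.005):
        self.num_records = num_records
        self.delay = delay

    def __len__(self):
        return self.num_records

    def __getitem__(self, idx):
        # A blocking call such as sleep (or real file IO) releases the GIL,
        # so several of these can be in flight at once from different threads.
        time.sleep(self.delay)
        return {"X": [float(idx)] * 4}


def count_steps_serial(source, batch_size):
    """Read every record one after another, then count the batches."""
    num_records = 0
    for idx in range(len(source)):
        _ = source[idx]
        num_records += 1
    return -(-num_records // batch_size)  # ceiling division


def count_steps_threaded(source, batch_size, workers=8):
    """Same count, but the blocking reads are overlapped in a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        num_records = sum(1 for _ in pool.map(source.__getitem__, range(len(source))))
    return -(-num_records // batch_size)


if __name__ == "__main__":
    src = FakeSource(num_records=40)
    for fn in (count_steps_serial, count_steps_threaded):
        start = time.perf_counter()
        steps = fn(src, batch_size=16)
        print(fn.__name__, steps, f"{time.perf_counter() - start:.3f}s")
```

Both functions return the same step count; only the wall time differs. Whether an equivalent speedup is achievable inside tf.data depends on the data source exposing a native dataset path, which is what the upstream change mentioned in this issue would provide.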

See here: https://github.com/jpata/particleflow/blob/58001fa7d850c20b7d50d696478926ab9be8a41f/mlpf/tfmodel/datasets/BaseDatasetFactory.py#L79C1-L84

It might require some changes upstream in tfds to support ArrayRecordDataSource.as_dataset: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/file_adapters.py#L209

jpata, Sep 14 '23 09:09