particleflow
tensorflow dataset iteration is slow when IO-bound in a single process
Counting the number of steps in a dataset in TensorFlow is currently slow when the loop is IO-bound in a single process, because we use
tf.data.Dataset.from_generator, which runs the generator in Python underneath and doesn't release the GIL.
See here: https://github.com/jpata/particleflow/blob/58001fa7d850c20b7d50d696478926ab9be8a41f/mlpf/tfmodel/datasets/BaseDatasetFactory.py#L79C1-L84
Fixing this might require some changes upstream in tfds to support ArrayRecordDataSource.as_dataset: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/file_adapters.py#L209
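
For reference, here is a minimal sketch of the pattern in question. It is not the actual code from BaseDatasetFactory.py: sample_generator, NUM_SAMPLES, BATCH_SIZE and count_steps are hypothetical names made up for illustration, and the generator yields dummy arrays instead of reading records from disk. The point is only that every element passes through the Python generator, and that the step count has to be found by iterating, since a generator-backed dataset generally does not know its length up front.

```python
import numpy as np
import tensorflow as tf

NUM_SAMPLES = 10  # hypothetical dataset size, for illustration only
BATCH_SIZE = 4    # hypothetical batch size


def sample_generator():
    # Stand-in for a Python-side data source. In the real code, each
    # iteration would read a record from disk, which is the IO-bound part.
    for i in range(NUM_SAMPLES):
        yield np.full((3,), float(i), dtype=np.float32)


# Every element is produced by calling back into the Python generator.
ds = tf.data.Dataset.from_generator(
    sample_generator,
    output_signature=tf.TensorSpec(shape=(3,), dtype=tf.float32),
)


def count_steps(dataset, batch_size):
    # Count batches by walking the whole dataset once.
    return sum(1 for _ in dataset.batch(batch_size))


num_steps = count_steps(ds, BATCH_SIZE)
print(num_steps)
```

With real record reads in place of the dummy arrays, this loop is where the time goes, because a single pass over the whole dataset is needed just to learn the number of steps.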