DataAccessor tf_dataset_factory extremely slow
Correct me if I'm wrong, but I'm fairly certain the tutorial examples that call data_accessor.tf_dataset_factory with a tfxio.TensorFlowDatasetOptions are using this code here, which calls make_tf_record_dataset. I don't see where the prefetch_buffer_size and parser_num_threads attributes documented on TensorFlowDatasetOptions are being used. Perhaps this is intentional?
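For concreteness, this is the pattern I mean (a minimal sketch based on the TFX Keras tutorials; data_accessor is the fn_args.data_accessor handed to the trainer's run_fn, and file_pattern, schema, batch_size, _LABEL_KEY and the option values are placeholders):

```python
from tfx_bsl.public import tfxio

# Sketch of the tutorial-style call; file_pattern, schema, batch_size and
# _LABEL_KEY come from my trainer module, the option values are illustrative.
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        label_key=_LABEL_KEY,
        # Documented on the options class, but I don't see where this
        # factory path actually consumes them:
        prefetch_buffer_size=256,
        parser_num_threads=8,
    ),
    schema,
)
```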
But I'm finding dataset processing through data_accessor.tf_dataset_factory extremely slow. My production training is heavily bottlenecked by this. There are a number of things that could be done here, but I don't have granular access to the method internals.
For example, the call to tf.data.experimental.parse_example_dataset here could use the parser_num_threads attribute as num_parallel_calls, but it doesn't. Likewise, deterministic is left at its default even though the sloppy_ordering parameter from TensorFlowDatasetOptions could be used to set it.
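Roughly what I'd expect the util to be doing with those options (not the actual tfx_bsl source; serialized, feature_spec and options are stand-ins for what the util already has in scope):

```python
import tensorflow as tf

# Not the real tfx_bsl code, just the shape I'd expect: `serialized` is a
# batched dataset of serialized tf.Examples, `feature_spec` the parsing spec,
# and `options` the TensorFlowDatasetOptions the caller passed in.
dataset = serialized.apply(
    tf.data.experimental.parse_example_dataset(
        feature_spec,
        num_parallel_calls=options.parser_num_threads,  # currently not forwarded
        deterministic=not options.sloppy_ordering,      # currently left at the default
    ))
dataset = dataset.prefetch(options.prefetch_buffer_size)  # also apparently unused
```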
Are there workarounds for this?
I don't see where the prefetch_buffer_size and parser_num_threads attributes documented on TensorFlowDatasetOptions are being used. Perhaps this is intentional?
They are not used in that particular util, but they are used in other places (including some internal code). They may have been added after TensorFlowDatasetOptions was introduced and only wired into the functionality they were added for.
For example, the call to tf.data.experimental.parse_example_dataset here could use the parser_num_threads attribute as num_parallel_calls, but it doesn't. Likewise, deterministic is left at its default even though the sloppy_ordering parameter from TensorFlowDatasetOptions could be used to set it.
Feel free to send out a PR that updates make_tf_record_dataset with the parameters you mentioned, and add me as a reviewer. I don't think there's a good reason for not using them (the defaults should remain backwards compatible, though).
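For illustration, the change could take roughly this shape (a sketch only, not the real tfx_bsl implementation; the added keyword arguments are hypothetical and defaulted so existing callers see no behaviour change, and it assumes TF 2.4+ for tf.data.AUTOTUNE):

```python
import tensorflow as tf

def make_tf_record_dataset(file_pattern,
                           feature_spec,
                           batch_size,
                           parser_num_threads=None,    # hypothetical new kwarg
                           prefetch_buffer_size=None,  # hypothetical new kwarg
                           sloppy_ordering=False):     # hypothetical new kwarg
  """Illustrative sketch only, not the actual tfx_bsl implementation."""
  files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
  dataset = files.interleave(
      tf.data.TFRecordDataset,
      num_parallel_calls=tf.data.AUTOTUNE,
      deterministic=not sloppy_ordering)
  dataset = dataset.batch(batch_size)
  dataset = dataset.apply(
      tf.data.experimental.parse_example_dataset(
          feature_spec,
          num_parallel_calls=parser_num_threads or 1,
          deterministic=not sloppy_ordering))
  return dataset.prefetch(
      prefetch_buffer_size if prefetch_buffer_size is not None
      else tf.data.AUTOTUNE)
```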
In particular, I think it could use a .cache() call somewhere.
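For example (purely illustrative; caching only helps if the parsed data fits in memory, or on local disk when a filename is passed, and shuffling has to happen after the cache so epoch order still varies):

```python
import tensorflow as tf

# Toy pipeline showing where a cache() could go: after the expensive parse
# step, so parsing cost is paid once rather than once per epoch.
dataset = tf.data.Dataset.range(1_000)        # stand-in for the parsed dataset
dataset = dataset.cache()                     # or dataset.cache("/tmp/cache_file") to spill to disk
dataset = dataset.shuffle(buffer_size=1_000)  # shuffle after the cache so epochs differ
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```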