DataAccessor tf_dataset_factory extremely slow
Correct me if I'm wrong, but I'm fairly certain the tutorial examples that call data_accessor.tf_dataset_factory with a tfxio.TensorFlowDatasetOptions are using this code here, which calls make_tf_record_dataset. I don't see where the prefetch_buffer_size and parser_num_threads attributes documented on TensorFlowDatasetOptions are being used. Perhaps this is intentional?
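For concreteness, this is the pattern I mean (a minimal sketch based on the TFX Keras tutorials; data_accessor is the fn_args.data_accessor handed to the trainer's run_fn, and file_pattern, schema, batch_size, _LABEL_KEY and the option values are placeholders):

```python
from tfx_bsl.public import tfxio

# Sketch of the tutorial-style call; file_pattern, schema, batch_size and
# _LABEL_KEY come from my trainer module, the option values are illustrative.
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size,
        label_key=_LABEL_KEY,
        # Documented on the options class, but I don't see where this
        # factory path actually consumes them:
        prefetch_buffer_size=256,
        parser_num_threads=8,
    ),
    schema,
)
```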
But I'm finding dataset processing through data_accessor.tf_dataset_factory extremely slow. My production training is heavily bottlenecked by this. There are a number of things that could be done here, but I don't have granular access to the method internals.
For example, the call to tf.data.experimental.parse_example_dataset here could use the parser_num_threads attribute as num_parallel_calls, but it doesn't. Likewise, deterministic is left at its default even though the sloppy_ordering parameter from TensorFlowDatasetOptions could be used to set it.
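Roughly what I'd expect the util to be doing with those options (not the actual tfx_bsl source; serialized, feature_spec and options are stand-ins for what the util already has in scope):

```python
import tensorflow as tf

# Not the real tfx_bsl code, just the shape I'd expect: `serialized` is a
# batched dataset of serialized tf.Examples, `feature_spec` the parsing spec,
# and `options` the TensorFlowDatasetOptions the caller passed in.
dataset = serialized.apply(
    tf.data.experimental.parse_example_dataset(
        feature_spec,
        num_parallel_calls=options.parser_num_threads,  # currently not forwarded
        deterministic=not options.sloppy_ordering,      # currently left at the default
    ))
dataset = dataset.prefetch(options.prefetch_buffer_size)  # also apparently unused
```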
Are there workarounds for this?
I don't see where the prefetch_buffer_size and parser_num_threads attributes documented on TensorFlowDatasetOptions are being used. Perhaps this is intentional?
They are not used in that particular util, but they are used in other places (including some internal code). They may have been added after TensorFlowDatasetOptions was introduced and only wired into the functionality they were added for.
For example, the call to tf.data.experimental.parse_example_dataset here could use the parser_num_threads attribute as num_parallel_calls, but it doesn't. Likewise, deterministic is left at its default even though the sloppy_ordering parameter from TensorFlowDatasetOptions could be used to set it.
Feel free to send out a PR that updates make_tf_record_dataset with the parameters you mentioned, and add me as a reviewer. I don't think there's a good reason for not using them (the defaults should remain backwards compatible, though).
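For illustration, the change could take roughly this shape (a sketch only, not the real tfx_bsl implementation; the added keyword arguments are hypothetical and defaulted so existing callers see no behaviour change, and it assumes TF 2.4+ for tf.data.AUTOTUNE):

```python
import tensorflow as tf

def make_tf_record_dataset(file_pattern,
                           feature_spec,
                           batch_size,
                           parser_num_threads=None,    # hypothetical new kwarg
                           prefetch_buffer_size=None,  # hypothetical new kwarg
                           sloppy_ordering=False):     # hypothetical new kwarg
  """Illustrative sketch only, not the actual tfx_bsl implementation."""
  files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
  dataset = files.interleave(
      tf.data.TFRecordDataset,
      num_parallel_calls=tf.data.AUTOTUNE,
      deterministic=not sloppy_ordering)
  dataset = dataset.batch(batch_size)
  dataset = dataset.apply(
      tf.data.experimental.parse_example_dataset(
          feature_spec,
          num_parallel_calls=parser_num_threads or 1,
          deterministic=not sloppy_ordering))
  return dataset.prefetch(
      prefetch_buffer_size if prefetch_buffer_size is not None
      else tf.data.AUTOTUNE)
```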
In particular, I think it could use a .cache() call somewhere.
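For example (purely illustrative; caching only helps if the parsed data fits in memory, or on local disk when a filename is passed, and shuffling has to happen after the cache so epoch order still varies):

```python
import tensorflow as tf

# Toy pipeline showing where a cache() could go: after the expensive parse
# step, so parsing cost is paid once rather than once per epoch.
dataset = tf.data.Dataset.range(1_000)        # stand-in for the parsed dataset
dataset = dataset.cache()                     # or dataset.cache("/tmp/cache_file") to spill to disk
dataset = dataset.shuffle(buffer_size=1_000)  # shuffle after the cache so epochs differ
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```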