ecosystem
How to avoid the aggregate (shuffle) when processing tfrecord files?
I have a very large tfrecord directory and need to filter it on some column to generate new tfrecord files.
The code looks like this:
When I run it on a Spark cluster, I see it executes in two stages.
Looking at the code at https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-connector/src/main/scala/org/tensorflow/spark/datasources/tfrecords/TensorFlowInferSchema.scala#L39
, I see there is an aggregate step for schema inference.
Can I avoid it?
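For reference, here is a minimal sketch of the pipeline described above (the paths, column name, and schema below are hypothetical, not from the original post). The aggregate comes from `TensorFlowInferSchema`, which scans the data to infer a schema; as with most Spark data sources, supplying an explicit schema via `DataFrameReader.schema` should cause the connector to skip that inference pass, and with it the aggregate/shuffle stage:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object FilterTFRecords {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-tfrecords").getOrCreate()

    // Hypothetical schema matching the features stored in the tfrecord files.
    // Passing it explicitly means TensorFlowInferSchema (and its aggregate
    // over the dataset) is never invoked.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("label", IntegerType),
      StructField("features", ArrayType(FloatType))
    ))

    val df = spark.read
      .format("tfrecords")
      .option("recordType", "Example")
      .schema(schema)              // explicit schema: no inference pass
      .load("/path/to/input")      // hypothetical input directory

    df.filter(df("label") === 1)   // hypothetical filter column
      .write
      .format("tfrecords")
      .option("recordType", "Example")
      .save("/path/to/output")     // hypothetical output directory

    spark.stop()
  }
}
```

This assumes the column types are known in advance; if the declared schema does not match the stored `Example` features, reading will fail at runtime rather than at schema-inference time.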