
How to avoid the aggregate (shuffle) step when processing TFRecord files?

Open mathetian opened this issue 2 years ago • 0 comments

I have a very large TFRecord directory and need to filter it on some columns to generate new TFRecord files.

The code looks like this: (image)

When I run it on a Spark cluster, I find that it runs in two steps: (image)

I checked the code at https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-connector/src/main/scala/org/tensorflow/spark/datasources/tfrecords/TensorFlowInferSchema.scala#L39, and the schema inference there has an aggregate step.

Can I avoid it?
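
One thing that might be worth trying, as a sketch rather than a confirmed fix: the connector's README describes the schema as optional and inferred from the TFRecords only when it is not provided, so passing an explicit schema to the reader may skip the inference pass that triggers the extra step. The feature names (`id`, `label`), the filter condition, and both paths below are hypothetical placeholders, since the original code is only available as an image.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object FilterTfRecords {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("filter-tfrecords").getOrCreate()

    // Hypothetical schema: replace with the real feature names and types
    // stored in the TFRecord files.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("label", StringType)
    ))

    // Supplying the schema up front means the reader should not need to
    // scan the data to infer one (assumption based on the connector README).
    val df = spark.read
      .format("tfrecords")
      .option("recordType", "Example")
      .schema(schema)
      .load("/path/to/input") // hypothetical input directory

    // Hypothetical filter; a plain filter is a narrow transformation
    // and does not by itself require a shuffle.
    df.filter(col("label") === "keep")
      .write
      .format("tfrecords")
      .option("recordType", "Example")
      .save("/path/to/output") // hypothetical output directory

    spark.stop()
  }
}
```

If the two-step behaviour persists with an explicit schema, the extra step presumably comes from somewhere other than schema inference, and the Spark UI's stage details would be the place to check.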

mathetian, Dec 24 '22 06:12