Add TFRecordWriter
Please add support for writing TFRecords. I believe the corresponding Python class is tf.io.TFRecordWriter.
These are notes for whoever tries to implement this.
The Python class calls through to the internal C++ API, which doesn't do much beyond adding the appropriate header and footer bytes around each record written to an open file handle. Per the format docs (https://www.tensorflow.org/tutorials/load_data/tfrecord#tfrecords_format_details), each record is the length, a CRC of the length, the data as a byte array, and then a CRC of the data. The data is the byte array from a TF Example protobuf, which we already have in TF Java. As usual the C++ API class isn't exposed in the C API, but at least in this case we can use Java's built-in CRC32, mask it using the formula from the docs, and write it out fairly easily. They encode the length using this code, which we should also be able to replicate pretty easily.
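To make the framing concrete, here is a minimal sketch of writing one record by hand. It assumes the layout from the format docs linked above (little-endian uint64 length, masked CRC of the length, payload, masked CRC of the payload) and the masking formula given there. Note that the checksum is CRC32C rather than plain CRC32; this sketch swaps in java.util.zip.CRC32C, which assumes a Java 9+ runtime. The class and method names are hypothetical and not part of TF Java.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32C;

// Hypothetical sketch class; not an existing TF Java API.
final class TFRecordWriterSketch {
    // Mask constant taken from the TFRecord format docs.
    private static final int MASK_DELTA = 0xa282ead8;

    // CRC32C of the bytes, rotated right by 15 bits, plus the mask constant.
    // Java int arithmetic wraps modulo 2^32, matching the uint32 formula.
    static int maskedCrc32c(byte[] bytes) {
        CRC32C crc = new CRC32C(); // assumes Java 9 or later
        crc.update(bytes, 0, bytes.length);
        int value = (int) crc.getValue();
        return ((value >>> 15) | (value << 17)) + MASK_DELTA;
    }

    // Encodes a 32-bit value as 4 little-endian bytes.
    private static byte[] le32(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
    }

    // Writes one record: length, CRC of length, data, CRC of data.
    static void writeRecord(OutputStream out, byte[] data) throws IOException {
        byte[] length = ByteBuffer.allocate(8)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(data.length)
                .array();
        out.write(length);
        out.write(le32(maskedCrc32c(length)));
        out.write(data);
        out.write(le32(maskedCrc32c(data)));
    }
}
```

With an Example protobuf in hand, the payload would be its serialized bytes, e.g. writeRecord(out, example.toByteArray()).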
Alternatively, it's possible to access the C++ API using JavaCPP: https://github.com/bytedeco/javacpp-presets/blob/master/tensorflow/src/gen/java/org/bytedeco/tensorflow/RecordWriter.java
See also https://github.com/tensorflow/ecosystem/blob/master/hadoop/src/main/java/org/tensorflow/hadoop/util/TFRecordWriter.java
Ah, I'd missed that it's CRC32C, i.e. CRC32 with a different polynomial than the one Java's CRC32 uses. Bit tedious to have to depend on Apache Commons Codec to get access to that version of CRC, but at least it doesn't bring anything else in with it. We could probably bring that class across, as it's part of TF already, and fix up the entry points so they are a little nicer (i.e. accept an Example protobuf rather than a byte array). @karllessard any objections to adding Apache Commons Codec as a dependency of tensorflow-framework?
FWIW, it would be less maintenance to use the C++ API, since that's what Python uses anyway, the code is already there, and it guarantees compatibility, which is the goal here.
I'm reluctant to depend on the internal TF C++ API as we don't have a team of people to track it when it changes. The C++ API changes at the whims of the Python library; it's not depended on by other projects or languages, so we've got little scope for influencing it. If they stabilise it as part of the modular TF or TF-core initiatives then it's much more palatable.
In the case of RecordWriter I wouldn't expect it to change too much, but Apache Commons Codec isn't going to change either, and the overall code size increase from including Apache Commons Codec is probably a lot smaller, given how big the preset you linked to is.
We don't need to wrap everything; it's probably going to work just by including that one small header file.
Should we also have TFRecordReader?
Reader and Writer both will be useful!
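For reference, a reader is essentially the inverse of the record framing. Below is a minimal sketch, assuming the layout from the TFRecord format docs (little-endian uint64 length, masked CRC32C of the length, payload, masked CRC32C of the payload) and a Java 9+ runtime for java.util.zip.CRC32C and InputStream.readNBytes. The class and method names are hypothetical, not an existing TF Java API.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.CRC32C;

// Hypothetical sketch class; not an existing TF Java API.
final class TFRecordReaderSketch {
    // Mask constant taken from the TFRecord format docs.
    private static final int MASK_DELTA = 0xa282ead8;

    // CRC32C rotated right by 15 bits, plus the mask constant (wraps mod 2^32).
    static int maskedCrc32c(byte[] bytes) {
        CRC32C crc = new CRC32C(); // assumes Java 9 or later
        crc.update(bytes, 0, bytes.length);
        int value = (int) crc.getValue();
        return ((value >>> 15) | (value << 17)) + MASK_DELTA;
    }

    // Returns the next record's payload, or null at a clean end of stream.
    static byte[] readRecord(InputStream in) throws IOException {
        byte[] lengthBytes = new byte[8];
        int read = in.readNBytes(lengthBytes, 0, 8);
        if (read == 0) {
            return null; // no more records
        }
        if (read < 8) {
            throw new EOFException("truncated record length");
        }
        checkCrc(in, lengthBytes, "length");
        long length = ByteBuffer.wrap(lengthBytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IOException("record too large for a byte array: " + length);
        }
        byte[] data = new byte[(int) length];
        if (in.readNBytes(data, 0, data.length) < data.length) {
            throw new EOFException("truncated record data");
        }
        checkCrc(in, data, "data");
        return data;
    }

    // Reads the stored 4-byte CRC and compares it with the CRC of the covered bytes.
    private static void checkCrc(InputStream in, byte[] covered, String what) throws IOException {
        byte[] crcBytes = new byte[4];
        if (in.readNBytes(crcBytes, 0, 4) < 4) {
            throw new EOFException("truncated " + what + " CRC");
        }
        int stored = ByteBuffer.wrap(crcBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (stored != maskedCrc32c(covered)) {
            throw new IOException("corrupt record: " + what + " CRC mismatch");
        }
    }
}
```

Calling readRecord in a loop until it returns null would iterate over a file; each payload could then be parsed as an Example protobuf.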
In any case, if someone wants to use what's available in TF Core just like the Python API is doing, let me know and I'll be happy to add and maintain the necessary couple of header files for JavaCPP!
Why can't we copy the following files from the TensorFlow ecosystem repo (https://github.com/tensorflow/ecosystem/blob/master/hadoop/src/main/java/org/tensorflow/hadoop/util)?
TFRecordReader.java
TFRecordWriter.java
Crc32C.java
Crc32C uses org.apache.commons.codec.digest.PureJavaCrc32C