Support `zstandard` as a compression type for TFRecordDataset
System information
- TensorFlow version (you are using): 2.4
- Are you willing to contribute it (Yes/No): Yes
Describe the feature and the current behavior/state.
Currently: tf.data.TFRecordDataset supports GZIP and ZLIB (and possibly SNAPPY) compression types.
Proposed: support zstandard as a compression type.
Will this change the current api? How?
Yes. This will add one more accepted value for the compression_type argument of tf.data.TFRecordDataset.
Who will benefit with this feature?
All users who train from TFRecords via tf.data.TFRecordDataset should benefit from faster decompression and a better compression ratio.
In general, training data is ingested on hosts attached to accelerators, and those hosts have a limited number of CPUs. In contrast, production of training data is typically scaled out horizontally across many machines. Training speed can therefore be bottlenecked by any of the following: network bandwidth to the host, deserialization, decompression, and host-to-device transfer. Decompression is one of these steps and is worth optimizing.
The experimental tf.data.service seeks to address all of these possible bottlenecks, but it may not be ideal for users who do not wish to deploy a distributed service alongside training.
zstandard decompression speed compared
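The decompression cost discussed above can be measured on a given host with a rough harness like the one below. This is a minimal sketch, not a benchmark from this issue: it uses only the Python standard library, so it covers the ZLIB and GZIP codecs that TFRecordDataset already accepts. The helper name `measure_decompression` and the synthetic payload are assumptions made for illustration.

```python
import gzip
import time
import zlib


def measure_decompression(compress, decompress, payload, repeats=5):
    """Return (compression ratio, decompression throughput in MB/s)."""
    blob = compress(payload)
    # Verify the round trip before timing anything.
    assert decompress(blob) == payload
    start = time.perf_counter()
    for _ in range(repeats):
        decompress(blob)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(blob)
    # Throughput is measured against the uncompressed size.
    throughput = len(payload) * repeats / max(elapsed, 1e-9) / 1e6
    return ratio, throughput


# Synthetic, repetitive payload standing in for serialized records (an assumption).
payload = b"".join(b"feature_%d:%d;" % (i % 50, i) for i in range(200_000))

codecs = {
    "ZLIB": (zlib.compress, zlib.decompress),
    "GZIP": (gzip.compress, gzip.decompress),
}

results = {
    name: measure_decompression(compress, decompress, payload)
    for name, (compress, decompress) in codecs.items()
}

for name, (ratio, mbps) in results.items():
    print(f"{name}: ratio={ratio:.2f}, decompress={mbps:.0f} MB/s")
```

If a zstd binding is installed, its compress and decompress callables could be added to `codecs` to compare against the two built-in codecs on the same payload. Real serialized examples will compress differently from this synthetic data, so the numbers are only indicative.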
@yongtang this is what we spoke briefly about during the TF IO SIG meeting today
@rllin As discussed in the meeting, the extra compression support will likely land in the tensorflow-io package. I will transfer the issue to the tensorflow/io repo and we can continue the discussion there.
@yongtang oh, I may be forgetting. Can you remind me why it goes through tensorflow-io? This should fall under tf.data.TFRecordDataset rather than some tf.io.___Dataset.
@rllin Adding ZSTANDARD compression (beyond zlib/gzip) likely falls under tensorflow-io, as tensorflow normally supports only a minimal set of compression types (only zlib/gzip, not just for TFRecordDataset but for other dataset formats as well). tensorflow-io serves as a package to hold additional file formats and extensions that are not part of the tensorflow core package.
@rllin any update on this?