
Multi-threaded compression?

Open • khdlr opened this issue 1 year ago • 1 comment

What I need help with / What I was wondering

I need to build a large dataset of imagery with more than 3 channels (multi-spectral satellite imagery), so I'm relying on the tfds.features.Tensor feature connector. As writing the data uncompressed is highly inefficient, I'm using tfds.features.Encoding.ZLIB for compression.

However, this compression step actually becomes the bottleneck in my dataset building process as it is single-threaded, causing my dataset build to take longer than a month.

What I've tried so far

I read up on the docs and also checked the tf.io namespace for possible workarounds.

It would be nice if...

  • Is there any way of speeding up the encoding/compression of the examples by using multiple cores?
  • Are there plans to support a faster compression method than ZLIB for generic Tensor features?
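For illustration, here is a minimal sketch of what spreading the compression over several cores could look like outside of TFDS. It is not a TFDS API and does not hook into the tfds.features.Tensor connector. It only uses Python's standard zlib and concurrent.futures modules, and relies on the assumption that CPython's zlib releases the GIL while compressing, so plain threads can run in parallel. The helper names (compress_example, compress_all, make_fake_example) and the synthetic byte payloads are hypothetical stand-ins for serialized tensors.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_example(raw: bytes, level: int = 6) -> bytes:
    """Compress one serialized example with zlib."""
    return zlib.compress(raw, level)


def compress_all(raw_examples, workers: int = 4):
    """Compress many examples concurrently, preserving input order.

    Assumption: zlib.compress releases the GIL in CPython, so threads
    can use several cores without the overhead of pickling data
    between processes.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(compress_example, raw_examples))


def make_fake_example(seed: int, size: int = 1 << 16) -> bytes:
    """Hypothetical stand-in for the raw bytes of one multi-channel image."""
    return bytes((seed + i) % 251 for i in range(size))


if __name__ == "__main__":
    raws = [make_fake_example(i) for i in range(32)]
    compressed = compress_all(raws, workers=4)
    total_in = sum(len(r) for r in raws)
    total_out = sum(len(c) for c in compressed)
    print(f"compressed {total_in} bytes down to {total_out} bytes")
```

Whether this helps in practice depends on where the compression happens inside the dataset builder. If it cannot be moved out of the single-threaded write path, splitting the build into independent shards that run as separate processes is the more general form of the same idea.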

khdlr, Mar 28 '24 08:03

Same problem here when preparing TFRecords before training.

noahzhy, Apr 02 '24 14:04