
Shard size is automatically determined to produce ~100MB tfrecords files

Open ohinds opened this issue 1 year ago • 3 comments

According to the tensorflow user guide, tfrecords files should be ~100MB (https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/performance/overview.md). When tfrecords datasets are constructed from files, the shard size could be automatically computed to follow this guidance.
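One way to implement this (a sketch only; the function names are illustrative and not nobrainer's API) is to measure the average serialized example size and derive the examples-per-shard and shard count from a ~100MB target:

```python
# Sketch: derive a shard size (examples per shard) targeting ~100 MB per
# tfrecord file. All names here are illustrative, not nobrainer's API.

TARGET_SHARD_BYTES = 100 * 1024 * 1024  # ~100 MB, per the TF performance guide

def examples_per_shard(avg_example_bytes, target_bytes=TARGET_SHARD_BYTES):
    """Number of serialized examples that fit in one ~target_bytes shard."""
    if avg_example_bytes <= 0:
        raise ValueError("average example size must be positive")
    return max(1, target_bytes // avg_example_bytes)

def num_shards(total_bytes, target_bytes=TARGET_SHARD_BYTES):
    """Number of shards needed to keep each file near target_bytes."""
    return max(1, -(-total_bytes // target_bytes))  # ceiling division

# Example: a 256**3-voxel float32 volume is ~64 MiB per example,
# so at most one example fits in a 100 MB shard.
avg = 256 ** 3 * 4
print(examples_per_shard(avg))  # -> 1
```

Note that for typical brain-imaging volumes a single example can already approach the 100MB target, which is part of why a fixed examples-per-shard default produces such variable file sizes.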

ohinds avatar Aug 25 '23 20:08 ohinds

100MB doesn't make sense on fast disk systems like we have on openmind, or for brain imaging data. I believe we have played with TB-sized shards as well. I would make this a user-controllable parameter.

satra avatar Aug 25 '23 21:08 satra

Well, the default currently produces tfrecord file sizes of about 20MB, so that makes even less sense. I'm suggesting an automatically determined default, with the facility for people to override it if they want something else.

Also, specifying the shard size in bytes makes far more sense than specifying a number of examples, as it currently is.

ohinds avatar Aug 25 '23 21:08 ohinds

Probably a combination of `du -hL /path/to/data` and this might do?
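The measurement half of that idea could be sketched in Python rather than shelling out (hypothetical helpers, not nobrainer code): sum the on-disk size of the source files, following symlinks as `du -hL` does, then derive a shard count for a ~100MB target:

```python
# Sketch of the `du -hL` idea in Python: follow symlinks and sum the
# on-disk size of the source files, then derive a shard count that
# targets ~100 MB per tfrecord file. Illustrative only.
import os

def dataset_bytes(root):
    """Total size in bytes of all files under root (symlinks followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def shards_for(root, target_bytes=100 * 1024 * 1024):
    """Shard count keeping each tfrecord file near target_bytes."""
    return max(1, -(-dataset_bytes(root) // target_bytes))  # ceiling division
```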

hvgazula avatar Mar 30 '24 00:03 hvgazula