nobrainer
Shard size is automatically determined to produce ~100MB tfrecords files
According to the TensorFlow performance guide (https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/performance/overview.md), tfrecords files should be ~100MB each. When tfrecords datasets are constructed from files, the shard size could be automatically computed to follow this guidance.
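For illustration, a minimal sketch of that computation. `example_nbytes` would have to come from the serialized size of one example, and none of these names are existing nobrainer API:

```python
# Sketch only: derive shard counts from a ~100MB byte target.
TARGET_SHARD_NBYTES = 100 * 1024 ** 2  # ~100MB, per the TF performance guide


def examples_per_shard(example_nbytes, target_shard_nbytes=TARGET_SHARD_NBYTES):
    """How many serialized examples fit in one shard of roughly the target size."""
    return max(1, target_shard_nbytes // example_nbytes)


def num_shards(n_examples, example_nbytes):
    """Number of shards needed to hold the whole dataset at the target size."""
    per_shard = examples_per_shard(example_nbytes)
    return -(-n_examples // per_shard)  # ceiling division
```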
100MB doesn't make sense on fast disk systems like we have on OpenMind, or for brain imaging data. I believe we have played with TB-sized shards as well. I would make this a user-controllable parameter.
Well, the default currently produces tfrecord file sizes of about 20MB, so that makes even less sense. I'm suggesting an automatically determined default, with the facility for people to override it if they want something else.
Also, specifying the shard size in bytes makes much more sense than the current approach of specifying a number of examples.
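To make that concrete, here is a hypothetical writer that takes a byte budget instead of an example count. This is a sketch under the assumption that examples are already serialized and roughly uniform in size; it is not nobrainer's actual interface:

```python
import tensorflow as tf


def write_tfrecords(serialized_examples, out_pattern, example_nbytes,
                    shard_nbytes=100 * 1024 ** 2):
    """Write serialized tf.train.Example protos into shards of ~shard_nbytes.

    shard_nbytes defaults to ~100MB but is user-overridable, e.g. for
    fast disk systems where much larger shards are preferable.
    """
    per_shard = max(1, shard_nbytes // example_nbytes)
    writer = None
    for i, example in enumerate(serialized_examples):
        if i % per_shard == 0:  # start a new shard
            if writer is not None:
                writer.close()
            writer = tf.io.TFRecordWriter(out_pattern.format(shard=i // per_shard))
        writer.write(example)
    if writer is not None:
        writer.close()
```

A caller wanting TB-sized shards would just pass a larger `shard_nbytes`, so the ~100MB guidance stays as a default rather than a hard rule.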
Probably a combination of `du -hL /path/to/data` and this might do?
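For the `du` part, a rough Python equivalent might look like the following. Note that `du` reports disk usage in blocks, while this sums apparent file sizes; the function name is illustrative:

```python
import os


def total_nbytes(root):
    """Sum file sizes under `root`, following symlinks like `du -L` does."""
    total = 0
    for dirpath, _, filenames in os.walk(root, followlinks=True):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Dividing that total by the target shard size in bytes would then give the number of shards.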