
Why are TFRecords split into multiple shards in TensorFlow?


The main reason is to avoid storing and reading very large files; splitting the data into several shards is more efficient.

> When we say shards in data generator (t2t-datagen) it just means that we split large files into a number of smaller files. It's usually better to not have gigabyte-sized files, and reading from multiple files can be faster, that's why we do it. And yes, you can use 1 shard for 100k sentences, though I think having 10 is still fine too. (lukaszkaiser)
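A minimal sketch of what this looks like with the `tf.data` API (this is not the t2t-datagen implementation); the shard count, the file-name pattern, and the `id` feature are illustrative assumptions:

```python
import tensorflow as tf

# Write 100k example records into 10 shard files instead of one large file.
NUM_SHARDS = 10  # hypothetical shard count

records = (
    tf.train.Example(features=tf.train.Features(feature={
        "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[i]))
    })).SerializeToString()
    for i in range(100_000)
)

writers = [
    tf.io.TFRecordWriter(f"train-{i:05d}-of-{NUM_SHARDS:05d}.tfrecord")
    for i in range(NUM_SHARDS)
]
for n, record in enumerate(records):
    writers[n % NUM_SHARDS].write(record)  # round-robin across shards
for w in writers:
    w.close()

# Reading: shuffle at the file level and read several shards in parallel.
files = tf.data.Dataset.list_files("train-*-of-*.tfrecord", shuffle=True)
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,                      # read 4 shards concurrently
    num_parallel_calls=tf.data.AUTOTUNE)
```

The reading side is where the split pays off: `list_files(..., shuffle=True)` plus `interleave` lets the input pipeline shuffle and read at the file level in parallel, which a single gigabyte-sized file cannot offer.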

References:

  1. What is the relationship between shards and minibatches in the wmt example?
  2. What is the benefit of splitting tfrecord file into shards?
