
Why are TFRecords split into multiple shards in TensorFlow?


The main reason is to avoid storing and reading very large files; splitting the data into several shards is more efficient.

> When we say shards in data generator (t2t-datagen) it just means that we split large files into a number of smaller files. It's usually better to not have gigabyte-sized files, and reading from multiple files can be faster, that's why we do it. And yes, you can use 1 shard for 100k sentences, though I think having 10 is still fine too. (lukaszkaiser)
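A minimal sketch of what this looks like with the `tf.data` API (this is not the t2t-datagen implementation); the shard count, the file-name pattern, and the `id` feature are illustrative assumptions:

```python
import tensorflow as tf

# Write 100k example records into 10 shard files instead of one large file.
NUM_SHARDS = 10  # hypothetical shard count

records = (
    tf.train.Example(features=tf.train.Features(feature={
        "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[i]))
    })).SerializeToString()
    for i in range(100_000)
)

writers = [
    tf.io.TFRecordWriter(f"train-{i:05d}-of-{NUM_SHARDS:05d}.tfrecord")
    for i in range(NUM_SHARDS)
]
for n, record in enumerate(records):
    writers[n % NUM_SHARDS].write(record)  # round-robin across shards
for w in writers:
    w.close()

# Reading: shuffle at the file level and read several shards in parallel.
files = tf.data.Dataset.list_files("train-*-of-*.tfrecord", shuffle=True)
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,                      # read 4 shards concurrently
    num_parallel_calls=tf.data.AUTOTUNE)
```

The reading side is where the split pays off: `list_files(..., shuffle=True)` plus `interleave` lets the input pipeline shuffle and read at the file level in parallel, which a single gigabyte-sized file cannot offer.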

References:

  1. What is the relationship between shards and minibatches in the wmt example?
  2. What is the benefit of splitting tfrecord file into shards?
