Tom

Results 170 comments of Tom

Perhaps. But that would be an incompatible change. Sometimes people make bad picks for names and we are stuck with it. I'll try to improve the documentation.

Version information is updated by `invoke newversion`. That was broken, but I have fixed it now.

Note that resampling after splitting results in slightly uneven sample probabilities. The ResampledShards implementation works great for large-scale training with fast object stores. This is the case on high...
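
A rough sketch of the idea behind shard resampling, in plain Python rather than the actual ResampledShards implementation (which also handles epochs and worker seeding); the function name and arguments here are illustrative assumptions:

```python
import random

def resampled_shards(urls, nshards=None, seed=0):
    """Yield shard URLs sampled uniformly with replacement.

    Illustrative sketch only: the real ResampledShards in webdataset
    does more (per-worker seeding, epoch handling). With nshards=None
    this is an infinite stream, which is what makes it convenient for
    large-scale training loops that run for a fixed number of steps.
    """
    rng = random.Random(seed)
    count = 0
    while nshards is None or count < nshards:
        yield rng.choice(urls)
        count += 1

shards = list(resampled_shards(["s0.tar", "s1.tar", "s2.tar"], nshards=5))
```

Because sampling is with replacement, individual shards can repeat within a pass; that is the trade-off that buys uniform sampling probabilities.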

Sorry, the FAQ was wrong. There are two methods:
- `with_length(n)` adds a `__len__` method to the pipeline so that `len(dataset)` returns `n`. It does not actually change anything about...
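
A minimal sketch of the mechanism behind `with_length(n)`, using a generic wrapper class rather than the actual webdataset pipeline code:

```python
class WithLength:
    """Wrap an iterable and report a fixed length via __len__.

    Sketch of the idea behind with_length(n): it only changes what
    len() reports, not how many samples the pipeline actually yields.
    """

    def __init__(self, source, n):
        self.source = source
        self.n = n

    def __iter__(self):
        return iter(self.source)

    def __len__(self):
        return self.n

ds = WithLength(range(10), 1000)
print(len(ds))            # reports 1000
print(sum(1 for _ in ds)) # but iteration still yields 10 samples
```

This is why the declared length and the true number of samples can disagree: `__len__` is metadata for consumers like DataLoader, nothing more.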

The value 2 should be good enough. If your dataset is so unbalanced that 2 is not good enough, you'll get an error, and that's my preferred behavior. But you can...

Sorry this took so long. I have added an mtime option to the ShardWriter class, so you can set that to any floating point value you want. I also have...
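
What setting a fixed mtime buys you is byte-for-byte reproducible shards. A sketch of the underlying mechanism using the standard-library `tarfile` module directly (the `write_shard` helper is hypothetical, not the ShardWriter API):

```python
import io
import tarfile

def write_shard(path, samples, mtime=0.0):
    """Write (name, bytes) samples to a tar shard with a fixed mtime
    on every member.

    Sketch of what an mtime option does internally (assumed): pinning
    the timestamp makes repeated writes of the same samples produce
    identical shard bytes.
    """
    with tarfile.open(path, "w") as tar:
        for name, data in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
```

With the timestamp pinned, rebuilding a shard from unchanged data produces an identical file, which plays well with content-addressed storage and caching.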

Yes, adding TIFF decoding/encoding is easy for users to do. The reason it isn't done by default is because TIFF has many format variants and it is difficult to decode...
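
A sketch of the extension-dispatch pattern a user-supplied decoder plugs into. This mirrors the shape of webdataset's custom-decoder handlers but is written here from scratch; the exact signatures and the stub decoder are illustrative assumptions, and a real TIFF decoder would use something like PIL or tifffile:

```python
def handle_extension(extensions, decoder):
    """Return a handler that applies `decoder` to keys ending in one of
    the space-separated `extensions`, and returns None otherwise so
    other handlers get a chance to try.

    Illustrative sketch of the dispatch pattern, not the library API.
    """
    exts = extensions.split()

    def handler(key, data):
        if any(key == e or key.endswith("." + e) for e in exts):
            return decoder(data)
        return None

    return handler

# Hypothetical stand-in for a real TIFF decoder (which would handle
# the many TIFF format variants -- exactly why it isn't a default).
tiff_handler = handle_extension("tif tiff", lambda data: ("decoded", len(data)))
```

Keeping the decoder user-supplied means each project can pick the TIFF variant handling it actually needs instead of inheriting a default that fails on exotic files.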

You are not repeating your training data infinitely, so this won't work. If you want exactly one permutation of the training data per epoch and you want this distributed equally...

You can use RandomMix to mix sources with arbitrary probabilities, or you can use MultiShardSample to sample at the shard level.
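
A sketch of what probabilistic mixing looks like, written as a small generator rather than the actual RandomMix implementation (stopping behavior when a source runs dry is an assumption here; the library may handle exhaustion differently):

```python
import random

def random_mix(sources, probs, seed=0):
    """Yield samples from several iterables, choosing a source at each
    step with the given probabilities; stop when a chosen source is
    exhausted.

    Illustrative sketch of the RandomMix idea, not the library code.
    """
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    while True:
        it = rng.choices(iters, weights=probs, k=1)[0]
        try:
            yield next(it)
        except StopIteration:
            return

mixed = list(random_mix([["a"] * 100, ["b"] * 100], [0.5, 0.5]))
```

Mixing at the sample level like this gives fine-grained interleaving; shard-level sampling trades that granularity for better sequential I/O.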