Tom

170 comments by Tom

The reason there is no globbing support is that different object stores offer no consistent API for it. All we can consistently support is "read the contents of...

I recommend using the v2 branch. Among other things, in v2, node and worker splitting are explicit. There is a backwards compatible wrapper, so the switch should be easy. ```Python...

1. Yes, correct, UNIX IPC is probably not available on Windows. You can modify the protocol in the source code; I'll see whether I can make it configurable. 2. I'll...

Done. TODO: add non-pytorch/tf integration tests

This may be a bug; let me see whether I can reproduce it. The combination of resampled/with_epoch should always give rise to an epoch of the exact length. Could you...
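To make the expected behavior concrete, here is a minimal pure-Python sketch of the idea: an endless resampling source cut to a fixed epoch length. It is not WebDataset's implementation; the functions `resampled` and `with_epoch` below only borrow the names from the comment, and the shard names are hypothetical.

```python
import itertools
import random


def resampled(shards, seed=0):
    """Yield shard names forever, sampling with replacement."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(shards)


def with_epoch(source, length):
    """Cut a (possibly infinite) iterator into an epoch of exactly `length` items."""
    return itertools.islice(source, length)


# Hypothetical shard names, used only for illustration.
shards = ["shard-000.tar", "shard-001.tar", "shard-002.tar"]

# Because the resampled source never ends, the epoch always has exactly 10 items.
epoch = list(with_epoch(resampled(shards), 10))
```

Under these assumptions, the epoch length depends only on the requested length, not on how many shards exist.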

As far as WebDataset is concerned, the length does not matter, anywhere. WebDataset just iterates through data sources until it reaches the end and then raises a StopIteration. You can...
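The "iterate until the end, then StopIteration" behavior is ordinary Python iterator protocol. The sketch below illustrates it with a plain generator; `samples` is a hypothetical stand-in for a stream of records, not a WebDataset function.

```python
def samples(sources):
    """Yield every item from each source in turn; no length is ever needed."""
    for source in sources:
        for item in source:
            yield item


stream = samples([[1, 2], [3]])

# A generator has no __len__; consumers simply read until it is exhausted.
has_len = hasattr(stream, "__len__")

# list() keeps calling next() until the generator raises StopIteration.
collected = list(stream)

# Once exhausted, next() raises StopIteration; the default value avoids the exception here.
after_end = next(stream, "exhausted")
```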

I'll add something to the documentation. There isn't really anything WebDataset can do to fix it: once `s3cmd` mixes the outputs, WebDataset can't unmix them. Programs like s3cmd shouldn't print...

You can turn any object with an `__iter__` method into a `Processor`, so the following should work: ``` connection = Connection(...) # or TensorcomDataset(...) src = Processor(connection, utils.identity) rebatched =...

Instead of a string, you can pass a list, so you can just write:
```Python
shards = list(glob(pattern))
WebDataset(shards)
```
(TODO: add support for iterables in addition to lists)

Right now, your records are 4144 bytes each, and each record requires 5636 bytes to store, so you have about a 25% space and performance overhead. You should be able...
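The overhead figure can be checked with a little arithmetic on the two sizes quoted in the comment. This is only a worked calculation; the variable names are illustrative.

```python
payload_bytes = 4144  # size of one record's data, from the comment
stored_bytes = 5636   # bytes needed to store one record, from the comment

# Extra bytes spent per record on top of the payload.
overhead_bytes = stored_bytes - payload_bytes

# Overhead as a share of what is actually stored (roughly a quarter).
share_of_storage = overhead_bytes / stored_bytes

# Overhead relative to the payload itself (a larger ratio, since the base is smaller).
relative_to_payload = overhead_bytes / payload_bytes
```

Measured against the stored size, the overhead is a bit over a quarter, which matches the "about 25%" in the comment; measured against the payload alone, it is closer to a third.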