James Knighton

27 comments by James Knighton

https://github.com/mosaicml/streaming/pull/570 is merged.

How can we do this while still keeping the local dir collision check across all users?

Oh, no! A serialized streaming dataset consists of two things, which must be in agreement: 1. an `index.json` file which contains essential metadata about each shard (including how many samples...

Hey, sorry I missed this. > Perhaps the write to S3 failed somehow Yeah, my best guess is there may have been an exception during multi-threaded shard uploading (during writing...

If this is helpful for debugging, note that an index.json file is just a list of dicts, where each dict is the metadata for a shard like filename and how...
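For debugging along these lines, a small inspection script can list each shard and its sample count. This is only a sketch under assumptions: the helper `summarize_index` is hypothetical, and the key names `shards`, `raw_data`, `basename`, and `samples` are guesses about the layout that the comment does not confirm. The example data is made up.

```python
import json
from typing import Any


def summarize_index(index: Any) -> list[tuple[str, int]]:
    """Return (shard filename, sample count) pairs from parsed index data.

    Hypothetical helper: the key names used here are assumptions.
    """
    # Accept either a bare list of shard dicts or a dict wrapping that
    # list under an assumed "shards" key.
    shards = index["shards"] if isinstance(index, dict) else index
    summary = []
    for shard in shards:
        name = shard.get("raw_data", {}).get("basename", "<unknown>")
        samples = int(shard.get("samples", 0))
        summary.append((name, samples))
    return summary


# Made-up index contents, for illustration only.
example_text = (
    '{"version": 2, "shards": ['
    '{"raw_data": {"basename": "shard.00000.mds"}, "samples": 100}, '
    '{"raw_data": {"basename": "shard.00001.mds"}, "samples": 42}]}'
)
summary = summarize_index(json.loads(example_text))
total = sum(count for _, count in summary)
print(summary, total)
```

For a real dataset, the same function could be called on `json.load(f)` for an opened `index.json`, and the listed filenames compared against the shard files actually present.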

Seems like option 1 is easier, but you would of course incur significant and potentially variable sample loading times. Perhaps you could smooth that out by raising DataLoader prefetch factor,...

Yeah sorry, there is a bit of object-orientation consternation in that Writer code; it's safe to force flush shards. We endeavor to have things crash quickly and loudly when things...

Thanks for trying Streaming. We haven't done a whole lot with the video modality, and have not seen this particular use case before, caveat emptor: > Two dataloaders and a...

Ah, you have a point there. Would it be enough to use the official temp root of your operating system (say, `os.path.join(tempfile.gettempdir(), 'streaming')` IIRC?) If not, what's your use case...
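The expression suggested above can be checked quickly on a given machine. A minimal sketch, assuming only the standard library; the helper name `streaming_temp_root` is hypothetical and is not part of Streaming's API:

```python
import os
import tempfile


def streaming_temp_root() -> str:
    """Build a 'streaming' directory path under the OS temp root.

    Hypothetical helper: it only evaluates the expression from the
    comment and does not create the directory.
    """
    return os.path.join(tempfile.gettempdir(), 'streaming')


root = streaming_temp_root()
print(root)
```

`tempfile.gettempdir()` honors the `TMPDIR`, `TEMP`, and `TMP` environment variables, so the resulting path can differ per user or per job depending on how the environment is set up.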

> After modifying both self._filelock_root in dataset.py and root in stream.py, the scripts can successfully executed! But I still want to confirm the correctness with you. Is there anything else...