Saaketh Narayan
@cabreraalex Mind addressing the review comments when you have some time? Thanks!
@cabreraalex Mind adding the test that @ethantang-db mentioned so we can get this in?
Hey @cabreraalex, that's a nice idea. Would you be able to submit an example PR? We always encourage community contributions :)
A couple of things:
* How are you verifying that duplicate shards are being downloaded between nodes? Streaming explicitly partitions shard files between nodes, so the degree of duplication should be pretty...
That makes sense! Thanks for investigating. `device_per_stream` is a newer batching method, so it is not completely download-optimal. Some download optimization has been implemented to prevent massive levels of duplication,...
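For reference, here's a minimal sketch of enabling this batching method (the stream names, bucket paths, and batch size below are placeholders):

```python
from streaming import Stream, StreamingDataset

# Two illustrative streams; remote/local paths are placeholders.
streams = [
    Stream(remote='s3://my-bucket/stream_a', local='/tmp/stream_a'),
    Stream(remote='s3://my-bucket/stream_b', local='/tmp/stream_b'),
]

# With `device_per_stream` batching, each device's batch is drawn from a
# single stream, which is why some shard duplication across nodes can occur.
dataset = StreamingDataset(
    streams=streams,
    batch_size=32,
    batching_method='device_per_stream',
)
```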
@XiaohanZhangCMU There are some linting errors.
@elbamos As mentioned, both torchrun and TorchDistributor work with StreamingDataset, in addition to Composer. From a Databricks notebook, TorchDistributor should make launching your job easy. @jbohnslav Regarding: > If...
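For example, a minimal sketch of launching from a Databricks notebook with TorchDistributor (the paths, process count, and training body are placeholders):

```python
from pyspark.ml.torch.distributor import TorchDistributor


def train():
    from torch.utils.data import DataLoader

    from streaming import StreamingDataset

    # Placeholder paths; point `remote` at your shard directory.
    dataset = StreamingDataset(remote='s3://my-bucket/dataset',
                               local='/local_disk0/cache',
                               batch_size=32)
    for batch in DataLoader(dataset, batch_size=32):
        ...  # training step goes here


# TorchDistributor sets up the distributed environment variables that
# StreamingDataset reads to partition shards across processes.
TorchDistributor(num_processes=2, local_mode=True, use_gpu=True).run(train)
```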
@AugustDev You filed #781, correct? @XiaohanZhangCMU's recommendations there make sense to me -- you can see the currently running processes with `top` and kill them. Then clear your stale shared...
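For anyone else hitting this, a minimal sketch of the cleanup step after killing the stale processes:

```python
from streaming.base.util import clean_stale_shared_memory

# Removes leftover shared memory from a previous interrupted run, so a
# fresh StreamingDataset instantiation doesn't collide with stale state.
clean_stale_shared_memory()
```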
@jamin-chen Great question -- yes, this should be the case, since StreamingDataset tracks cache usage even for locally present shards. Are you seeing behavior contrary to this?
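For context, that tracking is governed by the `cache_limit` argument; a minimal sketch (paths and limit are placeholders):

```python
from streaming import StreamingDataset

# `cache_limit` bounds total local shard usage. Shards already present
# locally count toward this limit and are evicted once it is exceeded.
dataset = StreamingDataset(
    remote='s3://my-bucket/dataset',
    local='/tmp/dataset_cache',
    cache_limit='10gb',
)
```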
@jamin-chen Sorry for the delay in responding to this. So in `StreamingDataset`'s `prepare_shard` function [here](https://github.com/mosaicml/streaming/blob/32caef202f69c6be3f424956da751981f7143fa5/streaming/base/dataset.py#L1122), all shard states should start as `REMOTE`. Then, the particular `Stream`'s `prepare_shard` function is called...
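As a rough illustration of that state progression (a simplified sketch, not the library's actual enum or code):

```python
from enum import IntEnum


class ShardState(IntEnum):
    """Simplified stand-in for the shard states tracked by StreamingDataset."""
    REMOTE = 1      # shard exists only at the remote location
    PREPARING = 2   # a worker is downloading/decompressing it
    LOCAL = 3       # shard is present in the local cache


def prepare_shard(state: ShardState) -> ShardState:
    """Sketch of the transition: a REMOTE shard is fetched, then becomes LOCAL."""
    if state == ShardState.REMOTE:
        state = ShardState.PREPARING
        # ... the Stream's own prepare logic downloads and validates files ...
        state = ShardState.LOCAL
    return state
```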