Saaketh Narayan
@cabreraalex Mind addressing the review comments when you have some time? Thanks!
@cabreraalex Mind adding the test that @ethantang-db mentioned so we can get this in?
Hey @cabreraalex, that's a nice idea. Would you be able to submit an example PR? We always encourage community contributions :)
A couple of things:
* How are you verifying that duplicate shards are being downloaded between nodes? Streaming explicitly partitions shard files between nodes, so the degree of duplication should be pretty...
That makes sense! Thanks for investigating. `device_per_stream` is a newer batching method, so it is not completely download-optimal. Some download optimization has been implemented to prevent massive levels of duplication,...
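For reference, here's a minimal sketch of enabling this batching method (the stream names, bucket paths, and batch size below are placeholders):

```python
from streaming import Stream, StreamingDataset

# Two illustrative streams; remote/local paths are placeholders.
streams = [
    Stream(remote='s3://my-bucket/stream_a', local='/tmp/stream_a'),
    Stream(remote='s3://my-bucket/stream_b', local='/tmp/stream_b'),
]

# With `device_per_stream` batching, each device's batch is drawn from a
# single stream, which is why some shard duplication across nodes can occur.
dataset = StreamingDataset(
    streams=streams,
    batch_size=32,
    batching_method='device_per_stream',
)
```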
@XiaohanZhangCMU There are some linting errors.
@elbamos As mentioned, both torchrun and TorchDistributor work with StreamingDataset, in addition to Composer. From a Databricks notebook, TorchDistributor should make launching your job easy. @jbohnslav Regarding: > If...
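For example, a minimal sketch of launching from a Databricks notebook with TorchDistributor (the paths, process count, and training body are placeholders):

```python
from pyspark.ml.torch.distributor import TorchDistributor


def train():
    from torch.utils.data import DataLoader

    from streaming import StreamingDataset

    # Placeholder paths; point `remote` at your shard directory.
    dataset = StreamingDataset(remote='s3://my-bucket/dataset',
                               local='/local_disk0/cache',
                               batch_size=32)
    for batch in DataLoader(dataset, batch_size=32):
        ...  # training step goes here


# TorchDistributor sets up the distributed environment variables that
# StreamingDataset reads to partition shards across processes.
TorchDistributor(num_processes=2, local_mode=True, use_gpu=True).run(train)
```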
@AugustDev You filed #781, correct? @XiaohanZhangCMU's recommendations there make sense to me -- you can see the currently running processes with `top` and kill them. Then clear your stale shared...
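For anyone else hitting this, a minimal sketch of the cleanup step after killing the stale processes:

```python
from streaming.base.util import clean_stale_shared_memory

# Removes leftover shared memory from a previous interrupted run, so a
# fresh StreamingDataset instantiation doesn't collide with stale state.
clean_stale_shared_memory()
```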
@jamin-chen Great question -- yes, this should be the case, since StreamingDataset tracks cache usage even for locally present shards. Are you seeing behavior contrary to this?
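For context, that tracking is governed by the `cache_limit` argument; a minimal sketch (paths and limit are placeholders):

```python
from streaming import StreamingDataset

# `cache_limit` bounds total local shard usage. Shards already present
# locally count toward this limit and are evicted once it is exceeded.
dataset = StreamingDataset(
    remote='s3://my-bucket/dataset',
    local='/tmp/dataset_cache',
    cache_limit='10gb',
)
```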
@jamin-chen Sorry for the delay in responding to this. So in `StreamingDataset`'s `prepare_shard` function [here](https://github.com/mosaicml/streaming/blob/32caef202f69c6be3f424956da751981f7143fa5/streaming/base/dataset.py#L1122), all shard states should start as `REMOTE`. Then, the particular `Stream`'s `prepare_shard` function is called...
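As a rough illustration of that state progression (a simplified sketch, not the library's actual enum or code):

```python
from enum import IntEnum


class ShardState(IntEnum):
    """Simplified stand-in for the shard states tracked by StreamingDataset."""
    REMOTE = 1      # shard exists only at the remote location
    PREPARING = 2   # a worker is downloading/decompressing it
    LOCAL = 3       # shard is present in the local cache


def prepare_shard(state: ShardState) -> ShardState:
    """Sketch of the transition: a REMOTE shard is fetched, then becomes LOCAL."""
    if state == ShardState.REMOTE:
        state = ShardState.PREPARING
        # ... the Stream's own prepare logic downloads and validates files ...
        state = ShardState.LOCAL
    return state
```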