
Will cache eviction logic take previously-existing shards into account?

Open jamin-chen opened this issue 1 year ago • 3 comments

If I run training twice with the same local cache directory, will the second run be able to evict shards that were downloaded by the first run?

jamin-chen avatar Dec 05 '24 19:12 jamin-chen

@jamin-chen Great question -- yes this should be the case since StreamingDataset tracks the cache usage even for locally present shards. Are you seeing behavior contrary to this?

snarayan21 avatar Dec 05 '24 19:12 snarayan21

Maybe, but there are a few confounding factors on our end so I need to test further. In the meantime could you point me to the code where StreamingDataset lists all pre-existing shards in the local directory?

jamin-chen avatar Dec 05 '24 19:12 jamin-chen

@jamin-chen Sorry for the delay in responding to this. In StreamingDataset's prepare_shard function here, all shard states should start as REMOTE. The particular Stream's prepare_shard function is then called here; it fetches the shard and returns the increase in cache usage (see here). Importantly, this code path is followed even for shard files that are already present locally, so the cache usage delta returned by prepare_shard is the same whether or not the file was already there. The cache usage of locally present shards should therefore be tracked correctly.
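To make the accounting concrete, here is a minimal toy sketch of the flow described above. It is not the library's code: `ToyShardCache`, `_fetch`, `_evict_one`, the `downloads` counter, and the "evict the first LOCAL shard" policy are all hypothetical names and choices made up for illustration. The only things taken from the description are that every shard starts as REMOTE and that preparing a shard adds the same usage delta whether or not the file was already on disk.

```python
import os
from enum import Enum


class ShardState(Enum):
    REMOTE = 0
    LOCAL = 1


class ToyShardCache:
    """Toy model of the cache accounting; all names here are hypothetical."""

    def __init__(self, local_dir, shard_sizes, cache_limit):
        self.local_dir = local_dir
        self.shard_sizes = shard_sizes  # shard file name -> size in bytes
        self.cache_limit = cache_limit
        self.cache_usage = 0
        self.downloads = 0  # counts simulated downloads only
        # Every shard starts as REMOTE, even if its file is already on disk.
        self.states = {name: ShardState.REMOTE for name in shard_sizes}

    def _fetch(self, name):
        """Make sure the shard file exists locally and return the usage delta."""
        path = os.path.join(self.local_dir, name)
        if not os.path.exists(path):
            with open(path, "wb") as f:  # stand-in for a real download
                f.write(b"\0" * self.shard_sizes[name])
            self.downloads += 1
        # The delta is the same whether or not the file was already present.
        return self.shard_sizes[name]

    def _evict_one(self):
        """Evict the first tracked LOCAL shard (an arbitrary toy policy)."""
        for name, state in self.states.items():
            if state is ShardState.LOCAL:
                os.remove(os.path.join(self.local_dir, name))
                self.states[name] = ShardState.REMOTE
                self.cache_usage -= self.shard_sizes[name]
                return
        raise RuntimeError("cache limit is too small for this shard")

    def prepare_shard(self, name):
        """Bring one shard into the LOCAL state, evicting others if needed."""
        if self.states[name] is ShardState.LOCAL:
            return
        while self.cache_usage + self.shard_sizes[name] > self.cache_limit:
            self._evict_one()
        self.cache_usage += self._fetch(name)
        self.states[name] = ShardState.LOCAL
```

In this sketch, a second `ToyShardCache` pointed at the same directory counts a leftover shard file toward `cache_usage` as soon as that shard is prepared, without downloading it again, and from then on the file is eligible for eviction. Note that the toy model only tracks a file once it has been prepared in the current run; whether the real implementation behaves the same way for shards that are never touched is something to confirm against the linked code.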

Does this help with your custom use case / issues you're seeing?

snarayan21 avatar Jan 04 '25 05:01 snarayan21