streaming
Will cache eviction logic take previously-existing shards into account?
If I run training twice with the same local cache directory, will the second run be able to evict shards that were downloaded by the first run?
@jamin-chen Great question. Yes, this should be the case, since StreamingDataset tracks cache usage even for shards that are already present locally. Are you seeing behavior contrary to this?
Maybe, but there are a few confounding factors on our end, so I need to test further. In the meantime, could you point me to the code where StreamingDataset lists all pre-existing shards in the local directory?
@jamin-chen Sorry for the delay in responding. In StreamingDataset's prepare_shard function here, all shard states should start as REMOTE. That function then calls the particular Stream's prepare_shard function here, which fetches the shard and returns the resulting change in cache usage (see here). Importantly, this code path is followed even for shard files that are already present locally, so the cache usage delta returned by prepare_shard is the same whether or not the file was already on disk. The cache usage of locally present shards should therefore be tracked correctly.
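To make the accounting above concrete, here is a toy model of it. This is only a sketch and not the library's actual code: `ToyCache`, `_fetch`, `_evict_oldest`, the dict-based "remote", and the oldest-first eviction order are all hypothetical stand-ins. The one thing it mirrors is the behavior described above: every shard starts as REMOTE, and preparing a shard adds its size to the cache usage whether or not the file was already on disk.

```python
import os
import tempfile
from enum import Enum


class ShardState(Enum):
    REMOTE = 0
    LOCAL = 1


class ToyCache:
    """Hypothetical stand-in for the cache accounting; not the real classes."""

    def __init__(self, local_dir, remote, cache_limit):
        self.local_dir = local_dir
        self.remote = remote            # shard name -> bytes (fake "remote")
        self.cache_limit = cache_limit  # in bytes
        self.cache_usage = 0
        # Every shard starts as REMOTE, even if its file is already on disk.
        self.states = {name: ShardState.REMOTE for name in remote}
        self.order = []                 # shards in the order they became LOCAL

    def _path(self, name):
        return os.path.join(self.local_dir, name)

    def _fetch(self, name):
        # Download only if missing, but always report the shard's size.
        path = self._path(name)
        if not os.path.exists(path):
            with open(path, 'wb') as f:
                f.write(self.remote[name])
        return os.path.getsize(path)

    def _evict_oldest(self):
        name = self.order.pop(0)
        size = os.path.getsize(self._path(name))
        os.remove(self._path(name))
        self.states[name] = ShardState.REMOTE
        self.cache_usage -= size

    def prepare_shard(self, name):
        if self.states[name] is ShardState.LOCAL:
            return
        delta = self._fetch(name)  # same delta whether or not the file existed
        while self.order and self.cache_usage + delta > self.cache_limit:
            self._evict_oldest()
        self.cache_usage += delta
        self.states[name] = ShardState.LOCAL
        self.order.append(name)


remote = {'a': b'0' * 4, 'b': b'1' * 4, 'c': b'2' * 4}
local_dir = tempfile.mkdtemp()

# First run: large limit, downloads shards a and b.
run1 = ToyCache(local_dir, remote, cache_limit=100)
run1.prepare_shard('a')
run1.prepare_shard('b')

# Second run: same directory, room for only two shards.
run2 = ToyCache(local_dir, remote, cache_limit=8)
for name in ('a', 'b', 'c'):
    run2.prepare_shard(name)

a_still_on_disk = os.path.exists(os.path.join(local_dir, 'a'))
```

In this toy model, the second run counts shards a and b toward its usage even though it never downloaded them, so preparing c pushes it over the limit and it evicts a, a file written by the first run.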
Does this help with your custom use case / issues you're seeing?