
Local cache dir not fully clearing in DDP multi-node training.

Open JackUrb opened this issue 1 year ago • 12 comments

🐛 Bug

At the moment it seems that a significant portion of the data stored in the cache (~40-60%) is not removed over the course of training, and remains in the cache dir after training completes. I suspect this is related to the force_download behavior described below.

To Reproduce

Currently training a few models over 8 nodes with a large source dataset (only 1 epoch), and the cache dir size accumulates indefinitely. Over the course of the run, the lock counts for these files are much greater than 0, so I wonder if it has to do with force_download-related behavior?
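To quantify the accumulation, a small monitor like the following can log the cache-dir size alongside training (a generic sketch using only the standard library; the cache path is whatever you configured for litdata):

```python
import os
import time


def dir_size_bytes(path):
    """Total size of all files under path, tolerating concurrent eviction."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file was deleted between listing and stat
    return total


def monitor(path, interval_s=60):
    """Print the cache size once per interval; run in a side process."""
    while True:
        print(f"{time.strftime('%H:%M:%S')} cache={dir_size_bytes(path) / 1e9:.1f} GB")
        time.sleep(interval_s)
```

Plotting these samples over an epoch makes it easy to see whether the cache plateaus near `max_cache_size` or grows without bound.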

Expected behavior

The cache dir should not grow significantly above its target size, and files should not be locked this many times.


JackUrb avatar Mar 12 '25 18:03 JackUrb

Hey @JackUrb. Could you try disabling force download to see if it helps?

tchaton avatar Mar 13 '25 12:03 tchaton

Unfortunately, if we disable force download, we end up stalling out in multiple places, and the run has a high chance of failing outright early in training.

Perhaps the (band-aid) solution is to run a force download without increasing the lock count: if you arrive in a situation where you would call the force download, it's because you expected the file to already be present, meaning you have presumably already incremented the lock.
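In sketch form, the proposed band-aid might look like this (entirely hypothetical names; `ChunkFetcher`, `_download`, and `lock_counts` stand in for litdata's internal lock-count and download machinery, which this does not reproduce):

```python
class ChunkFetcher:
    """Illustrative only: models the proposed behavior, not litdata internals."""

    def __init__(self):
        self.lock_counts = {}  # chunk filename -> .cnt-style reference count

    def _download(self, chunk):
        pass  # placeholder for the actual remote fetch

    def fetch(self, chunk, force_download=False):
        if force_download:
            # Band-aid: a forced re-download means the caller already
            # expected the chunk to be present, so the lock was presumably
            # already incremented once. Downloading without incrementing
            # again avoids leaking a count.
            self._download(chunk)
            return
        self.lock_counts[chunk] = self.lock_counts.get(chunk, 0) + 1
        self._download(chunk)
```

Under this scheme, a forced re-fetch of an already-tracked chunk leaves its lock count unchanged, so eviction at epoch end can still drive it back to zero.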

JackUrb avatar Mar 14 '25 08:03 JackUrb

Hey @JackUrb. Good idea.

tchaton avatar Mar 22 '25 16:03 tchaton

Hi @JackUrb, we recently added a few fixes related to cache clearance. It's not released yet, but available on the main branch—feel free to give it a try and let us know how it goes.

Thanks!

bhimrazy avatar Apr 13 '25 10:04 bhimrazy

Hi @bhimrazy - launching a new run on litdata main this week, will report back.

JackUrb avatar Apr 16 '25 17:04 JackUrb

> Hi @bhimrazy - launching a new run on litdata main this week, will report back.

Thank you @JackUrb

bhimrazy avatar Apr 16 '25 17:04 bhimrazy

@bhimrazy Unfortunately, no dice - we end up with an overfull cache directory containing all of the `dat-0-x.bin` files with x between 0 and 40, and then around 1 in 10 of the next 100,000 shards.

The first 40 end up having a `.cnt` of ~22 +/- 1, and almost all of the rest have 1, with an occasional 2. I suspect the higher counts on the first ones are caused by using the same dataset for val as for train. Two separate runs used the same cache, so the 1's and occasional 2's make sense to me as some race condition occurring in roughly 1 in 20 shards.
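For anyone debugging a similar leak, a histogram of the leftover lock counts can be built with a short script (assuming, as in the numbers above, that each `.cnt` file in the cache dir holds a single integer count):

```python
import os
from collections import Counter


def lock_count_histogram(cache_dir):
    """Map leftover lock-count value -> number of .cnt files holding it."""
    hist = Counter()
    for name in os.listdir(cache_dir):
        if name.endswith(".cnt"):
            with open(os.path.join(cache_dir, name)) as f:
                hist[int(f.read().strip())] += 1
    return dict(hist)
```

A healthy end-of-run cache should show counts at or near zero; a spike at a high value (like the ~22 above) points at repeated lock increments on the same shards.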

JackUrb avatar Apr 21 '25 14:04 JackUrb

Thank you, @JackUrb, for the detailed follow-up.

If possible, could you also share a bit more about your training setup? Specifically:

  • Number of nodes
  • Number of CPU cores and devices per node
  • Number of workers allocated per process
  • Cache size allocated per node (in GB)
  • Total dataset size (in GB/TB)
  • Remaining cache size at the end of epoch (in GB)
  • Any other details you think might help with debugging

This info would be really helpful as we try to replicate and test the issue on smaller-scale setups. Really appreciate your time and insights!

bhimrazy avatar Apr 21 '25 17:04 bhimrazy

Hi @bhimrazy, of course, here's what I've got!

  • Number of nodes: 4
  • Number of CPU cores and devices per node: 8 GPUs per node, 20 CPUs per GPU
  • Number of workers allocated per process: 8-10 workers in a StreamingDataLoader
  • Cache size allocated per node (in GB): 100GB allocated globally, all pointing to the same cache dir
  • Total dataset size (in GB/TB): ~2.2TB
  • Remaining cache size at the end of epoch (in GB): ~900GB
  • Any other details you think might help with debugging:
  • Shard size is smaller than we'd like at ~4.1 MB each (still fixing a bug in our export)
    • Our StreamingDataLoader always wraps a CombinedStreamingDataset, even when there's only one dataset, since our tooling is set up to handle complex mixing experiments.

JackUrb avatar Apr 23 '25 23:04 JackUrb

Thank you again, @JackUrb — this is really helpful.

Just a few more clarifications if you don't mind:

  1. Could you share the approximate number of StreamingDatasets used in the combined dataset, along with their average sizes? Also, do all of them individually have max_cache_size set to 100GB?

  2. You mentioned “100GB allocated globally, all pointing to the same cache dir” — just to confirm, does this mean that all nodes are sharing a single cache directory, possibly mounted over NFS/EFS?

    In most setups, the cache directory is maintained locally on each node and not shared across nodes — just wanted to double-check how it's configured in your case.


  3. You noted that shard size is ~4.1MB — just to confirm, does this mean each individual data chunk is around that size?
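On point 2: if the cache must live on a shared network drive, one way to keep each node's cache (and its lock accounting) isolated is to derive a per-node subdirectory from the node rank. A minimal sketch, assuming a launcher that sets the common `NODE_RANK` environment variable (torchrun and Lightning do); the base path is a placeholder:

```python
import os


def node_local_cache_dir(base="/shared/litdata-cache"):
    """Give each node its own cache subdirectory on a shared drive.

    NODE_RANK is set by common launchers (torchrun, Lightning); defaulting
    to "0" keeps single-node runs working unchanged.
    """
    node_rank = os.environ.get("NODE_RANK", "0")
    path = os.path.join(base, f"node-{node_rank}")
    os.makedirs(path, exist_ok=True)
    return path
```

With this layout, one node's eviction and lock counts can never race against another node's, which matches the usual litdata assumption of a node-local cache.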

bhimrazy avatar Apr 24 '25 03:04 bhimrazy

  1. In this case, we're using just one StreamingDataset.
  2. It's globally allocated and pointed at a directory on a shared (network) drive. We were having issues with per-machine drive size before, though eventually we'll resolve this by changing our scheduler.
  3. Yeah, each individual chunk is around 4.1 MB.

JackUrb avatar Apr 24 '25 23:04 JackUrb

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 19 '25 06:07 stale[bot]