open_clip
CPU memory leak when training with CsvDataset-like dataset
Hi, I tried to train with SigLIP loss on a large dataset and found that during training (not evaluation), CPU memory usage kept increasing until the program was finally killed by the system. The data loading process is nothing special, similar to what CsvDataset does. Has anyone encountered a similar problem?
@estherxue does it behave differently than with the normal CLIP (InfoNCE) loss on the exact same setup?
Below is the running script for siglip:
torchrun --nproc_per_node 1 \
  --nnodes $WORLD_SIZE \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  -m training.main \
  --train-data '' \
  --val-data '' \
  --dataset-type hugface \
  --batch-size 512 \
  --precision amp \
  --csv-img-key image_hash \
  --csv-caption-key caption \
  --local-loss \
  --gather-with-grad \
  --logs /home/work/data_mm_pretrain/models/siglip_b16_60m_large_bs_no_wd/ \
  --name large_bs \
  --workers 12 \
  --epochs 10 \
  --model ViT-B-16-SigLIP \
  --pretrained webli \
  --warmup 0 \
  --beta2 0.95 \
  --lr 5e-5 \
  --wd 0. \
  --torchcompile \
  --siglip
Below is the running script for clip:
torchrun --nproc_per_node 1 \
  --nnodes $WORLD_SIZE \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  -m training.main \
  --train-data '' \
  --val-data '' \
  --dataset-type hugface \
  --batch-size 768 \
  --precision amp \
  --csv-img-key image_hash \
  --csv-caption-key caption \
  --local-loss \
  --gather-with-grad \
  --logs /home/work/data_mm_pretrain/models/clip_b32_id_2.5m_baseline/ \
  --name large_bs \
  --workers 12 \
  --epochs 4 \
  --model ViT-B-32-quickgelu \
  --pretrained openai \
  --use-thumbnail \
  --warmup 0 \
  --lr 5e-5 \
  --wd 0. \
  --torchcompile
The only difference that I can notice is the memory usage on CPU.
Did you try training with the standard CLIP loss?
Hi, are there any updates?
I tried training with the standard CLIP loss. I was wrong; the standard CLIP loss has the memory leak problem as well.
I finally worked around this problem by training on only a limited number of batches per epoch, since the memory usage goes back down once the code finishes an epoch.
It seems that this has nothing to do with the loss. The memory leak exists when doing evaluation.
I did some research on CPU memory leaks. Most of the time they appear either when tensors are accumulated without being detached (so each one keeps its entire computational graph alive), or because of data loader issues such as copy-on-read: storing plain Python objects in the dataset definition, whose reference counts get bumped whenever they are accessed by the multiple dataloader worker processes, so each worker gradually copies the shared pages. A minimal sketch of both failure modes follows after the links below.
These resources might help you to debug:
- https://github.com/pytorch/pytorch/issues/13246
- https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/#Motivation-for-In-RAM-Data
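For concreteness, here is a minimal sketch (not code from this repo) of the two failure modes described above; `model_step`, `CaptionDataset`, and the toy captions are made-up names used only for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# --- Pattern 1: accumulating live tensors ------------------------------------
def train_loop(model_step, loader):
    losses = []
    for images, texts in loader:
        loss = model_step(images, texts)
        # BAD: losses.append(loss) keeps the whole autograd graph of every
        # step alive, so memory grows for the entire epoch.
        # OK: detach (or call .item()) so each step's graph can be freed.
        losses.append(loss.detach().cpu())
    return torch.stack(losses).mean()

# --- Pattern 2: copy-on-read in dataloader workers ---------------------------
class CaptionDataset(Dataset):
    def __init__(self, captions):
        # BAD with num_workers > 0: a plain Python list of str objects.
        # Reading an element updates its refcount, which dirties the
        # copy-on-write pages inherited on fork, so each worker slowly
        # materialises its own copy of the whole list:
        #   self.captions = list(captions)
        # Better: one contiguous numpy buffer, no per-item refcounts touched.
        self.captions = np.array(captions)

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        return str(self.captions[idx])

if __name__ == "__main__":
    ds = CaptionDataset(["a photo of a cat"] * 1000)
    loader = DataLoader(ds, batch_size=8, num_workers=2)
    print(next(iter(loader))[:2])
```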
If you find something, let me know. I am experiencing RAM memory leaks too when fine-tuning CLIP using LoRA and the standard distributed CLIP loss implemented in this repo, but I use the Lightning Fabric launcher and a MosaicML streaming dataset instead of WebDataset.
We've done a lot of large-scale training, with long durations and big datasets, and never found any noteworthy issues with dataloader memory leaks and the webdataset code. We don't use csv datasets though, so possibly an issue there.
There is significant memory churn when you're plowing through really large datasets, and some allocators have issues with fragmentation over time. I usually patch the allocator to use tcmalloc:
LD_PRELOAD=/lib/<system dependent>/libtcmalloc.so.4
... apt-get install google-perftools to get the lib.
Should point out that normal 'validation' is VERY memory intensive if you have a lot of samples in your val dataset, because it computes a full similarity matrix; the val set should be treated as a 'gallery'-style dataset, i.e. a hand-picked, limited set of test samples. That can really spike memory. We usually use zero-shot eval to gauge progress, as it's more sane to run across larger val sets and is often the metric most people focus on (though there are valid arguments for preferring other val metrics too).
A batch-wise evaluation (averaging over batched similarities) would be possible but is not implemented; see the rough sketch of the memory cost and a chunked alternative below.
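To make the scale concrete, here is a back-of-the-envelope sketch (not open_clip's evaluation code): the full similarity matrix for N validation samples is N x N floats, so 100k samples already need roughly 40 GB in float32, whereas a chunked ranking only ever holds one chunk x N slice at a time. `chunked_image_to_text_ranks` is a hypothetical helper, not something implemented in the repo.

```python
import torch

def full_matrix_bytes(n_samples, dtype_bytes=4):
    # The full validation similarity matrix is N x N.
    return n_samples * n_samples * dtype_bytes

@torch.no_grad()
def chunked_image_to_text_ranks(image_feats, text_feats, chunk=1024):
    """Rank of the matching text for each image, computed chunk by chunk.

    image_feats, text_feats: (N, D) L2-normalised features, paired by index.
    Peak extra memory is chunk x N instead of N x N.
    """
    n = image_feats.shape[0]
    ranks = torch.empty(n, dtype=torch.long)
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        sims = image_feats[start:end] @ text_feats.T              # (chunk, N)
        gt = sims[torch.arange(end - start), torch.arange(start, end)]
        ranks[start:end] = (sims > gt.unsqueeze(1)).sum(dim=1)
    return ranks

if __name__ == "__main__":
    print(f"{full_matrix_bytes(100_000) / 1e9:.1f} GB")  # 40.0 GB in float32
    feats = torch.nn.functional.normalize(torch.randn(4096, 512), dim=-1)
    ranks = chunked_image_to_text_ranks(feats, feats)
    print("recall@1:", (ranks == 0).float().mean().item())  # 1.0 for identical feats
```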
Using CSV datasets with the native implementation will lead to an increase in memory. As @miguelalba96 linked, this is not a bug but expected behavior. The solution is either:
- use a different dataset format like webdataset. This will be streaming though, not map-style.
- if you want map-style with minimal changes, move to another backing structure that has no copy-on-read problem. PyArrow has worked in the past; see the sketch below.
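As an illustration of the PyArrow route, here is a minimal map-style dataset sketch; `ArrowCsvDataset` is hypothetical and not the repo's CsvDataset, though the `image_hash`/`caption` column names mirror the flags used above.

```python
import pyarrow.csv as pa_csv
from torch.utils.data import Dataset

class ArrowCsvDataset(Dataset):
    def __init__(self, csv_path, img_key="image_hash", caption_key="caption", transforms=None):
        # PyArrow keeps each column in contiguous off-heap buffers; reading a
        # row does not bump per-item Python refcounts the way a list or
        # DataFrame of str objects does, so worker RSS stays flat.
        self.table = pa_csv.read_csv(csv_path)
        self.img_key = img_key
        self.caption_key = caption_key
        self.transforms = transforms

    def __len__(self):
        return self.table.num_rows

    def __getitem__(self, idx):
        image_id = self.table[self.img_key][idx].as_py()
        caption = self.table[self.caption_key][idx].as_py()
        # Load and transform the image from image_id here (left out of the sketch).
        return image_id, caption
```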