open_clip
CPU memory leak when training with CsvDataset-like dataset
Hi, I tried to train with SigLIP loss on a large dataset and found that during training (not evaluation), CPU memory usage kept increasing until the program was finally killed by the system. The data loading process is nothing special, similar to what CsvDataset does. Has anyone encountered a similar problem?
@estherxue does it behave differently than with the normal CLIP (InfoNCE) loss on the exact same setup?
Below is the running script for siglip:
torchrun --nproc_per_node 1 \
  --nnodes $WORLD_SIZE \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  -m training.main \
  --train-data '' \
  --val-data '' \
  --dataset-type hugface \
  --batch-size 512 \
  --precision amp \
  --csv-img-key image_hash \
  --csv-caption-key caption \
  --local-loss \
  --gather-with-grad \
  --logs /home/work/data_mm_pretrain/models/siglip_b16_60m_large_bs_no_wd/ \
  --name large_bs \
  --workers 12 \
  --epochs 10 \
  --model ViT-B-16-SigLIP \
  --pretrained webli \
  --warmup 0 \
  --beta2 0.95 \
  --lr 5e-5 \
  --wd 0. \
  --torchcompile \
  --siglip
Below is the running script for clip:
torchrun --nproc_per_node 1 \
  --nnodes $WORLD_SIZE \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  -m training.main \
  --train-data '' \
  --val-data '' \
  --dataset-type hugface \
  --batch-size 768 \
  --precision amp \
  --csv-img-key image_hash \
  --csv-caption-key caption \
  --local-loss \
  --gather-with-grad \
  --logs /home/work/data_mm_pretrain/models/clip_b32_id_2.5m_baseline/ \
  --name large_bs \
  --workers 12 \
  --epochs 4 \
  --model ViT-B-32-quickgelu \
  --pretrained openai \
  --use-thumbnail \
  --warmup 0 \
  --lr 5e-5 \
  --wd 0. \
  --torchcompile
The only difference that I can notice is the memory usage on CPU.
Did you try training with the standard CLIP loss?
Hi, are there any updates?
I tried training with the standard CLIP loss. I was wrong; the standard CLIP loss has the memory leak problem as well.
I finally worked around this problem by training on only a limited number of batches per epoch, since the memory usage goes back down once the code finishes an epoch.
It seems that this has nothing to do with the loss. The memory leak exists when doing evaluation.
I did some research on CPU memory leaks. Most of the time they appear either when tensors are accumulated without being detached (so each one keeps its entire computational graph alive), or because of data loader issues such as copy-on-read: storing plain Python objects in the dataset definition, whose reference counts get bumped whenever they are accessed by the multiple dataloader worker processes, so each worker gradually copies the shared pages. A minimal sketch of both failure modes follows after the links below.
These resources might help you to debug:
- https://github.com/pytorch/pytorch/issues/13246
- https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/#Motivation-for-In-RAM-Data
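For concreteness, here is a minimal sketch (not code from this repo) of the two failure modes described above; `model_step`, `CaptionDataset`, and the toy captions are made-up names used only for illustration.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# --- Pattern 1: accumulating live tensors ------------------------------------
def train_loop(model_step, loader):
    losses = []
    for images, texts in loader:
        loss = model_step(images, texts)
        # BAD: losses.append(loss) keeps the whole autograd graph of every
        # step alive, so memory grows for the entire epoch.
        # OK: detach (or call .item()) so each step's graph can be freed.
        losses.append(loss.detach().cpu())
    return torch.stack(losses).mean()

# --- Pattern 2: copy-on-read in dataloader workers ---------------------------
class CaptionDataset(Dataset):
    def __init__(self, captions):
        # BAD with num_workers > 0: a plain Python list of str objects.
        # Reading an element updates its refcount, which dirties the
        # copy-on-write pages inherited on fork, so each worker slowly
        # materialises its own copy of the whole list:
        #   self.captions = list(captions)
        # Better: one contiguous numpy buffer, no per-item refcounts touched.
        self.captions = np.array(captions)

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        return str(self.captions[idx])

if __name__ == "__main__":
    ds = CaptionDataset(["a photo of a cat"] * 1000)
    loader = DataLoader(ds, batch_size=8, num_workers=2)
    print(next(iter(loader))[:2])
```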
If you find something, let me know. I am experiencing RAM memory leaks too when fine-tuning CLIP using LoRA and the standard distributed CLIP loss implemented in this repo, but I use the Lightning Fabric launcher and a MosaicML streaming dataset instead of WebDataset.
We've done a lot of large-scale training, with long durations and big datasets, and never found any noteworthy issues with dataloader memory leaks and the webdataset code. We don't use csv datasets though, so possibly an issue there.
There is significant memory churn when you're plowing through really large datasets, and some allocators have issues with fragmentation over time. I usually patch the allocator to use tcmalloc:
LD_PRELOAD=/lib/<system dependent>/libtcmalloc.so.4
... apt-get install google-perftools to get the lib.
Should point out that normal 'validation' is VERY memory intensive if you have a lot of samples in your val dataset, because it computes a full similarity matrix; the val set should be treated as a 'gallery'-style dataset, i.e. a hand-picked, limited set of test samples. That can really spike memory. We usually use zero-shot eval to gauge progress, as it's more sane to run across larger val sets and is often the metric most people focus on (though there are valid arguments for preferring other val metrics too).
A batch-wise evaluation (averaging over batched similarities) would be possible but is not implemented; see the rough sketch of the memory cost and a chunked alternative below.
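To make the scale concrete, here is a back-of-the-envelope sketch (not open_clip's evaluation code): the full similarity matrix for N validation samples is N x N floats, so 100k samples already need roughly 40 GB in float32, whereas a chunked ranking only ever holds one chunk x N slice at a time. `chunked_image_to_text_ranks` is a hypothetical helper, not something implemented in the repo.

```python
import torch

def full_matrix_bytes(n_samples, dtype_bytes=4):
    # The full validation similarity matrix is N x N.
    return n_samples * n_samples * dtype_bytes

@torch.no_grad()
def chunked_image_to_text_ranks(image_feats, text_feats, chunk=1024):
    """Rank of the matching text for each image, computed chunk by chunk.

    image_feats, text_feats: (N, D) L2-normalised features, paired by index.
    Peak extra memory is chunk x N instead of N x N.
    """
    n = image_feats.shape[0]
    ranks = torch.empty(n, dtype=torch.long)
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        sims = image_feats[start:end] @ text_feats.T              # (chunk, N)
        gt = sims[torch.arange(end - start), torch.arange(start, end)]
        ranks[start:end] = (sims > gt.unsqueeze(1)).sum(dim=1)
    return ranks

if __name__ == "__main__":
    print(f"{full_matrix_bytes(100_000) / 1e9:.1f} GB")  # 40.0 GB in float32
    feats = torch.nn.functional.normalize(torch.randn(4096, 512), dim=-1)
    ranks = chunked_image_to_text_ranks(feats, feats)
    print("recall@1:", (ranks == 0).float().mean().item())  # 1.0 for identical feats
```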
Using CSV datasets with the native implementation will lead to an increase in memory. As @miguelalba96 linked, this is not a bug but expected behavior. The solution is either:
- use a different dataset format like webdataset. This will be streaming though, not map-style.
- if you want map-style with minimal changes, move to another backing structure that has no copy-on-read problem. PyArrow has worked in the past; see the sketch below.
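As an illustration of the PyArrow route, here is a minimal map-style dataset sketch; `ArrowCsvDataset` is hypothetical and not the repo's CsvDataset, though the `image_hash`/`caption` column names mirror the flags used above.

```python
import pyarrow.csv as pa_csv
from torch.utils.data import Dataset

class ArrowCsvDataset(Dataset):
    def __init__(self, csv_path, img_key="image_hash", caption_key="caption", transforms=None):
        # PyArrow keeps each column in contiguous off-heap buffers; reading a
        # row does not bump per-item Python refcounts the way a list or
        # DataFrame of str objects does, so worker RSS stays flat.
        self.table = pa_csv.read_csv(csv_path)
        self.img_key = img_key
        self.caption_key = caption_key
        self.transforms = transforms

    def __len__(self):
        return self.table.num_rows

    def __getitem__(self, idx):
        image_id = self.table[self.img_key][idx].as_py()
        caption = self.table[self.caption_key][idx].as_py()
        # Load and transform the image from image_id here (left out of the sketch).
        return image_id, caption
```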