AdaMerging: Couldn't Reproduce the Code

Hello,

I attempted to reproduce the code, but encountered some issues. Could you please provide some insights into how much memory is expected to be used? Additionally, I suspect there might be a memory leak.

dataset_name:SUN397 torch.cuda.memory_allocated:6.21 GB
0%|▏ | 1/503 [00:01<14:33, 1.74s/it]
dataset_name:Cars torch.cuda.memory_allocated:11.91 GB
0%|▎ | 1/394 [00:01<08:39, 1.32s/it]
dataset_name:RESISC45 torch.cuda.memory_allocated:17.61 GB
1%|▋ | 1/169 [00:01<03:21, 1.20s/it]
dataset_name:EuroSAT torch.cuda.memory_allocated:23.31 GB
Using downloaded and verified file: /home/monody/AdaMerging/dataset/svhn/train_32x32.mat
Using downloaded and verified file: /home/monody/AdaMerging/dataset/svhn/test_32x32.mat
0%| | 1/1627 [00:01<38:39, 1.43s/it]
dataset_name:SVHN torch.cuda.memory_allocated:29.01 GB
0%| | 0/790 [00:00<?, ?it/s]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 31.74 GiB total capacity; 29.65 GiB already allocated; 25.12 MiB free; 31.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
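
The readings in this log grow by the same amount after every dataset, which is what one would expect if each forward pass keeps its autograd graph alive until a single `backward()` call at the end of the step. A minimal sketch of that arithmetic, using only the numbers printed above; the linear extrapolation to all 8 datasets is an assumption, not a measurement:

```python
# Allocated memory printed after each dataset's forward pass in the log.
readings_gb = [6.21, 11.91, 17.61, 23.31, 29.01]

# Growth between consecutive datasets.
increments = [round(b - a, 2) for a, b in zip(readings_gb, readings_gb[1:])]
per_dataset_gb = sum(increments) / len(increments)

# Part of the first reading not explained by one per-dataset increment.
baseline_gb = readings_gb[0] - per_dataset_gb

# Assumption: the same growth continues for all 8 datasets.
num_datasets = 8
projected_gb = baseline_gb + num_datasets * per_dataset_gb

print(f"increments: {increments}")
print(f"per dataset: {per_dataset_gb:.2f} GB, baseline: {baseline_gb:.2f} GB")
print(f"projected peak for {num_datasets} datasets: {projected_gb:.2f} GB")
```

Under that assumption the step would need roughly 46 GB before `backward()` runs, more than the card's 31.74 GiB. Running out of memory after the fifth dataset is therefore consistent with steady accumulation inside one step, not necessarily with a leak across steps.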

Thank you for your assistance.

May 22 '24 07:05 monody1

Hi,

Thank you very much for your interest in our work.

Which architecture are you currently merging: ViT-B/32, ViT-B/16, or ViT-L/14? As I recall from my experiments, ViT-B/32 and ViT-B/16 could both be run on a single 3090 GPU (24 GB).

If you just want to evaluate, you can directly load my trained merge coefficients, which can be found at merging_cofficient.py.

Best, Enneng

May 22 '24 08:05 EnnengYang

I have been running the ViT-B/16 architecture on the 8 datasets with main_task_wise_adamerging on V100s (32 GB), and I see problems at the data loading step. I also have two questions about the loss in the unsupervised setting with 8 datasets. Is it correct that the unsupervised loss is accumulated over the 8 batches, one per dataset, and that a single backward pass is then performed? And is the order of these 8 batches fixed during unsupervised training?

May 22 '24 09:05 monody1

Hi,

ViT-B/16 doesn't seem to take up much memory, as the checkpoint file for each dataset is only 426.55MB. Can you run with ViT-B/32?

Or is it because you adjusted the batch size? For training coefficients, I default to 16.

On each iteration/step, the unlabeled test set in the code is re-loaded (and the Shuffle function is used), so the batch data is not fixed for each iteration.

Best, Enneng

May 22 '24 09:05 EnnengYang

I am using the default batch size of 16, but the code crashes during the data loading phase. The failure seems to occur at:

x = data['images'].to(args.device)
y = data['labels'].to(args.device)
outputs = adamerging_mtl_model(x, dataset_name)

Is this procedure correct?

I have summarized the CUDA memory usage here.

for epoch in range(epochs):
    losses = 0.
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        data = next(iter(dataloader))
        data = maybe_dictionarize(data)
        x = data['images'].to(args.device)
        y = data['labels'].to(args.device)

        outputs = adamerging_mtl_model(x, dataset_name)
        loss = softmax_entropy(outputs).mean(0)
        losses += loss

        print(dataset_name)
        print(f'{(torch.cuda.memory_allocated()/1024/1024/1024):.2f} GB')   #collect mem usage
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()

Details

python main_task_wise_adamerging.py
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/SUN397/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/Cars/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/RESISC45/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/EuroSAT/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/SVHN/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/GTSRB/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/MNIST/finetuned.pt
TaskVector:/home/monody/AdaMerging/checkpoints/ViT-B-16/DTD/finetuned.pt
Classification head for ViT-B-16 on SUN397 exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_SUN397.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_SUN397.pt
Classification head for ViT-B-16 on Cars exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_Cars.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_Cars.pt
Classification head for ViT-B-16 on RESISC45 exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_RESISC45.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_RESISC45.pt
Classification head for ViT-B-16 on EuroSAT exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_EuroSAT.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_EuroSAT.pt
Classification head for ViT-B-16 on SVHN exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_SVHN.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_SVHN.pt
Classification head for ViT-B-16 on GTSRB exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_GTSRB.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_GTSRB.pt
Classification head for ViT-B-16 on MNIST exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_MNIST.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_MNIST.pt
Classification head for ViT-B-16 on DTD exists at /home/monody/AdaMerging/checkpoints/ViT-B-16/head_DTD.pt
Loading classification head from /home/monody/AdaMerging/checkpoints/ViT-B-16/head_DTD.pt
init lambda: tensor([[1.0000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000]], grad_fn=)
collect_trainable_params: [Parameter containing: tensor([[0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000, 0.3000]], requires_grad=True)]
0%| | 1/1243 [00:02> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

May 22 '24 09:05 monody1

It runs after this change, but the memory usage is still inefficient.

for epoch in range(epochs):
    losses = 0.
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)
        data = next(iter(dataloader))  #get one batch
        data = maybe_dictionarize(data)
        x = data['images'].to(args.device)
        y = data['labels'].to(args.device)
        outputs = adamerging_mtl_model(x, dataset_name)
        loss = softmax_entropy(outputs).mean(0)
        losses += loss
        print(dataset_name)
        print(f'{(torch.cuda.memory_allocated()/1024/1024/1024):.2f} GB')
    optimizer.zero_grad()
    losses.backward()
    optimizer.step()

May 22 '24 09:05 monody1

Hi,

Summing the losses across multiple datasets is correct when doing backpropagation and updating the parameters.
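
Because the gradient of a sum is the sum of the gradients, the same update can also be obtained by calling `backward()` on each dataset's loss separately and letting PyTorch accumulate the results in `.grad`. Each forward graph can then be freed before the next one is built. A minimal sketch on a toy problem, not the repository's code; `lam`, `fake_batches`, and `entropy_loss` are hypothetical stand-ins for the merging coefficients, the per-dataset batches, and `softmax_entropy`:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins: 8 learnable coefficients and one toy batch per dataset.
lam = torch.full((8,), 0.3, requires_grad=True)
fake_batches = [torch.randn(16, 8) for _ in range(8)]

def entropy_loss(logits):
    # Mean Shannon entropy of the softmax predictions.
    log_p = torch.log_softmax(logits, dim=1)
    return -(log_p.exp() * log_p).sum(dim=1).mean()

# Variant 1: sum all losses, then one backward (all graphs alive at once).
total = sum(entropy_loss(x * lam) for x in fake_batches)
total.backward()
grad_summed = lam.grad.clone()

# Variant 2: one backward per dataset; gradients accumulate in lam.grad and
# each graph can be released as soon as its backward call returns.
lam.grad = None
for x in fake_batches:
    entropy_loss(x * lam).backward()
grad_accumulated = lam.grad.clone()

print(torch.allclose(grad_summed, grad_accumulated, atol=1e-5))
```

Applied to the training loop quoted in this thread, this would mean calling `optimizer.zero_grad()` before the dataset loop, `loss.backward()` inside it, and `optimizer.step()` after it. Peak memory should then be closer to one forward pass than to eight, though this has not been verified against the repository.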

May 22 '24 10:05 EnnengYang

It is true that reloading the dataset per iteration is not efficient. But due to RAM limitations, I can't keep all the dataloaders in memory, so I have to read them separately each iteration.

A simple fix would be to stop constructing the training dataloaders, since this project never uses the training sets, and to build only the test-set loaders.

May 22 '24 10:05 EnnengYang

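Hello, I still have not been able to reproduce the ViT-B/16 results. Each time execution enters this loop

for dataset_name in exam_datasets:

the dataloader is re-initialized and a single batch is drawn from it, correct? Is that equivalent to resampling from the 8 datasets in every epoch, so that the batch from dataset_A in epoch_i and the batch from dataset_A in epoch_j may contain the same samples? After 500 epochs, λ is currently [[1.0000, 0.1601, 0.0774, 0.0510, 0.0546, 0.0422, 0.0992, 0.0571, 0.7357]], which is still not close to the values given in merging_cofficient.py: [[1.0000, 0.1916, 0.1585, 0.2502, 0.3093, 0.2544, 0.3543, 0.2172, 0.1538]]. In addition, the average accuracy over the 8 datasets (Eval: Epoch: 499 Avg ACC:0.6502134075542055) is trending downward. Could you provide more details?

Thanks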
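
To make the gap between the two coefficient vectors easier to read, here is a small sketch that prints the per-entry difference. The mapping of entries to names is an assumption: it takes the first entry to be the pretrained model's coefficient and the rest to follow the order of the task vectors in the launch log earlier in this thread.

```python
# Assumed order: pretrained coefficient first, then the task vectors in the
# order they were loaded in the launch log.
names = ["pretrained", "SUN397", "Cars", "RESISC45", "EuroSAT",
         "SVHN", "GTSRB", "MNIST", "DTD"]

# Values quoted in the post above.
learned = [1.0000, 0.1601, 0.0774, 0.0510, 0.0546, 0.0422, 0.0992, 0.0571, 0.7357]
reference = [1.0000, 0.1916, 0.1585, 0.2502, 0.3093, 0.2544, 0.3543, 0.2172, 0.1538]

# Signed difference: learned minus reference.
diffs = {n: round(l - r, 4) for n, l, r in zip(names, learned, reference)}
for name, d in diffs.items():
    print(f"{name:>10}: {d:+.4f}")

# Entry with the largest absolute deviation from the reference.
largest = max(diffs, key=lambda n: abs(diffs[n]))
print("largest gap:", largest)
```

Every task coefficient except the last one ended up below its reference value, while the last one rose well above it. Which dataset that last entry belongs to depends on the ordering assumption stated above.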

Jun 03 '24 07:06 monody1

Hello,

for epoch in range(epochs):
    for dataset_name in exam_datasets:
        dataset = get_dataset(dataset_name, pretrained_model.val_preprocess, location=args.data_location, batch_size=16)
        dataloader = get_dataloader_shuffle(dataset)

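That is, in the actual implementation the data_loader is re-created every time the loop is entered (and the data_loader shuffles the data when loading). For example, the shuffled dataloader in datasets/mnist.py is: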

self.test_loader_shuffle = torch.utils.data.DataLoader(
    self.test_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=num_workers
)

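In other words, each iteration randomly samples one batch from the test set of the corresponding dataset. Because the sampling is random, the data drawn in different iterations may overlap slightly, but will not overlap completely.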
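
To put a rough number on "overlap slightly": if two batches of size B are drawn independently and uniformly from a test set of N images, the expected number of shared images is B²/N. The sketch below compares that value with a simulation. N = 10,000 is the size of the MNIST test set, and treating the two draws as independent is an assumption that mirrors rebuilding the shuffled loader on every iteration.

```python
import random

def expected_overlap(n_items, batch_size):
    # Each of the batch_size items in the first batch appears in the
    # second batch with probability batch_size / n_items.
    return batch_size * batch_size / n_items

def simulated_overlap(n_items, batch_size, trials, seed=0):
    # Draw two independent batches without replacement and count shared items.
    rng = random.Random(seed)
    population = range(n_items)
    total = 0
    for _ in range(trials):
        a = set(rng.sample(population, batch_size))
        b = set(rng.sample(population, batch_size))
        total += len(a & b)
    return total / trials

n_items, batch_size = 10_000, 16
print(f"expected shared images:  {expected_overlap(n_items, batch_size):.4f}")
print(f"simulated shared images: {simulated_overlap(n_items, batch_size, 20_000):.4f}")
```

With a batch size of 16, two batches from a test set of this size share far less than one image on average, so repeated samples across iterations are possible but rare.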

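In short, when optimizing the merging coefficients, AdaMerging randomly draws one batch from each dataset to compute the loss, then computes the gradient of the summed loss and updates the merging coefficients.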
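
The loss used in that step is an entropy of the model's predictions on unlabeled data. Below is a minimal pure-Python sketch of the Shannon entropy of a softmax distribution, which is what a function named `softmax_entropy` presumably computes. This is an assumption, since the repository's implementation is not shown in this thread.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def softmax_entropy(logits):
    # Shannon entropy H(p) = -sum_i p_i * log(p_i) of the softmax output.
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

confident = softmax_entropy([10.0, 0.0, 0.0])   # one class dominates
uncertain = softmax_entropy([0.0, 0.0, 0.0])    # uniform over 3 classes
print(f"confident: {confident:.4f}, uncertain: {uncertain:.4f}")
```

A uniform prediction over 3 classes has entropy log(3), while a confident prediction has entropy near zero. Minimizing this quantity therefore pushes the merged model toward confident predictions on the sampled test batches, without using any labels.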

Best regards

Jun 04 '24 03:06 EnnengYang