Gradmatch Data subset selection method making training slow

Open animesh-007 opened this issue 2 years ago • 9 comments

I tried to run some experiments as follows:

  • Trained resnet50 on full cifar10 without any subset selection method, which took around 32m 31s.
  • Ran Gradmatch subset selection on cifar10 with a fraction of 0.1, which took 22h 48m 40s, i.e. far longer than full cifar10.
  • Ran Gradmatch subset selection on cifar10 with a fraction of 0.3, which took even longer than the 0.1 Gradmatch run.

I am using cifar10 images upscaled to 224x224 resolution, with the resnet50 architecture defined accordingly. Can you let me know how to speed up experiments 2 and 3? In general, a subset selection method should speed up the whole training process, right?

animesh-007 avatar Jun 27 '22 19:06 animesh-007

@animesh-007

Can you point out what version of GradMatch you are using?

Ideally, subset selection should be faster unless something is wrong with the experimental setup. Please attach the log files so that I can figure out the issue after analyzing them.

krishnatejakk avatar Jul 14 '22 16:07 krishnatejakk

@krishnatejakk These are the initial logs. Should I paste the whole log? I cloned the repo on June 27, so I guess I am using the latest version.

[06/27 16:40:56] train_sl INFO: DotMap(setting='SL', is_reg=True, dataset=DotMap(name='cifar10', datadir='../storage', feature='dss', type='image'), dataloader=DotMap(shuffle=True, batch_size=256, pin_memory=True, num_workers=8), model=DotMap(architecture='ResNet50_224', type='pre-defined', numclasses=10), ckpt=DotMap(is_load=False, is_save=True, dir='results/', save_every=20), loss=DotMap(type='CrossEntropyLoss', use_sigmoid=False), optimizer=DotMap(type='sgd', momentum=0.9, lr=0.01, weight_decay=0.0005, nesterov=False), scheduler=DotMap(type='cosine_annealing', T_max=300), dss_args=DotMap(type='GradMatch', fraction=0.3, select_every=5, lam=0.5, selection_type='PerClassPerGradient', v1=True, valid=False, kappa=0, eps=1e-100, linear_layer=True), train_args=DotMap(num_epochs=300, device='cuda', print_every=1, results_dir='results/', print_args=['val_loss', 'val_acc', 'tst_loss', 'tst_acc', 'time'], return_args=[]))
Files already downloaded and verified
Files already downloaded and verified
18it [00:01, 10.12it/s]
[06/27 16:41:12] train_sl INFO: Epoch: 1 , Validation Loss: 3.1551918701171875 , Validation Accuracy: 0.1914 , Test Loss: 3.5032728210449218 , Test Accuracy: 0.2142 , Timing: 7.0498366355896
15it [00:01, 10.10it/s]
[06/27 16:41:21] train_sl INFO: Epoch: 2 , Validation Loss: 2.387578009033203 , Validation Accuracy: 0.3002 , Test Loss: 2.735808560180664 , Test Accuracy: 0.3253 , Timing: 6.075047492980957
1it [00:00, 6.72it/s]
[06/27 16:41:31] train_sl INFO: Epoch: 3 , Validation Loss: 2.139058850097656 , Validation Accuracy: 0.3246 , Test Loss: 2.036042041015625 , Test Accuracy: 0.3344 , Timing: 6.058322191238403
8it [00:00, 10.55it/s]
[06/27 16:41:41] train_sl INFO: Epoch: 4 , Validation Loss: 3.5549482177734375 , Validation Accuracy: 0.3576 , Test Loss: 2.480993505859375 , Test Accuracy: 0.3838 , Timing: 5.7214953899383545
9it [00:01, 8.89it/s]
[06/27 16:41:50] train_sl INFO: Epoch: 5 , Validation Loss: 3.782627294921875 , Validation Accuracy: 0.3624 , Test Loss: 3.3407586791992188 , Test Accuracy: 0.39 , Timing: 5.925083160400391
12it [00:01, 10.56it/s]
4it [00:00, 8.48it/s]
5it [00:00, 10.51it/s]
15it [00:01, 11.37it/s]
16it [00:01, 11.82it/s]
18it [00:01, 13.24it/s]
11it [00:01, 9.24it/s]
7it [00:00, 7.66it/s]
2it [00:00, 12.04it/s]
15it [00:01, 11.98it/s]
[06/27 16:58:59] train_sl INFO: Epoch: 6, GradMatch subset selection finished, takes 1028.8181.
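
A rough back-of-the-envelope reading of this log may help locate the slowdown. The sketch below takes the values printed in the log and assumes (this is not confirmed by the log) that a selection runs once every select_every epochs and that each selection costs about as much as the one reported at epoch 6. The 6-second epoch time is an approximation of the logged Timing values.

```python
# Rough estimate of where the wall-clock time goes in the run logged above.
# Assumptions (not confirmed by the log): one selection every `select_every`
# epochs, and every selection costs about the same as the one at epoch 6.
num_epochs = 300          # train_args.num_epochs
select_every = 5          # dss_args.select_every
selection_s = 1028.8181   # "GradMatch subset selection finished, takes 1028.8181"
epoch_s = 6.0             # approximate per-epoch "Timing" value from the log

num_selections = num_epochs // select_every
selection_hours = num_selections * selection_s / 3600
training_hours = num_epochs * epoch_s / 3600

print(f"selections:         {num_selections}")
print(f"selection overhead: {selection_hours:.1f} h")
print(f"subset training:    {training_hours:.1f} h")
```

Under these assumptions the selection step alone would account for roughly 17 hours, while training on the subset would take about half an hour. That points at the selection step, not the training loop, as the part to speed up.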

animesh-007 avatar Jul 14 '22 17:07 animesh-007

@krishnatejakk I got similar results using the cifar10 dataset and a ResNet18 model. For one training epoch, the full dataset took about 50 seconds, while GradMatch and CRAIG each took more than 100 seconds. In addition, GradMatch and CRAIG each took about 100 seconds to select the subset in an epoch.

  1. Can we preprocess the whole dataset first to get the weighted training subset, and then train directly on that weighted subset? That should shorten the training time. Is there an example of this?

  2. Is there a faster subset selection method? Thank you.

[Full dataset]:
INFO: The length of dataloader: 2250
INFO: Training Timing: 50.17572069168091

[GradMatch]:
INFO: The length of dataloader: 225
INFO: GradMatch subset selection finished, takes 99.8966.
INFO: Training Timing: 104.97514295578003

[CRAIG]:
INFO: The length of dataloader: 225
INFO: subset selection finished, takes 108.4812.
INFO: Training Timing: 114.62646007537842
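
One way to read these numbers is to separate the selection cost from the training cost and see how often selection can run before the subset loses its advantage. The sketch below uses the GradMatch timings from the log and assumes, without confirmation from the log, that the reported Training Timing includes the selection time. The helper name mean_epoch_cost is hypothetical.

```python
import math

# Timings read from the GradMatch log above (seconds).
full_epoch_s = 50.17572069168091
gradmatch_epoch_s = 104.97514295578003  # assumed to include the selection time
selection_s = 99.8966

# Assumption: the subset-only training cost is what remains after
# subtracting the selection time from the reported epoch time.
subset_train_s = gradmatch_epoch_s - selection_s


def mean_epoch_cost(select_every: int) -> float:
    """Average seconds per epoch when selection runs once every `select_every` epochs."""
    return subset_train_s + selection_s / select_every


# Smallest select_every for which the subset run is faster on average
# than full-dataset training.
break_even = math.floor(selection_s / (full_epoch_s - subset_train_s)) + 1

for r in (1, 5, 20):
    print(f"select_every={r:2d}: {mean_epoch_cost(r):6.2f} s/epoch (full: {full_epoch_s:.2f})")
print(f"break-even select_every: {break_even}")
```

If the assumption holds, training on the 10% subset costs about 5 seconds per epoch, close to the expected 10x saving, and the per-epoch selection is what makes the run slower than full training. Selecting less often spreads that cost over more epochs.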

shiyf129 avatar Jul 21 '22 09:07 shiyf129

@shiyf129 What is the resolution of the images you are using while training? I am using 224x224.

animesh-007 avatar Jul 21 '22 10:07 animesh-007

@animesh-007 @shiyf129 I am working on the issue. We recently updated the OMP version in the GradMatch code, which should improve its performance further. However, the new OMP version is making it slower in this case. I will debug why it is so slow here.

For faster training, one option is to use GradMatchPB (i.e., the per-batch version). Another is to revert to the previous OMP version in the GradMatch strategy code below:
https://github.com/decile-team/cords/blob/844f897ea4ed7e2f9c1453888022c281bb2091be/cords/selectionstrategies/SL/gradmatchstrategy.py#L6
In the import statement, remove _V1 to revert to the previous version of the OMP code.

krishnatejakk avatar Jul 21 '22 13:07 krishnatejakk

@animesh-007 I use the original cifar10 dataset, with 32x32 images.

shiyf129 avatar Jul 25 '22 01:07 shiyf129

@krishnatejakk I tested the GradMatchPB algorithm with v1=False to use the previous OMP version. I compared the first 10 training epochs of GradMatchPB against full dataset training. The results show that GradMatchPB takes longer and its average accuracy is noticeably lower. Do you know the reason for this?

GradMatchPB

  • the mean epoch training time is 26.70+32.06 = 58.76 seconds
  • the mean accuracy is 0.463

Full dataset training

  • the mean epoch training time is 50.867 seconds
  • the mean accuracy is 0.7548

dss_args=dict(type="GradMatchPB",
              fraction=0.1,
              select_every=20,
              lam=0,
              selection_type='PerBatch',
              v1=False,
              valid=False,
              eps=1e-100,
              linear_layer=True,
              kappa=0),

GradMatchPB, first 10 training epochs:

| Epoch | Subset selection time (s) | Training epoch time (s) | Test accuracy |
|---|---|---|---|
| 1 | 25.85 | 30.91 | 0.3588 |
| 2 | 25.61 | 30.72 | 0.3707 |
| 3 | 25.39 | 31.07 | 0.4201 |
| 4 | 28.71 | 34.43 | 0.4314 |
| 5 | 28.69 | 33.85 | 0.4748 |
| 6 | 25.81 | 31.17 | 0.485 |
| 7 | 29.03 | 34.72 | 0.4881 |
| 8 | 26.78 | 31.85 | 0.511 |
| 9 | 25.82 | 31.45 | 0.537 |
| 10 | 25.4 | 30.47 | 0.5535 |
| Mean | 26.7 | 32.06 | 0.463 |

Full dataset, first 10 training epochs:

| Epoch | Training epoch time (s) | Test accuracy |
|---|---|---|
| 1 | 51.59 | 0.5279 |
| 2 | 52.13 | 0.6543 |
| 3 | 50.17 | 0.7183 |
| 4 | 51.26 | 0.7495 |
| 5 | 51.62 | 0.7779 |
| 6 | 50.14 | 0.8205 |
| 7 | 47.99 | 0.8026 |
| 8 | 51.54 | 0.8324 |
| 9 | 49.91 | 0.8229 |
| 10 | 52.32 | 0.8423 |
| Mean | 50.867 | 0.7548 |

shiyf129 avatar Jul 25 '22 09:07 shiyf129

@shiyf129 Why is subset selection happening every epoch? We usually set the selection interval (select_every) to 20. Subset selection takes some time, and you don't need to select a subset every epoch.

Furthermore, training on a 10% subset should be about 10x faster per epoch than full dataset training. From your logs, it doesn't seem that way. Can you create a 10% subset of the dataset, train on it for one epoch, and check whether it is 10x faster than full training?
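
The check suggested above can be sketched without any selection strategy at all: draw a random 10% of the sample indices and time one pass over them against one pass over everything. This is a framework-agnostic sketch, not CORDS code. The names random_subset, time_epoch and dummy_epoch are hypothetical, the sample count is illustrative, and dummy_epoch is only a stand-in for a real training loop.

```python
import random
import time
from typing import Callable, List, Sequence


def random_subset(num_samples: int, fraction: float, seed: int = 0) -> List[int]:
    """Pick a uniformly random subset of sample indices (no selection strategy)."""
    rng = random.Random(seed)
    k = max(1, round(num_samples * fraction))
    return sorted(rng.sample(range(num_samples), k))


def time_epoch(train_one_epoch: Callable[[Sequence[int]], None],
               indices: Sequence[int]) -> float:
    """Wall-clock seconds for one pass over the given indices."""
    start = time.perf_counter()
    train_one_epoch(indices)
    return time.perf_counter() - start


def dummy_epoch(indices: Sequence[int]) -> None:
    # Placeholder workload: replace with a real training loop over `indices`.
    total = 0
    for _ in indices:
        total += sum(range(200))


num_samples = 50_000  # illustrative dataset size
full = list(range(num_samples))
subset = random_subset(num_samples, fraction=0.1)

t_full = time_epoch(dummy_epoch, full)
t_sub = time_epoch(dummy_epoch, subset)
print(f"full: {t_full:.3f} s, 10% subset: {t_sub:.3f} s, ratio: {t_full / t_sub:.1f}x")
```

If the ratio measured with a real training loop is far below 10x, fixed per-epoch costs such as evaluation or data loading are likely dominating. When timing GPU work, the device may need to be synchronized (for example with torch.cuda.synchronize() in PyTorch) before reading the clock.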

krishnatejakk avatar Jul 25 '22 13:07 krishnatejakk

@krishnatejakk I modified the code to select a subset every 20 epochs. I ran the cifar10 dataset on a ResNet18 model to compare GradMatchPB and the full dataset. Both ran for 10 minutes, and I recorded the test accuracy every minute. The average test accuracy of GradMatchPB is slightly lower than that of the full dataset. What is the reason for this?

| | Full dataset | GradMatchPB (fraction=0.3) | GradMatchPB (fraction=0.1) |
|---|---|---|---|
| Average test accuracy | 0.7633 | 0.7515 | 0.6714 |

shiyf129 avatar Jul 26 '22 12:07 shiyf129