EfficientNet-PyTorch

Transfer Learning not working

Open gost-sniper opened this issue 5 years ago • 12 comments

I'm trying to fine-tune the pretrained efficientnet-b1 model on Places365, but training plateaus at ~25% accuracy. I used the ImageNet auto-augment policy found here, with the code below.

Dataloaders:


def _get_train_data_loader(batch_size, training_dir, is_distributed, **kwargs):
    logger.info(str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ")) + "Get train data loader")
    base_dir = '/dev/shm/places365_standard/'
    defaults.device = torch.device('cuda')

    # resize, apply the ImageNet AutoAugment policy, then normalize with ImageNet mean/std
    dataset = datasets.ImageFolder(base_dir+"train", transform=transforms.Compose(
                        [transforms.Resize(224, interpolation=PIL.Image.BICUBIC),
                         ImageNetPolicy(),
                         transforms.ToTensor(),
                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))]))

    # shard the training set across processes for distributed training
    train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size, pin_memory=True, num_workers=8, sampler=train_sampler)


def _get_test_data_loader(test_batch_size, training_dir, **kwargs):
    logger.info(str(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ")) + "Get test data loader")
    base_dir = '/dev/shm/places365_standard/'
    defaults.device = torch.device('cuda')
    

    dataset = datasets.ImageFolder(base_dir+"val", transform=transforms.Compose(
                        [transforms.Resize(224, interpolation=PIL.Image.BICUBIC), 
                         transforms.ToTensor(),
                         transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
                         ]))
    return torch.utils.data.DataLoader(dataset, batch_size=test_batch_size, num_workers=8, shuffle=True, pin_memory=True)

Training code:

    model = EfficientNet.from_pretrained('efficientnet-b1', num_classes=365).to(device)
    
    # freeze everything except the final classifier ('_fc')
    for n, p in model.named_parameters():
        if '_fc' not in n:
            p.requires_grad = False

    model = torch.nn.parallel.DistributedDataParallel(model)
    
    optimizer = optim.RMSprop(model.parameters(), lr=3e-2, alpha=0.99, 
                                                  eps=1e-08, weight_decay=1e-5, momentum=0.9)
    lmbda = lambda epoch: 0.98739
    scheduler = optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lmbda)
    criterion = nn.CrossEntropyLoss()
    
    best_loss = 10000000
    
    for epoch in range(1, args.epochs + 1):
        model.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            if is_distributed and not use_cuda:
                # average gradients manually for multi-machine cpu case only
                _average_gradients(model)
            optimizer.step()
            if batch_idx % (len(train_loader)-1) == 0 and batch_idx != 0:
                log = 'Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.sampler),
                    100. * batch_idx / len(train_loader), loss.item())
                logger.info(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ") + log)

        test_loss = test(model, test_loader, device)
        scheduler.step()
        if test_loss < best_loss:
            logger.info(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ") + "Best loss : Saving")
            save_model(model, args.model_dir)
            best_loss = test_loss

Test function:


def test(model, test_loader, device):
    model.eval()
    test_loss = 0
    correct = 0
    crit = nn.CrossEntropyLoss(reduction='sum')  # sum per-sample losses; averaged over the whole dataset below
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.cuda(non_blocking=True), target.cuda(non_blocking=True)
            output = model(data)
            test_loss += crit(output, target).item()  # sum up batch loss
            pred = output.max(1, keepdim=True)[1]  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    logger.info(datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S ") + 'Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return test_loss

I don't know what I'm doing wrong. Any help?

gost-sniper avatar Jun 02 '20 10:06 gost-sniper

Maybe you froze the '_fc' layer too?

ziliwang avatar Jun 23 '20 06:06 ziliwang

I'm not sure about this issue. In general, EfficientNets are very hard to train. For future reference, make sure you can:

  • Do transfer learning by freezing all but the last layer (another way to do this is to construct a simple linear model on top of the .extract_features function; see the sketch after this list)
  • Overfit on a small percent of the training data
  • Train a different model (e.g. a ResNet) successfully on your full dataset

Then return to trying to train EfficientNet on your full dataset.
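
For reference, here is a minimal sketch of the linear-model-on-top-of-.extract_features approach from the first bullet. The wrapper class, pooling choice, and learning rate are illustrative assumptions, not code from this repo:

    import torch
    import torch.nn as nn
    from efficientnet_pytorch import EfficientNet

    class FrozenBackboneClassifier(nn.Module):
        # hypothetical wrapper: frozen EfficientNet features + a trainable linear head
        def __init__(self, arch='efficientnet-b1', num_classes=365):
            super().__init__()
            self.backbone = EfficientNet.from_pretrained(arch)
            for p in self.backbone.parameters():
                p.requires_grad = False          # freeze the whole backbone
            self.backbone.eval()                 # keep BatchNorm stats fixed (re-apply after model.train() if needed)
            self.pool = nn.AdaptiveAvgPool2d(1)
            self.head = nn.Linear(self.backbone._fc.in_features, num_classes)

        def forward(self, x):
            with torch.no_grad():                # no gradients through the frozen backbone
                feats = self.backbone.extract_features(x)
            return self.head(self.pool(feats).flatten(1))

    model = FrozenBackboneClassifier()
    optimizer = torch.optim.RMSprop(model.head.parameters(), lr=1e-3)

If a plain linear head like this cannot learn Places365 to a reasonable accuracy, the problem is more likely in the data pipeline than in the fine-tuning setup.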

lukemelas avatar Jun 23 '20 22:06 lukemelas

Maybe you froze the '_fc' layer too?

No, I froze all but the '_fc' layer.

gost-sniper avatar Jun 23 '20 22:06 gost-sniper

Same here. I used a different dataset and also see accuracy stuck around the 25% range.

datduong avatar Aug 19 '20 04:08 datduong

I'm also training EfficientNet-B2 on the Places365-Standard dataset. I'm training _swish from the last block (_blocks.22) and freezing the rest. I'm currently at about 40% top-1 accuracy on the validation data. Any advice on this issue?

teraoka-hiroshi avatar Oct 18 '20 23:10 teraoka-hiroshi

@lukemelas's advice is very helpful, try it out.

gost-sniper avatar Nov 17 '20 10:11 gost-sniper

@gost-sniper @lukemelas In b2, the frozen layers were blocks 20, 21, 22 and the FC layer. As a result, accuracy increased to nearly 55% on the Places365-Standard data. Accuracy was further improved by decaying the learning rate at epochs 30, 60, and 90.
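
For the schedule, something like torch.optim.lr_scheduler.MultiStepLR gives that kind of decay. The model, optimizer and gamma below are placeholders to illustrate the idea, not the actual training setup:

    import torch.nn as nn
    import torch.optim as optim

    # placeholder model and optimizer, only to show the schedule
    model = nn.Linear(10, 365)
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 90], gamma=0.1)

    for epoch in range(100):
        # ... run one training epoch here ...
        scheduler.step()   # lr drops by 10x after epochs 30, 60 and 90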

teraoka-hiroshi avatar Nov 21 '20 13:11 teraoka-hiroshi

@aporo4000 Can you show the code used for the training phase?

gost-sniper avatar Nov 21 '20 13:11 gost-sniper

@gost-sniper @lukemelas In b2, the frozen layers were blocks 20, 21, 22 and the FC layer. As a result, accuracy increased to nearly 55% on the Places365-Standard data. Accuracy was further improved by decaying the learning rate at epochs 30, 60, and 90.

Why would freezing the FC layer work? That doesn't seem to make sense.

ANYMS-A avatar Dec 17 '20 03:12 ANYMS-A

@crissallan I made a mistake in writing that. In b2, all layers except blocks 20, 21, 22 and the FC layer were frozen.

teraoka-hiroshi avatar Jan 12 '21 06:01 teraoka-hiroshi

@gost-sniper Somewhat inefficiently, we set up the frozen layers with:

    model = EfficientNet.from_pretrained(args.arch, advprop=args.advprop, num_classes=365)
    # freeze everything first
    for param in model.parameters():
        param.requires_grad = False
    # then unfreeze the last three blocks and the classifier
    for name, module in model.named_modules():
        if name in ('_blocks.20', '_blocks.21', '_blocks.22', '_fc'):
            for param in module.parameters():
                param.requires_grad = True

teraoka-hiroshi avatar Jan 12 '21 06:01 teraoka-hiroshi

Hi @gost-sniper, did you fix the problem?

Could you please share your training code with me? ([email protected])

I am facing problems with code I wrote here.

Thank you!

alancarlosml avatar Nov 11 '21 22:11 alancarlosml