
Can't use OrderedDict inside nn.LayerChoice when using ProxylessTrainer

Open · AL3708 opened this issue 3 years ago

ProxylessTrainer forces the candidate ops inside nn.LayerChoice to be given as a plain list; an OrderedDict cannot be used. The reason is that the positional order of the ops is mapped to a name and used inside the latency predictor. This is inconsistent with the documentation, which says both forms can be used.

For example, with a block like this:

from collections import OrderedDict

import nni.retiarii.nn.pytorch as nn
from torch.nn import Identity


class ChoiceBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.block = nn.LayerChoice(OrderedDict([
            # ConvBlock is a standard Conv-BN-act helper defined elsewhere
            ('3x3', ConvBlock(in_channels, out_channels, kernel_size=3)),
            ('1x3', ConvBlock(in_channels, out_channels, kernel_size=(1, 3))),
            ('3x1', ConvBlock(in_channels, out_channels, kernel_size=(3, 1))),
            ('3x3_sep', ConvBlock(in_channels, out_channels, kernel_size=3, groups=in_channels)),
            ('identity', Identity()),
        ]))

Then an error is thrown:

Traceback (most recent call last):
  File "C:\Users\...\proxylessnas.py", line 373, in <module>
    main()
  File "C:\Users\...\proxylessnas.py", line 359, in main
    trainer.fit()
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 363, in fit
    self._train_one_epoch(i)
  File "C:\Users\...\proxylessnas.py", line 295, in _train_one_epoch
    logits, loss = self._logits_and_loss_for_arch_update(val_X, val_y)
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 330, in _logits_and_loss_for_arch_update
    expected_latency = self.latency_estimator.cal_expected_latency(current_architecture_prob)
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 168, in cal_expected_latency
    lat += torch.sum(torch.tensor([probs[i] * self.block_latency_table[module_name][str(i)]
  File "C:\Users\...\lib\site-packages\nni\retiarii\oneshot\pytorch\proxyless.py", line 168, in <listcomp>
    lat += torch.sum(torch.tensor([probs[i] * self.block_latency_table[module_name][str(i)]
KeyError: '0'
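
As a workaround, the same choice can be declared with a plain list of candidates, which avoids the KeyError because the latency table is then looked up by position. Below is a minimal sketch (the class name is just for illustration, and it assumes the same ConvBlock helper as in the snippet above); the human-readable candidate names are lost:

import nni.retiarii.nn.pytorch as nn
from torch.nn import Identity


class ChoiceBlockListed(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Plain list: candidates are addressed by position ('0', '1', ...),
        # which matches how ProxylessTrainer keys its latency table.
        self.block = nn.LayerChoice([
            ConvBlock(in_channels, out_channels, kernel_size=3),
            ConvBlock(in_channels, out_channels, kernel_size=(1, 3)),
            ConvBlock(in_channels, out_channels, kernel_size=(3, 1)),
            ConvBlock(in_channels, out_channels, kernel_size=3, groups=in_channels),
            Identity(),
        ])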

Environment:

  • NNI version: 2.8
  • Training service (local|remote|pai|aml|etc): local
  • Client OS: Windows 10
  • Python version: 3.10
  • PyTorch version: 1.12
  • Is conda/virtualenv/venv used?: Pipenv
  • Is running in Docker?: No

AL3708 · Aug 22 '22 14:08

This is indeed a mishandled case.

However, ProxylessTrainer has been deprecated, so we don't have the hands to fix this issue ourselves. That is unfortunate, but you are welcome to fix it and contribute the change back if you are interested.
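
For anyone who wants to pick this up: judging from the traceback, the latency table for a LayerChoice built from an OrderedDict presumably ends up keyed by the candidate names, while cal_expected_latency indexes it with the stringified position (str(i)). Here is a toy illustration of the mismatch and one possible direction for a fix; this is only a sketch, not the actual proxyless.py code:

# Toy illustration of the mismatch -- not the actual proxyless.py code.
# With an OrderedDict LayerChoice, the per-module latency table is keyed by candidate names:
block_latency_table = {
    'cell': {'3x3': 1.2, '1x3': 0.8, '3x1': 0.8, '3x3_sep': 0.5, 'identity': 0.1}
}
probs = [0.3, 0.2, 0.2, 0.2, 0.1]  # architecture probabilities, indexed by position

# ...but the estimator looks entries up by stringified position:
# block_latency_table['cell'][str(0)]  ->  KeyError: '0'

# One possible fix: pair probabilities with the table's own keys, so the lookup
# works whether the choice was declared as a list or as an OrderedDict.
expected_latency = sum(
    p * lat for p, lat in zip(probs, block_latency_table['cell'].values())
)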

ultmaster · Aug 23 '22 05:08

You might want to try the latest version (v2.9).

matluster · Sep 28 '22 02:09

@AL3708 - have you had a chance to upgrade your nni to 2.9?

scarlett2018 · Oct 08 '22 08:10

Feel free to reopen if you have any other questions. @AL3708

Lijiaoa · Mar 10 '23 01:03