
pytorch-jacinto-ai-devkit train time problem

Open WangGangUCAS opened this issue 5 years ago • 17 comments

Hello, I have a question. When I train my model without xnn.quantize.QuantTrainModule, one epoch takes 30 minutes. When I add xnn.quantize.QuantTrainModule to my training code, one epoch takes 4 hours. Both runs use the same config.

WangGangUCAS avatar Aug 28 '20 08:08 WangGangUCAS

The quantization simulation required for QAT is done in Python-level PyTorch code. This may be the reason for the slowness. It would be faster if it were done in the underlying C++ foundation of PyTorch. (Sometime later, I plan to try out PyTorch's native quantization scheme and see if it is faster.)

You don't need to run as many epochs in QAT as in the original training; fewer epochs are sufficient. So hopefully it is okay even though it is slow.
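For readers unfamiliar with the term, "quantization simulation" (often called fake quantization) means rounding float values onto an integer grid and converting them straight back to float. The sketch below shows the arithmetic in plain Python. It is not the devkit's implementation: the function name `fake_quantize` and the symmetric, max-based scale are assumptions chosen for illustration. In QAT, a step of this kind typically runs on weights and activations in every forward pass, which is one plausible source of the extra time per epoch.

```python
def fake_quantize(values, num_bits=8):
    """Quantize-dequantize floats on a symmetric signed integer grid.

    Illustrative only: the max-based scale is an assumption, not the
    range-estimation scheme used by any particular toolkit.
    """
    qmax = 2 ** (num_bits - 1) - 1      # 127 for 8 bits
    qmin = -qmax - 1                    # -128 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)             # nothing to scale
    scale = max_abs / qmax              # float size of one integer step
    result = []
    for v in values:
        q = round(v / scale)            # snap to the nearest integer level
        q = max(qmin, min(qmax, q))     # clamp to the representable range
        result.append(q * scale)        # back to float ("dequantize")
    return result


weights = [-1.0, -0.26, 0.0, 0.004, 0.5, 1.0]
for bits in (8, 16):
    approx = fake_quantize(weights, bits)
    worst = max(abs(a - w) for a, w in zip(approx, weights))
    print(f"{bits}-bit worst-case rounding error: {worst:.6f}")
```

The rounding error shrinks as the bit width grows, which is consistent with 16-bit inference staying closer to floating point than 8-bit inference.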

mathmanu avatar Aug 28 '20 09:08 mathmanu

> The quantization simulation required for QAT is done in Python-level PyTorch code. This may be the reason for the slowness. It would be faster if it were done in the underlying C++ foundation of PyTorch. (Sometime later, I plan to try out PyTorch's native quantization scheme and see if it is faster.)
>
> You don't need to run as many epochs in QAT as in the original training; fewer epochs are sufficient. So hopefully it is okay even though it is slow.

Thanks. Is it the same for you? Does training slow down when you add xnn.quantize.QuantTrainModule? This is my quant-train code; is there an error in it?

```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import _init_paths

import os

import torch
import torch.utils.data
from opts import opts
from models.model import create_model, load_model, save_model
from models.data_parallel import DataParallel
from logger import Logger
from datasets.dataset_factory import get_dataset
from trains.train_factory import train_factory
from pytorch_jacinto_ai import xnn
import torch.backends.cudnn as cudnn
import torchsummary


def get_model_orig(model):
    is_parallel_model = isinstance(
        model, (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel))
    model_orig = (model.module if is_parallel_model else model)
    model_orig = (model_orig.module
                  if isinstance(model_orig, (xnn.quantize.QuantBaseModule))
                  else model_orig)
    return model_orig


def main(opt):
    torch.manual_seed(opt.seed)
    torch.backends.cudnn.benchmark = not opt.not_cuda_benchmark and not opt.test
    Dataset = get_dataset(opt.dataset, opt.task)
    opt = opts().update_dataset_info_and_set_heads(opt, Dataset)
    print(opt)

    logger = Logger(opt)

    os.environ['CUDA_VISIBLE_DEVICES'] = opt.gpus_str
    opt.device = torch.device('cuda' if opt.gpus[0] >= 0 else 'cpu')

    print('Creating model...')
    model = create_model(opt.arch, opt.heads, opt.head_conv, opt.show_gmacs)
    dummy_input = torch.rand((1, 3, 512, 1024))
    model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)
    model = get_model_orig(model)
    optimizer = torch.optim.Adam(model.parameters(), opt.lr)
    start_epoch = 0
    if opt.load_model != '':
        model, optimizer, start_epoch = load_model(
            model, opt.load_model, optimizer, opt.resume, opt.lr, opt.lr_step)

    Trainer = train_factory[opt.task]
    trainer = Trainer(opt, model, optimizer)
    trainer.set_device(opt.gpus, opt.chunk_sizes, opt.device)

    print('Setting up data...')
    val_loader = torch.utils.data.DataLoader(
        Dataset(opt, 'val'),
        batch_size=1,
        shuffle=False,
        num_workers=8,
        pin_memory=True
    )

    if opt.test:
        _, preds = trainer.val(0, val_loader)
        val_loader.dataset.run_eval(preds, opt.save_dir)
        return

    train_loader = torch.utils.data.DataLoader(
        Dataset(opt, 'train'),
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.num_workers,
        pin_memory=True,
        drop_last=True
    )

    print('Starting training...')
    best = 1e10
    for epoch in range(start_epoch + 1, opt.num_epochs + 1):
        mark = epoch if opt.save_all else 'last'
        log_dict_train, _ = trainer.train(epoch, train_loader)
        logger.write('epoch: {} |'.format(epoch))
        for k, v in log_dict_train.items():
            logger.scalar_summary('train_{}'.format(k), v, epoch)
            logger.write('{} {:8f} | '.format(k, v))
        if opt.val_intervals > 0 and epoch % opt.val_intervals == 0:
            save_model(os.path.join(opt.save_dir, 'model_{}.pth'.format(mark)),
                       epoch, model, optimizer)
            with torch.no_grad():
                log_dict_val, preds = trainer.val(epoch, val_loader)
            for k, v in log_dict_val.items():
                logger.scalar_summary('val_{}'.format(k), v, epoch)
                logger.write('{} {:8f} | '.format(k, v))
            if log_dict_val[opt.metric] < best:
                best = log_dict_val[opt.metric]
                save_model(os.path.join(opt.save_dir, 'model_best.pth'),
                           epoch, model)
        else:
            # dummy_input = torch.rand((1, 3, 512, 1024))
            # dummy_input = dummy_input.to(opt.device)
            # torch.onnx.export(model, dummy_input, os.path.join(opt.save_dir,'model_last.onnx'), export_params=True, verbose=False, do_constant_folding=True, opset_version=9)
            save_model(os.path.join(opt.save_dir, 'model_last.pth'),
                       epoch, model, optimizer)
        logger.write('\n')
        if epoch in opt.lr_step:
            save_model(os.path.join(opt.save_dir, 'model_{}.pth'.format(epoch)),
                       epoch, model, optimizer)
            lr = opt.lr * (0.1 ** (opt.lr_step.index(epoch) + 1))
            print('Drop LR to', lr)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
    logger.close()


if __name__ == '__main__':
    opt = opts().parse()
    main(opt)
```

WangGangUCAS avatar Aug 29 '20 01:08 WangGangUCAS

I believe there is a mistake in your code, but fixing it will not change the speed. QAT training will be slower than regular training.

For QAT training, the trainer must be given the model wrapped in QuantTrainModule. Try correcting the code as shown below.

```python
print('Creating model...')
model = create_model(opt.arch, opt.heads, opt.head_conv, opt.show_gmacs)
dummy_input = torch.rand((1, 3, 512, 1024))
# Keep the QuantTrainModule wrapper in `model`; this is what gets trained.
model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)
# The unwrapped module is used only for loading the pretrained weights.
model_orig = get_model_orig(model)
optimizer = torch.optim.Adam(model.parameters(), opt.lr)
start_epoch = 0
if opt.load_model != '':
    model_orig, optimizer, start_epoch = load_model(
        model_orig, opt.load_model, optimizer, opt.resume, opt.lr, opt.lr_step)

Trainer = train_factory[opt.task]
trainer = Trainer(opt, model, optimizer)
```
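Why it matters which object reaches the trainer can be shown with a toy sketch in plain Python. `ToyModel` and `ToyQuantWrapper` are hypothetical stand-ins, not devkit classes; the wrapper simply rounds the inner model's output to a coarse grid to mimic simulated quantization. The assumption, mirroring `get_model_orig`, is that the wrapper keeps the float model in a `.module` attribute.

```python
class ToyModel:
    """Stand-in for a float network: multiplies the input by one weight."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, x):
        return self.weight * x


class ToyQuantWrapper:
    """Hypothetical QAT-style wrapper: rounds the inner output to a grid."""

    def __init__(self, module, step=0.5):
        self.module = module            # the original float model
        self.step = step

    def __call__(self, x):
        y = self.module(x)
        return round(y / self.step) * self.step


def unwrap(model):
    """Return the inner float model if `model` is a wrapper."""
    return model.module if isinstance(model, ToyQuantWrapper) else model


float_model = ToyModel(weight=0.3)
wrapped = ToyQuantWrapper(float_model)
inner = unwrap(wrapped)

print(wrapped(1.0))   # forward pass goes through the rounding step
print(inner(1.0))     # forward pass bypasses the rounding step

# The wrapper and the inner model share state, so weights loaded into the
# inner model are seen by the wrapper as well.
inner.weight = 0.9
print(wrapped(1.0))
```

Reassigning `model = unwrap(model)` before training would discard the wrapper, so the training loop would never see the simulated quantization. Keeping both names, one for training and one for loading weights, avoids that.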

mathmanu avatar Aug 29 '20 06:08 mathmanu

As you suggested, I modified my quant-train code. My task is object detection. With the int8 model before quant-train, the AP is 88.26%; with the int8 model after quant-train, the AP is 88.29%, so it improves very little. With the int16 model before quant-train, the AP is 94.83%. This is my new quant-train code:

```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import _init_paths

import os

import torch
import torch.utils.data
from opts import opts
from models.model import create_model, load_model, save_model
from models.data_parallel import DataParallel
from logger import Logger
from datasets.dataset_factory import get_dataset
from trains.train_factory import train_factory
from pytorch_jacinto_ai import xnn
import torch.backends.cudnn as cudnn
import torchsummary


def get_model_orig(model):
    is_parallel_model = isinstance(
        model, (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel))
    model_orig = (model.module if is_parallel_model else model)
    model_orig = (model_orig.module
                  if isinstance(model_orig, (xnn.quantize.QuantBaseModule))
                  else model_orig)
    return model_orig


def main(opt):
    torch.manual_seed(opt.seed)
    torch.backends.cudnn.benchmark = not opt.not_cuda_benchmark and not opt.test
    Dataset = get_dataset(opt.dataset, opt.task)
    opt = opts().update_dataset_info_and_set_heads(opt, Dataset)
    print(opt)

    logger = Logger(opt)

    os.environ['CUDA_VISIBLE_DEVICES'] = opt.gpus_str
    opt.device = torch.device('cuda' if opt.gpus[0] >= 0 else 'cpu')

    print('Creating model...')
    model = create_model(opt.arch, opt.heads, opt.head_conv, opt.show_gmacs)
    dummy_input = torch.rand((1, 3, 512, 1024))
    model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)
    model_orig = get_model_orig(model)
    optimizer = torch.optim.Adam(model.parameters(), opt.lr)
    start_epoch = 0
    if opt.load_model != '':
        model_orig, optimizer, start_epoch = load_model(
            model_orig, opt.load_model, optimizer, opt.resume, opt.lr, opt.lr_step)

    Trainer = train_factory[opt.task]
    trainer = Trainer(opt, model, optimizer)
    trainer.set_device(opt.gpus, opt.chunk_sizes, opt.device)

    print('Setting up data...')
    val_loader = torch.utils.data.DataLoader(
        Dataset(opt, 'val'),
        batch_size=1,
        shuffle=False,
        num_workers=8,
        pin_memory=True
    )

    if opt.test:
        _, preds = trainer.val(0, val_loader)
        val_loader.dataset.run_eval(preds, opt.save_dir)
        return

    train_loader = torch.utils.data.DataLoader(
        Dataset(opt, 'train'),
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.num_workers,
        pin_memory=True,
        drop_last=True
    )

    print('Starting training...')
    best = 1e10
    for epoch in range(start_epoch + 1, opt.num_epochs + 1):
        mark = epoch if opt.save_all else 'last'
        log_dict_train, _ = trainer.train(epoch, train_loader)
        logger.write('epoch: {} |'.format(epoch))
        for k, v in log_dict_train.items():
            logger.scalar_summary('train_{}'.format(k), v, epoch)
            logger.write('{} {:8f} | '.format(k, v))
        model_orig = model.module if isinstance(
            model, (torch.nn.parallel.DistributedDataParallel,
                    torch.nn.parallel.DataParallel)) else model
        model_orig = model_orig.module
        if opt.val_intervals > 0 and epoch % opt.val_intervals == 0:
            save_model(os.path.join(opt.save_dir, 'model_{}.pth'.format(mark)),
                       epoch, model_orig, optimizer)
            with torch.no_grad():
                log_dict_val, preds = trainer.val(epoch, val_loader)
            for k, v in log_dict_val.items():
                logger.scalar_summary('val_{}'.format(k), v, epoch)
                logger.write('{} {:8f} | '.format(k, v))
            if log_dict_val[opt.metric] < best:
                best = log_dict_val[opt.metric]
                dummy_input = torch.rand((1, 3, 512, 1024))
                dummy_input = dummy_input.to(opt.device)
                torch.onnx.export(model_orig, dummy_input,
                                  os.path.join(opt.save_dir, 'model_best.onnx'),
                                  export_params=True, verbose=False,
                                  do_constant_folding=True, opset_version=9)
                save_model(os.path.join(opt.save_dir, 'model_best.pth'),
                           epoch, model_orig)
        else:
            dummy_input = torch.rand((1, 3, 512, 1024))
            dummy_input = dummy_input.to(opt.device)
            torch.onnx.export(model_orig, dummy_input,
                              os.path.join(opt.save_dir, 'model_last.onnx'),
                              export_params=True, verbose=False,
                              do_constant_folding=True, opset_version=9)
            save_model(os.path.join(opt.save_dir, 'model_last.pth'),
                       epoch, model_orig, optimizer)
        logger.write('\n')
        if epoch in opt.lr_step:
            save_model(os.path.join(opt.save_dir, 'model_{}.pth'.format(epoch)),
                       epoch, model_orig, optimizer)
            lr = opt.lr * (0.1 ** (opt.lr_step.index(epoch) + 1))
            print('Drop LR to', lr)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
    logger.close()


if __name__ == '__main__':
    opt = opts().parse()
    main(opt)
```

WangGangUCAS avatar Aug 31 '20 02:08 WangGangUCAS

Hi, what is the accuracy that you get with QAT in the PyTorch code during validation?

mathmanu avatar Aug 31 '20 04:08 mathmanu

Also, please tell me the accuracy that you get with 8-bit quantization when you set calibrationOption = 7 in the TIDL import config. This should be with the original model (not the QAT model).

mathmanu avatar Aug 31 '20 04:08 mathmanu

1. I don't calculate the accuracy with QAT in the PyTorch code during validation.
2. When I set calibrationOption = 7 in the TIDL import config, the AP is 86.43% (8-bit quantization, original model). The 16-bit inference provides accuracy close to floating point (94%).

WangGangUCAS avatar Aug 31 '20 06:08 WangGangUCAS

  1. Have you followed the guidelines and restrictions in https://git.ti.com/cgit/jacinto-ai/pytorch-jacinto-ai-devkit/about/docs/Quantization.md? Typically when I ask this question everybody says yes, but when we dig deeper and look more closely, we often find that they were not followed closely. So please double-check and make sure.

  2. What is the weight decay value being used? Please check whether this weight decay / regularization is being applied to all parameters.

  3. How many images do you use for import/calibration in TIDL? Can you try to use at least 50 to 100 diverse images from your training set and see if it improves the accuracy?

  4. When you use a regular float model in TIDL, calibrationOption=7 during TIDL import can often improve the accuracy. But when you use a QAT model in TIDL, always set calibrationOption=0.
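The weight decay check in item 2 can be done by inspecting the optimizer's parameter groups. The helper below is a minimal sketch; `weight_decay_report` is a hypothetical name, not part of PyTorch or the devkit. It relies only on the fact that a PyTorch optimizer exposes `param_groups`, a list of dicts with `'params'` and `'weight_decay'` keys, so the demo can use plain stand-in objects instead of real tensors.

```python
def weight_decay_report(param_groups, named_parameters):
    """Map each parameter name to its weight decay; list uncovered names.

    param_groups: a list of dicts shaped like optimizer.param_groups.
    named_parameters: (name, parameter) pairs, e.g. model.named_parameters().
    """
    decay_by_id = {}
    for group in param_groups:
        for param in group["params"]:
            decay_by_id[id(param)] = group.get("weight_decay", 0.0)

    decay_by_name = {}
    missing = []                         # parameters the optimizer never sees
    for name, param in named_parameters:
        if id(param) in decay_by_id:
            decay_by_name[name] = decay_by_id[id(param)]
        else:
            missing.append(name)
    return decay_by_name, missing


# Demo with stand-in objects in place of tensors (illustrative names only).
conv_weight, conv_bias = object(), object()
groups = [{"params": [conv_weight], "weight_decay": 1e-4}]
names = [("conv.weight", conv_weight), ("conv.bias", conv_bias)]

decay, missing = weight_decay_report(groups, names)
print("weight decay per parameter:", decay)
print("not in any optimizer group:", missing)
```

With a real training script the call would be `weight_decay_report(optimizer.param_groups, model.named_parameters())`. Note that the script posted earlier builds the optimizer as `torch.optim.Adam(model.parameters(), opt.lr)`, so unless `weight_decay` is set somewhere else it stays at Adam's default of 0.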

mathmanu avatar Aug 31 '20 07:08 mathmanu

Thank you very much. I will check carefully and make sure.

WangGangUCAS avatar Aug 31 '20 07:08 WangGangUCAS

Great! Let us know your progress.

mathmanu avatar Aug 31 '20 07:08 mathmanu

Hi,

I am not knowledgeable about vision apps and how TIDL is integrated in the system. Can you try asking the question in the following forum: https://e2e.ti.com/support/processors/f/791/tags/TIDL

There are experts there who can help you with questions regarding vision apps as well as other system level examples.

mathmanu avatar Sep 03 '20 03:09 mathmanu

I have checked my model structure carefully. The AP is still low compared with the floating-point result. Have you used CenterNet for object detection? My backbone is ResNet18, the neck is FPN, and the head is the CenterNet head. Thank you.

WangGangUCAS avatar Sep 21 '20 05:09 WangGangUCAS

mmdetection has promised that they will support CenterNet, and I am looking forward to it being supported: https://github.com/open-mmlab/mmdetection/issues/2931 "We have heard the voice from the community for CenterNet, and will increase the priority in our roadmap. Hopefully we will introduce it to mmdet V2.4."

mathmanu avatar Sep 21 '20 05:09 mathmanu

Hi, we have published an object detection training package called Pytorch-MMDetection, which is basically an extension of mmdetection. We support several low-complexity models, and if you study accuracy vs. complexity (GigaMACs), not many models in the public domain can match them at such low complexity. https://github.com/TexasInstruments/jacinto-ai-devkit https://git.ti.com/cgit/jacinto-ai/pytorch-mmdetection/about/

We support both SSD and RetinaNet training. You can also export the model to ONNX + prototxt, which TIDL readily understands, so inference on TIDL is straightforward. The detection head in TIDL is being optimized, and it will be much faster in the next release for SSD and RetinaNet.

We have even better accuracy results than those listed in that repository, and we will publish the new results soon (and, if possible, the models as well).

I mention this so that if you want to train SSD or RetinaNet using our repository, you can do that.

mathmanu avatar Sep 21 '20 05:09 mathmanu

Another option is for you to share your CenterNet ONNX model (both without and with QAT) so that I can take a look at it. By the way, who is the FAE/Apps Support Engineer from TI that you are interacting with? I can also have a word with them on how best to support you.

mathmanu avatar Sep 21 '20 06:09 mathmanu

Thank you, I will try the SSD.

WangGangUCAS avatar Sep 22 '20 02:09 WangGangUCAS

@WangGangUCAS Have a look at our latest Object Detection results: https://git.ti.com/cgit/jacinto-ai/pytorch-mmdetection/about/docs/det_modelzoo.md

I believe they are quite competitive if you consider the complexity and accuracy.

mathmanu avatar Oct 05 '20 08:10 mathmanu