jacinto-ai-devkit
pytorch-jacinto-ai-devkit train time problem
Hello, I have a question. When I train my model without xnn.quantize.QuantTrainModule, one epoch takes 30 minutes, but when I add xnn.quantize.QuantTrainModule to my training code, one epoch takes 4 hours. Both runs use the same config.
The Quantization simulation required for QAT is done in Pytorch code. This may be the reason for slowness. It will be faster if it is done in the underlying C++ foundation of Pytorch. (Sometime later, I plan to try out PyTorch's native quantization scheme and see if it is faster).
You don't need to run as many epochs in QAT compared to the original training - lower number of epochs are sufficient. So hopefully it is okay although it is slow.
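For reference, the PyTorch-native scheme mentioned above would look roughly like the eager-mode sketch below. This is only a sketch for comparison: the TinyNet toy model is hypothetical and none of it is part of pytorch-jacinto-ai-devkit.

```python
# Minimal sketch of PyTorch's native eager-mode QAT flow (torch.quantization),
# shown only for comparison with xnn.quantize.QuantTrainModule.
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical toy model, not from the devkit
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

model = TinyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(model, inplace=True)  # insert fake-quant observers

# ... run a shortened training loop on `model` here (fewer epochs than float training) ...

model.eval()
model_int8 = torch.quantization.convert(model)  # produce the quantized int8 model
```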
Thanks. Is it the same for you: does training slow down when you add xnn.quantize.QuantTrainModule? This is my quant-train code; is there an error in it?
```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import _init_paths

import os

import torch
import torch.utils.data
from opts import opts
from models.model import create_model, load_model, save_model
from models.data_parallel import DataParallel
from logger import Logger
from datasets.dataset_factory import get_dataset
from trains.train_factory import train_factory
from pytorch_jacinto_ai import xnn
import torch.backends.cudnn as cudnn
import torchsummary


def get_model_orig(model):
    is_parallel_model = isinstance(model, (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel))
    model_orig = (model.module if is_parallel_model else model)
    model_orig = (model_orig.module if isinstance(model_orig, (xnn.quantize.QuantBaseModule)) else model_orig)
    return model_orig


def main(opt):
    torch.manual_seed(opt.seed)
    torch.backends.cudnn.benchmark = not opt.not_cuda_benchmark and not opt.test
    Dataset = get_dataset(opt.dataset, opt.task)
    opt = opts().update_dataset_info_and_set_heads(opt, Dataset)
    print(opt)

    logger = Logger(opt)

    os.environ['CUDA_VISIBLE_DEVICES'] = opt.gpus_str
    opt.device = torch.device('cuda' if opt.gpus[0] >= 0 else 'cpu')

    print('Creating model...')
    model = create_model(opt.arch, opt.heads, opt.head_conv, opt.show_gmacs)
    dummy_input = torch.rand((1, 3, 512, 1024))
    model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)
    model = get_model_orig(model)
    optimizer = torch.optim.Adam(model.parameters(), opt.lr)
    start_epoch = 0
    if opt.load_model != '':
        model, optimizer, start_epoch = load_model(
            model, opt.load_model, optimizer, opt.resume, opt.lr, opt.lr_step)

    Trainer = train_factory[opt.task]
    trainer = Trainer(opt, model, optimizer)
    trainer.set_device(opt.gpus, opt.chunk_sizes, opt.device)

    print('Setting up data...')
    val_loader = torch.utils.data.DataLoader(
        Dataset(opt, 'val'),
        batch_size=1,
        shuffle=False,
        num_workers=8,
        pin_memory=True
    )

    if opt.test:
        _, preds = trainer.val(0, val_loader)
        val_loader.dataset.run_eval(preds, opt.save_dir)
        return

    train_loader = torch.utils.data.DataLoader(
        Dataset(opt, 'train'),
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.num_workers,
        pin_memory=True,
        drop_last=True
    )

    print('Starting training...')
    best = 1e10
    for epoch in range(start_epoch + 1, opt.num_epochs + 1):
        mark = epoch if opt.save_all else 'last'
        log_dict_train, _ = trainer.train(epoch, train_loader)
        logger.write('epoch: {} |'.format(epoch))
        for k, v in log_dict_train.items():
            logger.scalar_summary('train_{}'.format(k), v, epoch)
            logger.write('{} {:8f} | '.format(k, v))
        if opt.val_intervals > 0 and epoch % opt.val_intervals == 0:
            save_model(os.path.join(opt.save_dir, 'model_{}.pth'.format(mark)),
                       epoch, model, optimizer)
            with torch.no_grad():
                log_dict_val, preds = trainer.val(epoch, val_loader)
            for k, v in log_dict_val.items():
                logger.scalar_summary('val_{}'.format(k), v, epoch)
                logger.write('{} {:8f} | '.format(k, v))
            if log_dict_val[opt.metric] < best:
                best = log_dict_val[opt.metric]
                save_model(os.path.join(opt.save_dir, 'model_best.pth'),
                           epoch, model)
        else:
            # dummy_input = torch.rand((1, 3, 512, 1024))
            # dummy_input = dummy_input.to(opt.device)
            # torch.onnx.export(model, dummy_input, os.path.join(opt.save_dir, 'model_last.onnx'),
            #                   export_params=True, verbose=False, do_constant_folding=True, opset_version=9)
            save_model(os.path.join(opt.save_dir, 'model_last.pth'),
                       epoch, model, optimizer)
        logger.write('\n')
        if epoch in opt.lr_step:
            save_model(os.path.join(opt.save_dir, 'model_{}.pth'.format(epoch)),
                       epoch, model, optimizer)
            lr = opt.lr * (0.1 ** (opt.lr_step.index(epoch) + 1))
            print('Drop LR to', lr)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
    logger.close()


if __name__ == '__main__':
    opt = opts().parse()
    main(opt)
```
I believe there is a mistake, but it will not change the speed: QAT training will be slower than regular training.

For QAT you need to pass the model wrapped in QuantTrainModule to the trainer. Try correcting the code as shown below.
```python
print('Creating model...')
model = create_model(opt.arch, opt.heads, opt.head_conv, opt.show_gmacs)
dummy_input = torch.rand((1, 3, 512, 1024))
model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)
model_orig = get_model_orig(model)
optimizer = torch.optim.Adam(model.parameters(), opt.lr)
start_epoch = 0
if opt.load_model != '':
    model_orig, optimizer, start_epoch = load_model(
        model_orig, opt.load_model, optimizer, opt.resume, opt.lr, opt.lr_step)

Trainer = train_factory[opt.task]
trainer = Trainer(opt, model, optimizer)
```
As you said, I modified my quant-train code. My task is object detection. With the int8 model before quant-training the AP is 88.26%, and with the int8 model after quant-training the AP is 88.29%, which is a very small improvement. With the int16 model before quant-training the AP is 94.83%. This is my new quant-train code:

```python
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import _init_paths

import os

import torch
import torch.utils.data
from opts import opts
from models.model import create_model, load_model, save_model
from models.data_parallel import DataParallel
from logger import Logger
from datasets.dataset_factory import get_dataset
from trains.train_factory import train_factory
from pytorch_jacinto_ai import xnn
import torch.backends.cudnn as cudnn
import torchsummary


def get_model_orig(model):
    is_parallel_model = isinstance(model, (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel))
    model_orig = (model.module if is_parallel_model else model)
    model_orig = (model_orig.module if isinstance(model_orig, (xnn.quantize.QuantBaseModule)) else model_orig)
    return model_orig


def main(opt):
    torch.manual_seed(opt.seed)
    torch.backends.cudnn.benchmark = not opt.not_cuda_benchmark and not opt.test
    Dataset = get_dataset(opt.dataset, opt.task)
    opt = opts().update_dataset_info_and_set_heads(opt, Dataset)
    print(opt)

    logger = Logger(opt)

    os.environ['CUDA_VISIBLE_DEVICES'] = opt.gpus_str
    opt.device = torch.device('cuda' if opt.gpus[0] >= 0 else 'cpu')

    print('Creating model...')
    model = create_model(opt.arch, opt.heads, opt.head_conv, opt.show_gmacs)
    dummy_input = torch.rand((1, 3, 512, 1024))
    model = xnn.quantize.QuantTrainModule(model, dummy_input=dummy_input)
    model_orig = get_model_orig(model)
    optimizer = torch.optim.Adam(model.parameters(), opt.lr)
    start_epoch = 0
    if opt.load_model != '':
        model_orig, optimizer, start_epoch = load_model(
            model_orig, opt.load_model, optimizer, opt.resume, opt.lr, opt.lr_step)

    Trainer = train_factory[opt.task]
    trainer = Trainer(opt, model, optimizer)
    trainer.set_device(opt.gpus, opt.chunk_sizes, opt.device)

    print('Setting up data...')
    val_loader = torch.utils.data.DataLoader(
        Dataset(opt, 'val'),
        batch_size=1,
        shuffle=False,
        num_workers=8,
        pin_memory=True
    )

    if opt.test:
        _, preds = trainer.val(0, val_loader)
        val_loader.dataset.run_eval(preds, opt.save_dir)
        return

    train_loader = torch.utils.data.DataLoader(
        Dataset(opt, 'train'),
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.num_workers,
        pin_memory=True,
        drop_last=True
    )

    print('Starting training...')
    best = 1e10
    for epoch in range(start_epoch + 1, opt.num_epochs + 1):
        mark = epoch if opt.save_all else 'last'
        log_dict_train, _ = trainer.train(epoch, train_loader)
        logger.write('epoch: {} |'.format(epoch))
        for k, v in log_dict_train.items():
            logger.scalar_summary('train_{}'.format(k), v, epoch)
            logger.write('{} {:8f} | '.format(k, v))
        model_orig = model.module if isinstance(model, (torch.nn.parallel.DistributedDataParallel, torch.nn.parallel.DataParallel)) else model
        model_orig = model_orig.module
        if opt.val_intervals > 0 and epoch % opt.val_intervals == 0:
            save_model(os.path.join(opt.save_dir, 'model_{}.pth'.format(mark)),
                       epoch, model_orig, optimizer)
            with torch.no_grad():
                log_dict_val, preds = trainer.val(epoch, val_loader)
            for k, v in log_dict_val.items():
                logger.scalar_summary('val_{}'.format(k), v, epoch)
                logger.write('{} {:8f} | '.format(k, v))
            if log_dict_val[opt.metric] < best:
                best = log_dict_val[opt.metric]
                dummy_input = torch.rand((1, 3, 512, 1024))
                dummy_input = dummy_input.to(opt.device)
                torch.onnx.export(model_orig, dummy_input, os.path.join(opt.save_dir, 'model_best.onnx'),
                                  export_params=True, verbose=False, do_constant_folding=True, opset_version=9)
                save_model(os.path.join(opt.save_dir, 'model_best.pth'),
                           epoch, model_orig)
        else:
            dummy_input = torch.rand((1, 3, 512, 1024))
            dummy_input = dummy_input.to(opt.device)
            torch.onnx.export(model_orig, dummy_input, os.path.join(opt.save_dir, 'model_last.onnx'),
                              export_params=True, verbose=False, do_constant_folding=True, opset_version=9)
            save_model(os.path.join(opt.save_dir, 'model_last.pth'),
                       epoch, model_orig, optimizer)
        logger.write('\n')
        if epoch in opt.lr_step:
            save_model(os.path.join(opt.save_dir, 'model_{}.pth'.format(epoch)),
                       epoch, model_orig, optimizer)
            lr = opt.lr * (0.1 ** (opt.lr_step.index(epoch) + 1))
            print('Drop LR to', lr)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
    logger.close()


if __name__ == '__main__':
    opt = opts().parse()
    main(opt)
```
Hi, what is the accuracy that you get with QAT in the PyTorch code during validation?
Also, please tell me the accuracy that you get with 8-bit quantization when you set calibrationOption = 7 in the TIDL import config. This should be with the original model (not the QAT model).
1. I don't calculate the accuracy with QAT in the PyTorch code during validation.
2. When I set calibrationOption = 7 in the TIDL import config, the AP is 86.43% (8-bit quantization, original model), and 16-bit inference provides accuracy close to floating point (94%).
- Have you followed the guidelines and restrictions in https://git.ti.com/cgit/jacinto-ai/pytorch-jacinto-ai-devkit/about/docs/Quantization.md? Typically when I ask this question everybody says YES, but when we dig deeper and look closer, many times they are not followed closely. So please double-check and make sure.
- What is the weight decay value being used? Please check whether this weight decay / regularization is being applied to all parameters (see the sketch after this list for one way to check).
- How many images do you use for import/calibration in TIDL? Can you try to use at least 50 to 100 diverse images from your training set and see if it improves the accuracy?
- When you use a regular float model in TIDL, calibrationOption=7 during TIDL import can often improve the accuracy. But when you use a QAT model in TIDL, always set calibrationOption=0.
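On the weight-decay point above: torch.optim.Adam applies no weight decay unless weight_decay is passed explicitly, and a quick way to verify what is actually applied is to print the optimizer's parameter groups. The snippet below is only an illustrative sketch; the toy model is hypothetical and not part of the devkit.

```python
# Illustrative check: report the weight-decay value applied to each parameter
# group of an optimizer, to verify that regularization covers the parameters
# you expect. Not part of pytorch-jacinto-ai-devkit.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())  # toy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

for i, group in enumerate(optimizer.param_groups):
    n_params = sum(p.numel() for p in group['params'])
    print('group {}: weight_decay={}, lr={}, num params={}'.format(
        i, group['weight_decay'], group['lr'], n_params))
```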
Thank you very much. I will check carefully and make sure.
Great! Let us know your progress.
Hi,
I am not knowledgeable about vision apps and how TIDL is integrated in the system. Can you try asking the question in the following forum: https://e2e.ti.com/support/processors/f/791/tags/TIDL
There are experts there who can help you with questions regarding vision apps as well as other system level examples.
I have checked my model structure carefully. The AP is still low compared with the floating point result. Have you used CenterNet for object detection? My backbone is ResNet18, the neck is FPN, and the head is the CenterNet head. Thank you.
mmdetection has promised that they will support CenterNet and I am looking forward to it being supported: https://github.com/open-mmlab/mmdetection/issues/2931 "We have heard the voice from the community for CenterNet, and will increase the priority in our roadmap. Hopefully we will introduce it to mmdet V2.4."
Hi, we have published an object detection training package called Pytorch-MMDetection, which is basically an extension of mmdetection. We support several low-complexity models, and if you study the accuracy vs. complexity (GigaMACS) trade-off, not many models in the public domain can match them at such low complexity. https://github.com/TexasInstruments/jacinto-ai-devkit https://git.ti.com/cgit/jacinto-ai/pytorch-mmdetection/about/
We support both SSD and RetinaNet training. You can also export the model into ONNX + prototxt that TIDL readily understands, so inference on TIDL is straightforward. The detection head in TIDL is being optimized and will be much faster for SSD and RetinaNet in the next release.
We have even better results (accuracy) than those listed in that repository, and we shall publish the new results soon (and, if possible, the models as well).
I am saying this so that if you want to train SSD or RetinaNet using our repository, you can do that.
Another option is for you to share your CenterNet ONNX model (both without and with QAT) and I can take a look at it. By the way, who is the FAE/Apps Support Engineer from TI that you are interacting with? I can also have a word with him on how best to support you.
Thank you, I will try the SSD.
@WangGangUCAS Have a look at our latest Object Detection results: https://git.ti.com/cgit/jacinto-ai/pytorch-mmdetection/about/docs/det_modelzoo.md
I believe they are quite competitive if you consider the complexity and accuracy.