sanchitintel
Hi @msaroufim, upon changing ```model, optimizer = ipex.optimize(model=model, optimizer=torch.optim.Adam(model.parameters()))``` to ```_, optimizer = ipex.optimize(model=model, optimizer=torch.optim.Adam(model.parameters()))``` at my end, I did see that the former was ~10%...
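For reference, here's a minimal sketch of the comparison (the toy model, sizes, and step count here are hypothetical, not the actual benchmark):

```python
import time

import torch
import intel_extension_for_pytorch as ipex

def bench(use_returned_model: bool, steps: int = 100) -> float:
    model = torch.nn.Linear(1024, 1024)
    opt_model, optimizer = ipex.optimize(
        model=model, optimizer=torch.optim.Adam(model.parameters())
    )
    # `model, optimizer = ...` trains the returned (copied) model,
    # while `_, optimizer = ...` keeps training the original one.
    run_model = opt_model if use_returned_model else model
    x = torch.randn(64, 1024)
    start = time.perf_counter()
    for _ in range(steps):
        optimizer.zero_grad()
        run_model(x).sum().backward()
        optimizer.step()
    return time.perf_counter() - start

print("returned model:", bench(True))
print("original model:", bench(False))
```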
Investigating why `4` was faster than `2` or `3`
Hi @msaroufim, when `ipex.optimize` is used, [`_copy_model_and_optimizer`](https://github.com/intel/intel-extension-for-pytorch/blob/main/intel_extension_for_pytorch/frontend.py#L41) is called if the model & optimizer can't be modified in place, which is the default case. This method is responsible for the speedup...
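If I'm reading the frontend correctly, the copy path can be skipped with `inplace=True`; a minimal sketch of the two paths (assuming the current `ipex.optimize` signature):

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())

# Default (inplace=False): _copy_model_and_optimizer deep-copies both
# objects before the optimizations are applied.
copied_model, copied_optimizer = ipex.optimize(model=model, optimizer=optimizer)

# inplace=True: the passed-in model & optimizer are modified directly,
# so no deep copy is made.
model, optimizer = ipex.optimize(model=model, optimizer=optimizer, inplace=True)
```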
Rather non-intuitively, deep-copying the optimizer results in the ~10% speedup for `4` over `2`/`3`. I verified this hypothesis by simply commenting out most of the code in `_copy_model_and_optimizer`. https://github.com/intel/intel-extension-for-pytorch/blob/main/intel_extension_for_pytorch/frontend.py#L46 @jgong5...
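A stripped-down way to isolate the effect outside IPEX (a hypothetical repro, not the code I used; deep-copying the `(model, optimizer)` pair together preserves the parameter sharing via `deepcopy`'s memo, which is roughly what `_copy_model_and_optimizer` achieves):

```python
import copy
import time

import torch

def bench(deep_copied: bool, steps: int = 200) -> float:
    model = torch.nn.Linear(1024, 1024)
    optimizer = torch.optim.Adam(model.parameters())
    if deep_copied:
        # Copying both in one call keeps optimizer.param_groups pointing
        # at the copied model's parameters.
        model, optimizer = copy.deepcopy((model, optimizer))
    x = torch.randn(64, 1024)
    start = time.perf_counter()
    for _ in range(steps):
        optimizer.zero_grad()
        model(x).sum().backward()
        optimizer.step()
    return time.perf_counter() - start

print("fresh objects:", bench(False))
print("deep-copied  :", bench(True))
```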
@jgong5 @zhuhaozhe @Guobing-Chen, one remaining issue is datapoint `1` being faster than datapoint `6` (i.e., PyTorch eager mode being faster than `torch.compile` with the unfused Adam optimizer), which might also result...
Setting the `OMP_NUM_THREADS` & `MKL_NUM_THREADS` environment variables (or using `torch.set_num_threads`) reduces the gap between `1` & `6` but doesn't eliminate it. I used something like the sketch below (in my `lscpu` output, cores 0-15...
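Roughly what I used, as a sketch (the core count and IDs here are from my machine and purely illustrative; the env vars must be set before `torch` is imported):

```python
import os

# Limit OpenMP/MKL to the 16 physical cores (IDs 0-15 on my machine).
os.environ["OMP_NUM_THREADS"] = "16"
os.environ["MKL_NUM_THREADS"] = "16"

import torch

torch.set_num_threads(16)

# Alternatively, pin the process itself at launch time, e.g.:
#   taskset -c 0-15 python benchmark.py
# or
#   numactl -C 0-15 python benchmark.py
```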
@msaroufim @jgong5, there are [graph breaks](https://github.com/pytorch/pytorch/blob/15529de90144fdf8681d518483b5acbe944ad2e4/docs/source/torch.compiler_profiling_torch_compile.rst#finding-graph-breaks-torch-compiled-region-and-compiledfunction) with `torch.compile` when an unfused optimizer is used. That's what's causing the overhead.
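The breaks can be surfaced with `torch._dynamo.explain`, sketched here on a toy model (on recent PyTorch versions this returns an `ExplainOutput` with break counts and reasons):

```python
import torch

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), fused=False)  # unfused Adam

def train_step(x):
    optimizer.zero_grad()
    model(x).sum().backward()
    optimizer.step()

# explain() traces the function and reports where Dynamo had to break.
explanation = torch._dynamo.explain(train_step)(torch.randn(64, 1024))
print(explanation.graph_break_count)
print(explanation.break_reasons)
```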
Hi @msaroufim, these graph breaks come from the PyTorch source code itself. As per https://github.com/pytorch/pytorch/issues/104053, they will be removed when solution 3 in that ticket (`the entire graph is an inference...
@pytorchbot rebase -b master
Hi @peterbell10 @mingfeima @jgong5, do we have some sort of list of ops that are still not vectorized? Thanks!