sanchitintel
Hi @msaroufim, upon changing ```model, optimizer = ipex.optimize(model=model, optimizer=torch.optim.Adam(model.parameters()))``` to ```_, optimizer = ipex.optimize(model=model, optimizer=torch.optim.Adam(model.parameters()))``` at my end, I did see that the former was ~10%...
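For reference, here's a minimal sketch of the comparison (the toy model, sizes, and step count here are hypothetical, not the actual benchmark):

```python
import time

import torch
import intel_extension_for_pytorch as ipex

def bench(use_returned_model: bool, steps: int = 100) -> float:
    model = torch.nn.Linear(1024, 1024)
    opt_model, optimizer = ipex.optimize(
        model=model, optimizer=torch.optim.Adam(model.parameters())
    )
    # `model, optimizer = ...` trains the returned (copied) model,
    # while `_, optimizer = ...` keeps training the original one.
    run_model = opt_model if use_returned_model else model
    x = torch.randn(64, 1024)
    start = time.perf_counter()
    for _ in range(steps):
        optimizer.zero_grad()
        run_model(x).sum().backward()
        optimizer.step()
    return time.perf_counter() - start

print("returned model:", bench(True))
print("original model:", bench(False))
```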
Investigating why `4` was faster than `2` or `3`
Hi @msaroufim, when `ipex.optimize` is used, [`_copy_model_and_optimizer`](https://github.com/intel/intel-extension-for-pytorch/blob/main/intel_extension_for_pytorch/frontend.py#L41) is called if the model & optimizer can't be modified in place, which is the default case. This method is responsible for the speedup...
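If I'm reading the frontend correctly, the copy path can be skipped with `inplace=True`; a minimal sketch of the two paths (assuming the current `ipex.optimize` signature):

```python
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters())

# Default (inplace=False): _copy_model_and_optimizer deep-copies both
# objects before the optimizations are applied.
copied_model, copied_optimizer = ipex.optimize(model=model, optimizer=optimizer)

# inplace=True: the passed-in model & optimizer are modified directly,
# so no deep copy is made.
model, optimizer = ipex.optimize(model=model, optimizer=optimizer, inplace=True)
```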
Rather non-intuitively, deep-copying the optimizer results in the ~10% speedup for `4` over `2`/`3`. I verified this hypothesis by simply commenting out most of the code in `_copy_model_and_optimizer`. https://github.com/intel/intel-extension-for-pytorch/blob/main/intel_extension_for_pytorch/frontend.py#L46 @jgong5...
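A stripped-down way to isolate the effect outside IPEX (a hypothetical repro, not the code I used; deep-copying the `(model, optimizer)` pair together preserves the parameter sharing via `deepcopy`'s memo, which is roughly what `_copy_model_and_optimizer` achieves):

```python
import copy
import time

import torch

def bench(deep_copied: bool, steps: int = 200) -> float:
    model = torch.nn.Linear(1024, 1024)
    optimizer = torch.optim.Adam(model.parameters())
    if deep_copied:
        # Copying both in one call keeps optimizer.param_groups pointing
        # at the copied model's parameters.
        model, optimizer = copy.deepcopy((model, optimizer))
    x = torch.randn(64, 1024)
    start = time.perf_counter()
    for _ in range(steps):
        optimizer.zero_grad()
        model(x).sum().backward()
        optimizer.step()
    return time.perf_counter() - start

print("fresh objects:", bench(False))
print("deep-copied  :", bench(True))
```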
@jgong5 @zhuhaozhe @Guobing-Chen, one remaining issue is datapoint `1` being faster than datapoint `6` (i.e., PyTorch eager mode being faster than `torch.compile` with the unfused Adam optimizer), which might also result...
Setting the `OMP_NUM_THREADS` & `MKL_NUM_THREADS` environment variables (or using `torch.set_num_threads`) reduces the gap between `1` & `6` but doesn't eliminate it. I used something like the sketch below (in my `lscpu` output, cores 0-15...
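Roughly what I used, as a sketch (the core count and IDs here are from my machine and purely illustrative; the env vars must be set before `torch` is imported):

```python
import os

# Limit OpenMP/MKL to the 16 physical cores (IDs 0-15 on my machine).
os.environ["OMP_NUM_THREADS"] = "16"
os.environ["MKL_NUM_THREADS"] = "16"

import torch

torch.set_num_threads(16)

# Alternatively, pin the process itself at launch time, e.g.:
#   taskset -c 0-15 python benchmark.py
# or
#   numactl -C 0-15 python benchmark.py
```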
@msaroufim @jgong5, there are [graph breaks](https://github.com/pytorch/pytorch/blob/15529de90144fdf8681d518483b5acbe944ad2e4/docs/source/torch.compiler_profiling_torch_compile.rst#finding-graph-breaks-torch-compiled-region-and-compiledfunction) with `torch.compile` when an unfused optimizer is used. That's what's causing the overhead.
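The breaks can be surfaced with `torch._dynamo.explain`, sketched here on a toy model (on recent PyTorch versions this returns an `ExplainOutput` with break counts and reasons):

```python
import torch

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), fused=False)  # unfused Adam

def train_step(x):
    optimizer.zero_grad()
    model(x).sum().backward()
    optimizer.step()

# explain() traces the function and reports where Dynamo had to break.
explanation = torch._dynamo.explain(train_step)(torch.randn(64, 1024))
print(explanation.graph_break_count)
print(explanation.break_reasons)
```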
Hi @msaroufim, these graph breaks come from the PyTorch source code itself. As per https://github.com/pytorch/pytorch/issues/104053, they will be removed when solution 3 in that ticket (`the entire graph is an inference...
@pytorchbot rebase -b master
Hi @peterbell10 @mingfeima @jgong5, do we have some sort of list of ops that are still not vectorized? Thanks!