intel-extension-for-pytorch
Fused CPU Adam performance
Describe the issue
I'm trying to leverage a fast CPU Adam implementation and I've found several ways of doing so that provide slightly different performance. One setting is downright confusing as well, so I'm opening this issue to discuss.
Repro is here
Results
- Existing Adam optimizer time using PyTorch eager: 3.4665 seconds
- Fused Adam optimizer time using optimizer_fusion: 3.2542 seconds
- Fused Adam optimizer time using ipex_adam_step: 3.2268 seconds
- Fused Adam optimizer time using ipex.optimize but only optimize the optimizer: 2.7120 seconds
- Fused Adam optimizer time using ipex.optimize but optimize both the model and the optimizer: 3.3123 seconds (this makes no sense to me; see the sketch after these results)
- torch.compile optimizer time: 4.1160 seconds
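For concreteness, here is a minimal sketch (not the original repro) of the two `ipex.optimize` configurations behind datapoints 4 and 5. The model shape, batch size, and step count are placeholders:

```python
# Hypothetical sketch contrasting datapoints 4 and 5: discarding vs. keeping the
# model returned by ipex.optimize. Model size and step count are placeholders.
import time
import torch
import intel_extension_for_pytorch as ipex

def bench(use_returned_model: bool, steps: int = 1000) -> float:
    model = torch.nn.Linear(4096, 4096)
    opt = torch.optim.Adam(model.parameters())
    opt_model, opt = ipex.optimize(model, optimizer=opt)
    # Datapoint 4 discards the returned model (`_, optimizer = ipex.optimize(...)`);
    # datapoint 5 trains with the returned, optimized model.
    run_model = opt_model if use_returned_model else model
    x = torch.randn(64, 4096)
    start = time.time()
    for _ in range(steps):
        opt.zero_grad()
        run_model(x).sum().backward()
        opt.step()
    return time.time() - start

print("optimizer only (datapoint 4):", bench(use_returned_model=False))
print("model + optimizer (datapoint 5):", bench(use_returned_model=True))
```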
Experiments were performed on
(fresh) (base) ubuntu@ip-172-31-48-15:~/tinyoptimizer/cpu_optimizer/ipex$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8488C
Stepping: 8
CPU MHz: 2400.000
BogoMIPS: 4800.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 768 KiB
L1i cache: 512 KiB
L2 cache: 32 MiB
L3 cache: 105 MiB
NUMA node0 CPU(s): 0-31
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
@msaroufim What's your expected result?
This is the main one that's throwing me off
> Fused Adam optimizer time using ipex.optimize but only optimize the optimizer: 2.7120 seconds
> Fused Adam optimizer time using ipex.optimize but optimize both the model and the optimizer: 3.3123 seconds (this makes no sense to me)
To repro, replace `_` with `model` on this line: https://github.com/msaroufim/tinyoptimizer/blob/master/cpu_optimizer/ipex/class.py#L100
I'd like to understand the ballpark performance improvement I can expect from fused CPU Adam: is it around 10% or closer to 2x for my microbenchmark, and should I expect this pattern to change at larger model sizes?
> Fused Adam optimizer time using ipex.optimize but only optimize the optimizer: 2.7120 seconds
> Fused Adam optimizer time using ipex.optimize but optimize both the model and the optimizer: 3.3123 seconds (this makes no sense to me)
I don't think this is expected, and I suspect something else is going on here. Do you have profiler info? Perhaps we can look into the problem with it.
> I'd like to understand the ballpark performance improvement I can expect from fused CPU Adam: is it around 10% or closer to 2x for my microbenchmark, and should I expect this pattern to change at larger model sizes?
If we are talking about the Adam optimizer alone, 2x makes more sense to me for the fused one, but it depends on the model size: the larger the model, the more benefit we get from fusion.
cc @zhuhaozhe
I don't have any profile data available, but the results were reliably reproducing in the repro I linked in the original message. Let me know if there's any other info I can provide to make debugging this easier.
Profiling. Thanks!
Hi @msaroufim,
Upon changing `model, optimizer = ipex.optimize(model = model, optimizer=torch.optim.Adam(model.parameters()))` to `_, optimizer = ipex.optimize(model = model, optimizer=torch.optim.Adam(model.parameters()))`, at my end I did see that the former was ~10% slower than the latter for the model you used (only one linear layer), but the difference wasn't as significant as what you encountered.
Nevertheless, we'll try to fix this regression. Thanks!
Investigating why 4 was faster than 2 or 3.
Hi @msaroufim, when `ipex.optimize` is used, `_copy_model_and_optimizer` is called if the model & optimizer can't be modified in place, which is the default case.
This method is responsible for the speedup when `ipex.optimize` is used with fused Adam (datapoint 4 in the description, not referring to `FusedCPUAdam`), as opposed to datapoints 2 or 3, in which case this method is not called.
I'll figure out what precisely in this method is resulting in a speedup.
Thanks!
Rather non-intuitively, deep-copying the optimizer results in the ~10% speedup for 4 over 2/3. I verified this hypothesis by simply commenting out most of the code in `_copy_model_and_optimizer`.
https://github.com/intel/intel-extension-for-pytorch/blob/main/intel_extension_for_pytorch/frontend.py#L46
@jgong5 @zhuhaozhe, can you please elaborate on why that'd result in a speedup? Thanks!
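For anyone else trying to reproduce this, here is a minimal, hypothetical way to test the deep-copy hypothesis in isolation (not the original experiment): time `optimizer.step()` on the original optimizer versus on a `copy.deepcopy` of it. Sizes and step counts are arbitrary:

```python
# Hypothetical check of the deep-copy hypothesis: compare step() timings
# for an optimizer and a deep copy of it.
import copy
import time
import torch

model = torch.nn.Linear(4096, 4096)
opt = torch.optim.Adam(model.parameters())
model(torch.randn(8, 4096)).sum().backward()  # populate .grad so step() does real work
opt.step()                                    # warm-up: initialize optimizer state

opt_copy = copy.deepcopy(opt)                 # deep copy taken after state exists

def time_steps(optimizer, steps=200):
    start = time.time()
    for _ in range(steps):
        optimizer.step()
    return time.time() - start

print("original optimizer   :", time_steps(opt))
print("deep-copied optimizer:", time_steps(opt_copy))
```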
@jgong5 @zhuhaozhe @Guobing-Chen, one remaining issue is datapoint 1 being faster than datapoint 6 (i.e. PyTorch eager mode being faster than `torch.compile` for the unfused Adam optimizer), which might also mean that the eager-mode fused Adam optimizer would be faster than its `torch.compile` counterpart (once the fused Adam optimizer is enabled in PyTorch).
Setting the `OMP_NUM_THREADS` & `MKL_NUM_THREADS` environment variables (or using `torch.set_num_threads`) reduces the gap between 1 & 6 but doesn't eliminate it.
I used something like this (in the `lscpu` output, cores 0-15 were on the same socket, i.e. I only used one of the two logical cores per physical core):
`OMP_NUM_THREADS=16 MKL_NUM_THREADS=16 numactl --membind=0 --cpunodebind=0 -C 0-15 python script_name.py`
I had also preloaded Intel OpenMP (instead of GNU libgomp) & tcmalloc.
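For reference, a hypothetical in-script equivalent of the thread settings mentioned above; 16 here just mirrors the 16 physical cores on one socket of this machine:

```python
# Hypothetical in-script alternative to OMP_NUM_THREADS/MKL_NUM_THREADS:
# pin PyTorch's intra-op thread pool to the 16 physical cores.
import torch

torch.set_num_threads(16)
print(torch.get_num_threads())
```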
Benchmarking results with `torch.compile` (datapoint 6): https://gist.github.com/sanchitintel/c2ccda7bdd58be9c12ecf16fa4680f25
Benchmarking results with eager mode (datapoint 1): https://gist.github.com/sanchitintel/8789298ee88b013c2bfb4b99b36e22ef
@jgong5, with `torch.compile`, the bottleneck seems to be `Torch compiled region`, despite using `torch._inductor.config.cpp.enable_kernel_profile=True`.
@msaroufim @jgong5,
There are graph breaks with `torch.compile` when an unfused optimizer is used. That's what is resulting in the overhead.
Hi @msaroufim, these graph breaks are present in the PyTorch source code itself. As per https://github.com/pytorch/pytorch/issues/104053, they will be removed when solution 3 in that ticket ("the entire graph is an inference graph") is implemented. Thanks!
@jgong5 @Guobing-Chen, Dynamo logs pertaining to the graph breaks are at https://gist.github.com/sanchitintel/05b19b6d162cf5cdf5dbb174c51962ec. They were collected with the environment variable `TORCH_LOGS="+dynamo"`. Is a workaround possible? Otherwise, once the fused Adam optimizer is enabled in PyTorch, training with the eager-mode fused Adam optimizer may be faster than training with `torch.compile`.
Thanks!
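Besides `TORCH_LOGS="+dynamo"`, a hedged sketch of another way to surface graph breaks around the optimizer step is `torch._dynamo.explain`; the model and sizes below are placeholders:

```python
# Hypothetical sketch: count graph breaks around an (unfused) optimizer step
# using torch._dynamo.explain. Model size is a placeholder.
import torch
import torch._dynamo

model = torch.nn.Linear(1024, 1024)
opt = torch.optim.Adam(model.parameters())
model(torch.randn(8, 1024)).sum().backward()

def step_fn():
    opt.step()

explanation = torch._dynamo.explain(step_fn)()
print(explanation.graph_break_count)
print(explanation.break_reasons)
```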
Hi, @msaroufim, cc @sanchitintel.
For the IPEX fused optimizer, we actually expect users to apply it automatically via `ipex.optimize` instead of using `fused_adam_step` or `optimizer_fusion` directly. (We should perhaps rename them to `_optimizer_fusion` and `_fused_adam_step` to avoid misunderstanding.)
To benchmark only the optimizer, we have provided an example here: https://github.com/intel/intel-extension-for-pytorch/tree/main/tests/cpu/bench/custom_op_bench#evaluate-ipex-fused-optimizer
Btw, we have already upstreamed fused Adam/AdamW/Adagrad/SGD into PyTorch; do you need more help here? https://github.com/pytorch/pytorch/pull/124905 https://github.com/pytorch/pytorch/pull/123629 https://github.com/pytorch/pytorch/pull/123074
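Assuming a PyTorch build that includes those PRs, requesting the fused CPU path looks roughly like this:

```python
# Sketch assuming a PyTorch version containing the fused CPU optimizer PRs above.
import torch

model = torch.nn.Linear(4096, 4096)
opt = torch.optim.Adam(model.parameters(), fused=True)  # fused kernel, now also on CPU

loss = model(torch.randn(64, 4096)).sum()
loss.backward()
opt.step()
```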
I'm quite happy with the new fused eager kernel that's been upstreamed to PyTorch. Still not sure why compile is so slow, though, so I'll let @jgong5 decide where he wants to track this.
Hi, @msaroufim. I have some benchmark results comparing fused/non-fused/compile: https://github.com/zhuhaozhe/Misc/blob/main/bench-fused-optimizer/bench-result.md
For `compile`, I found that as the number of parameters gets larger, the `compile` results get worse. I have tried some approaches, manually modifying the generated code, but no insights have been found yet. We will keep tracking it: https://github.com/pytorch/pytorch/issues/123238
I will update this after we have more findings.
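For context, the `compile` datapoints in such comparisons are typically set up by compiling a function that wraps `optimizer.step()`; a minimal hedged sketch (sizes are placeholders):

```python
# Hedged sketch of compiling the optimizer step with torch.compile.
# The parameter count is a placeholder; larger models show the slowdown described above.
import torch

model = torch.nn.Linear(8192, 8192)
opt = torch.optim.Adam(model.parameters())
model(torch.randn(8, 8192)).sum().backward()

@torch.compile
def compiled_step():
    opt.step()

compiled_step()  # first call triggers compilation; subsequent calls reuse the compiled code
```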
Ok, sounds good, will close this in favor of the issue in PyTorch.