loopy
Loopy is slow in make_kernel, preprocess_kernel and codegen
Thanks to @kaushikcfd, scheduling is now very fast compared to the other parts of loopy. Still, make_kernel, preprocess_kernel, and codegen take so long that some sumpy kernels are unusable.
Here's a small example using https://github.com/isuruf/sumpy/tree/derivtaker:
import numpy as np
import sys
import loopy as lp
import pyopencl as cl
from sumpy.expansion.multipole import LaplaceConformingVolumeTaylorMultipoleExpansion
from sumpy.expansion.local import LaplaceConformingVolumeTaylorLocalExpansion
from sumpy.kernel import LaplaceKernel
import sumpy.symbolic as sym
import logging
logger = logging.getLogger(__name__)
try:
    import faulthandler
except ImportError:
    pass
else:
    faulthandler.enable()
knl = LaplaceKernel(3)
local_expn_class = LaplaceConformingVolumeTaylorLocalExpansion
mpole_expn_class = LaplaceConformingVolumeTaylorMultipoleExpansion
order = 12
ctx_factory = cl._csc
logging.basicConfig(level=logging.INFO)
from sympy.core.cache import clear_cache
clear_cache()
ctx = ctx_factory()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
np.random.seed(17)
target_kernels = [knl]
m_expn = mpole_expn_class(knl, order=order)
l_expn = local_expn_class(knl, order=order)
from sumpy import P2EFromSingleBox, E2PFromSingleBox, P2P, E2EFromCSR
m2l = E2EFromCSR(ctx, m_expn, l_expn)
loopy_knl = m2l.get_optimized_kernel()
loopy_knl = lp.add_and_infer_dtypes(
    loopy_knl,
    dict(
        tgt_ibox=np.int32,
        centers=np.float64,
        tgt_center=np.float64,
        target_boxes=np.int32,
        src_ibox=np.int32,
        src_expansions=np.float64,
        tgt_rscale=np.float64,
        src_rscale=np.float64,
        src_box_starts=np.int32,
        src_box_lists=np.int32,
    ),
)
lp.generate_code_v2(loopy_knl)
This is the profile log, sorted by cumulative time. There doesn't seem to be any obvious low-hanging fruit in this case:
         146881842 function calls (140188954 primitive calls) in 97.556 seconds
   Ordered by: cumulative time
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.019    0.019   96.991   96.991 loopy/loopy/codegen/__init__.py:404(generate_code_v2)
  1588528    1.643    0.000   41.446    0.000 pymbolic/pymbolic/mapper/__init__.py:109(__call__)
        1    0.000    0.000   36.843   36.843 loopy/loopy/schedule/__init__.py:2134(get_one_scheduled_kernel)
        1    0.000    0.000   36.843   36.843 loopy/loopy/schedule/__init__.py:2143(get_one_linearized_kernel)
        1    0.000    0.000   36.842   36.842 loopy/loopy/schedule/__init__.py:2121(_get_one_scheduled_kernel_inner)
        2    0.003    0.001   36.842   18.421 loopy/loopy/schedule/__init__.py:1945(generate_loop_schedules_inner)
        1    0.012    0.012   36.569   36.569 loopy/loopy/preprocess.py:2030(preprocess_kernel)
   107846    0.113    0.000   36.025    0.001 {built-in method builtins.next}
        2    0.000    0.000   35.952   17.976 loopy/loopy/schedule/__init__.py:1929(generate_loop_schedules)
  9335052    2.563    0.000   32.603    0.000 pytools/__init__.py:675(wrapper)
        1    0.000    0.000   30.326   30.326 loopy/loopy/transform/iname.py:1218(wrapper)
        1    0.084    0.084   28.915   28.915 loopy/loopy/preprocess.py:881(realize_reduction)
      507    0.001    0.000   23.352    0.046 loopy/loopy/symbolic.py:1815(map_reduction)
      169    0.002    0.000   23.348    0.138 loopy/loopy/preprocess.py:1690(map_reduction)
      169    0.006    0.000   23.318    0.138 loopy/loopy/preprocess.py:1004(map_reduction_seq)
     7868   14.844    0.002   23.292   11.646 loopy/loopy/schedule/__init__.py:807(generate_loop_schedules_internal)
      169    0.000    0.000   23.278    0.138 loopy/loopy/kernel/tools.py:1655(find_most_recent_global_barrier)
      169    3.350    0.020   23.087    0.137 loopy/loopy/kernel/tools.py:1590(get_global_barrier_order)
If anyone wishes to reproduce this, here is the script.
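For anyone reproducing this, a log like the cProfile one above can be collected with the standard library alone. This is a minimal sketch, not necessarily the invocation used for the numbers in this thread. `profile_by_cumtime` and `_stand_in_workload` are hypothetical names; swap the stand-in for the `lp.generate_code_v2(loopy_knl)` call from the example.

```python
import cProfile
import io
import pstats


def profile_by_cumtime(func, *args, limit=20, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()

    # Render the statistics into a string instead of printing to stdout.
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, buf.getvalue()


def _stand_in_workload(n):
    # Hypothetical placeholder for the expensive call being profiled,
    # e.g. lp.generate_code_v2(loopy_knl) in the example above.
    return sum(i * i for i in range(n))


result, report = profile_by_cumtime(_stand_in_workload, 10_000)
print(report)
```

The `limit` argument only trims the printed table; the profiler still records every call.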
Here's the pyinstrument profile:
136.566 <module> loopy_reproduce.py:1
├─ 73.198 generate_code_v2 loopy/codegen/__init__.py:404
│ ├─ 32.329 preprocess_kernel loopy/preprocess.py:2030
│ │ ├─ 26.890 wrapper loopy/transform/iname.py:1218
│ │ │ └─ 25.818 realize_reduction loopy/preprocess.py:881
│ │ │ ├─ 22.571 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [162 frames hidden] pymbolic
│ │ │ │ 21.620 map_reduction loopy/symbolic.py:1815
│ │ │ │ └─ 21.620 map_reduction loopy/preprocess.py:1690
│ │ │ │ └─ 21.579 map_reduction_seq loopy/preprocess.py:1004
│ │ │ │ └─ 21.565 wrapper pytools/__init__.py:675
│ │ │ │ └─ 21.563 find_most_recent_global_barrier loopy/kernel/tools.py:1655
│ │ │ │ └─ 21.562 wrapper pytools/__init__.py:675
│ │ │ │ └─ 21.314 get_global_barrier_order loopy/kernel/tools.py:1590
│ │ │ │ ├─ 11.304 compute_topological_order pytools/graph.py:210
│ │ │ │ │ ├─ 7.439 [self]
│ │ │ │ │ ├─ 1.473 __lt__ pytools/graph.py:206
│ │ │ │ │ └─ 1.424 dict.get <built-in>:0
│ │ │ │ ├─ 3.834 <listcomp> loopy/kernel/tools.py:1606
│ │ │ │ │ └─ 3.571 _is_global_barrier loopy/kernel/tools.py:1583
│ │ │ │ ├─ 2.835 [self]
│ │ │ │ └─ 2.336 <dictcomp> loopy/kernel/tools.py:1597
│ │ │ └─ 3.122 replace_instruction_ids loopy/transform/instruction.py:172
│ │ │ └─ 2.455 [self]
│ │ └─ 1.811 realize_ilp loopy/preprocess.py:1965
│ │ └─ 1.811 privatize_temporaries_with_inames loopy/transform/privatize.py:72
│ ├─ 24.813 generate_host_or_device_program loopy/codegen/result.py:286
│ │ └─ 24.804 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 24.702 build_insn_group loopy/codegen/control.py:330
│ │ └─ 24.702 gen_code loopy/codegen/control.py:456
│ │ └─ 24.702 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 24.675 generate_host_or_device_program loopy/codegen/result.py:286
│ │ └─ 23.941 set_up_hw_parallel_loops loopy/codegen/loop.py:231
│ │ └─ 23.894 set_up_hw_parallel_loops loopy/codegen/loop.py:231
│ │ └─ 23.883 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 23.807 build_insn_group loopy/codegen/control.py:330
│ │ └─ 23.786 gen_code loopy/codegen/control.py:456
│ │ └─ 23.786 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 23.786 generate_sequential_loop_dim_code loopy/codegen/loop.py:347
│ │ └─ 23.766 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 23.702 build_insn_group loopy/codegen/control.py:330
│ │ └─ 23.447 build_insn_group loopy/codegen/control.py:330
│ │ └─ 23.415 build_insn_group loopy/codegen/control.py:330
│ │ └─ 22.938 gen_code loopy/codegen/control.py:456
│ │ └─ 22.938 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 22.938 generate_sequential_loop_dim_code loopy/codegen/loop.py:347
│ │ └─ 22.911 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 22.466 build_insn_group loopy/codegen/control.py:330
│ │ ├─ 13.254 gen_code loopy/codegen/control.py:456
│ │ │ └─ 13.243 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ │ └─ 13.147 try_vectorized loopy/codegen/__init__.py:336
│ │ │ └─ 13.139 <lambda> loopy/codegen/control.py:170
│ │ │ └─ 13.138 generate_instruction_code loopy/codegen/instruction.py:74
│ │ │ ├─ 11.017 to_codegen_result loopy/codegen/instruction.py:34
│ │ │ │ ├─ 6.939 align_two islpy/__init__.py:1224
│ │ │ │ │ [220 frames hidden] islpy
│ │ │ │ └─ 3.207 wrapper islpy/__init__.py:911
│ │ │ │ [68 frames hidden] islpy
│ │ │ │ 3.078 gist islpy/_isl.py:59605
│ │ │ │ └─ 2.801 Lib.isl_set_gist <built-in>:0
│ │ │ └─ 2.024 generate_assignment_instruction_code loopy/codegen/instruction.py:102
│ │ │ └─ 1.813 emit_assignment loopy/target/c/__init__.py:868
│ │ │ └─ 1.575 __call__ loopy/target/c/codegen/expression.py:118
│ │ │ └─ 1.506 rec loopy/target/c/codegen/expression.py:110
│ │ └─ 9.191 build_insn_group loopy/codegen/control.py:330
│ │ └─ 9.151 build_insn_group loopy/codegen/control.py:330
│ │ └─ 9.110 build_insn_group loopy/codegen/control.py:330
│ │ └─ 9.108 gen_code loopy/codegen/control.py:456
│ │ └─ 9.106 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 9.034 try_vectorized loopy/codegen/__init__.py:336
│ │ └─ 9.030 <lambda> loopy/codegen/control.py:170
│ │ └─ 9.028 generate_instruction_code loopy/codegen/instruction.py:74
│ │ ├─ 5.083 to_codegen_result loopy/codegen/instruction.py:34
│ │ │ └─ 3.875 align_two islpy/__init__.py:1224
│ │ │ [221 frames hidden] islpy
│ │ └─ 3.887 generate_assignment_instruction_code loopy/codegen/instruction.py:102
│ │ └─ 3.723 emit_assignment loopy/target/c/__init__.py:868
│ │ └─ 3.603 __call__ loopy/target/c/codegen/expression.py:118
│ │ └─ 3.572 rec loopy/target/c/codegen/expression.py:110
│ │ ├─ 2.043 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.668 map_sum loopy/target/c/codegen/expression.py:561
│ │ │ └─ 1.667 base_impl loopy/target/c/codegen/expression.py:562
│ │ │ └─ 1.667 map_sum pymbolic/mapper/__init__.py:398
│ │ │ [16 frames hidden] pymbolic
│ │ │ 1.599 <genexpr> pymbolic/mapper/__init__.py:401
│ │ │ └─ 1.578 rec loopy/target/c/codegen/expression.py:110
│ │ │ └─ 1.564 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.526 map_product loopy/target/c/codegen/expression.py:610
│ │ │ └─ 1.518 base_impl loopy/target/c/codegen/expression.py:611
│ │ │ └─ 1.494 map_product pymbolic/mapper/__init__.py:403
│ │ │ [32 frames hidden] pymbolic
│ │ └─ 1.514 infer_type loopy/target/c/codegen/expression.py:78
│ │ └─ 1.483 __call__ loopy/type_inference.py:60
│ │ └─ 1.472 __call__ pymbolic/mapper/__init__.py:114
│ │ [2 frames hidden] pymbolic
│ │ 1.468 map_sum loopy/type_inference.py:170
│ └─ 14.247 get_one_scheduled_kernel loopy/schedule/__init__.py:2134
│ └─ 14.247 get_one_linearized_kernel loopy/schedule/__init__.py:2143
│ └─ 14.246 _get_one_scheduled_kernel_inner loopy/schedule/__init__.py:2121
│ └─ 14.206 generate_loop_schedules loopy/schedule/__init__.py:1929
│ └─ 14.206 generate_loop_schedules_inner loopy/schedule/__init__.py:1945
│ ├─ 10.221 pre_schedule_checks loopy/check.py:799
│ │ ├─ 5.407 check_variable_access_ordered loopy/check.py:762
│ │ │ └─ 5.407 _check_variable_access_ordered_inner loopy/check.py:604
│ │ │ └─ 3.656 do_access_ranges_overlap_conservative loopy/symbolic.py:2194
│ │ │ └─ 2.114 _get_access_range_for_var loopy/symbolic.py:2179
│ │ │ └─ 1.982 wrapper pytools/__init__.py:675
│ │ │ └─ 1.930 _get_access_ranges loopy/symbolic.py:2154
│ │ │ └─ 1.819 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.817 map_subscript loopy/symbolic.py:2049
│ │ │ └─ 1.765 get_access_map loopy/symbolic.py:1906
│ │ ├─ 1.720 check_for_integer_subscript_indices loopy/check.py:114
│ │ │ └─ 1.679 __call__ loopy/type_inference.py:60
│ │ │ └─ 1.671 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [3 frames hidden] pymbolic
│ │ │ 1.650 map_sum loopy/type_inference.py:170
│ │ │ └─ 1.448 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ └─ 1.610 check_bounds loopy/check.py:460
│ └─ 2.507 insert_barriers loopy/schedule/__init__.py:1776
│ └─ 2.125 insert_barriers loopy/schedule/__init__.py:1776
│ └─ 1.438 insert_barriers_at_outer_level loopy/schedule/__init__.py:1789
├─ 52.686 get_optimized_kernel sumpy/e2e.py:127
│ ├─ 47.370 get_kernel sumpy/e2e.py:146
│ │ ├─ 25.594 make_kernel loopy/kernel/creation.py:1821
│ │ │ ├─ 7.046 duplicate_inames loopy/transform/iname.py:818
│ │ │ │ ├─ 3.649 map_kernel loopy/symbolic.py:995
│ │ │ │ │ └─ 3.645 <listcomp> loopy/symbolic.py:1000
│ │ │ │ │ └─ 3.568 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ │ └─ 2.990 <lambda> loopy/symbolic.py:1002
│ │ │ │ │ └─ 2.975 __call__ loopy/symbolic.py:981
│ │ │ │ │ └─ 2.742 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ [184 frames hidden] pymbolic
│ │ │ │ └─ 3.377 finish_kernel loopy/symbolic.py:899
│ │ │ │ └─ 3.376 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ │ │ │ └─ 3.376 <listcomp> loopy/symbolic.py:792
│ │ │ │ └─ 3.361 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 2.770 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [171 frames hidden] pymbolic
│ │ │ ├─ 4.475 fix_parameters loopy/transform/parameter.py:134
│ │ │ │ └─ 4.475 _fix_parameter loopy/transform/parameter.py:67
│ │ │ │ ├─ 2.478 map_kernel loopy/symbolic.py:995
│ │ │ │ │ └─ 2.203 <listcomp> loopy/symbolic.py:1000
│ │ │ │ │ └─ 2.192 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ │ └─ 1.891 <lambda> loopy/symbolic.py:1002
│ │ │ │ │ └─ 1.882 __call__ loopy/symbolic.py:981
│ │ │ │ │ └─ 1.780 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ [163 frames hidden] pymbolic
│ │ │ │ └─ 1.703 finish_kernel loopy/symbolic.py:899
│ │ │ │ └─ 1.703 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ │ │ │ └─ 1.703 <listcomp> loopy/symbolic.py:792
│ │ │ │ └─ 1.693 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.394 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [168 frames hidden] pymbolic
│ │ │ ├─ 2.530 determine_shapes_of_temporaries loopy/kernel/creation.py:1512
│ │ │ │ └─ 1.912 find_shapes_of_vars loopy/kernel/creation.py:1463
│ │ │ │ └─ 1.880 feed_all_expressions loopy/kernel/creation.py:1523
│ │ │ │ └─ 1.871 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.587 <lambda> loopy/kernel/creation.py:1526
│ │ │ │ └─ 1.584 run_through_armap loopy/kernel/creation.py:1469
│ │ │ │ └─ 1.564 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [218 frames hidden] pymbolic
│ │ │ ├─ 2.095 __init__ loopy/kernel/creation.py:1080
│ │ │ ├─ 1.769 guess_arg_shape_if_requested loopy/kernel/creation.py:1610
│ │ │ │ └─ 1.769 guess_var_shape loopy/kernel/tools.py:985
│ │ │ │ └─ 1.758 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.453 run_through_armap loopy/kernel/tools.py:992
│ │ │ ├─ 1.690 guess_kernel_args_if_requested loopy/kernel/creation.py:1170
│ │ │ │ └─ 1.670 make_new_arg loopy/kernel/creation.py:1132
│ │ │ │ └─ 1.670 find_index_rank loopy/kernel/creation.py:1116
│ │ │ │ └─ 1.660 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.392 run_irf loopy/kernel/creation.py:1119
│ │ │ │ └─ 1.368 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [220 frames hidden] pymbolic
│ │ │ └─ 1.464 expand_cses loopy/kernel/creation.py:1321
│ │ └─ 21.709 get_translation_loopy_insns sumpy/e2e.py:91
│ │ ├─ 16.169 to_loopy_insns sumpy/codegen.py:679
│ │ │ ├─ 8.331 <listcomp> sumpy/codegen.py:731
│ │ │ │ └─ 7.319 convert_expr sumpy/codegen.py:712
│ │ │ │ └─ 7.236 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [187 frames hidden] pymbolic
│ │ │ └─ 5.620 kill_trivial_assignments sumpy/codegen.py:161
│ │ │ ├─ 2.872 substitute pymbolic/mapper/substitutor.py:72
│ │ │ │ [212 frames hidden] pymbolic
│ │ │ │ 1.436 dict.copy <built-in>:0
│ │ │ └─ 1.480 make_one_step_subst sumpy/codegen.py:78
│ │ └─ 4.305 run_global_cse sumpy/assignment_collection.py:164
│ │ └─ 4.291 cse sumpy/cse.py:550
│ │ └─ 3.400 opt_cse sumpy/cse.py:357
│ │ └─ 2.921 match_common_args sumpy/cse.py:266
│ └─ 5.299 split_iname loopy/transform/iname.py:334
│ └─ 5.294 _split_iname_backend loopy/transform/iname.py:211
│ ├─ 2.243 map_kernel loopy/symbolic.py:995
│ │ └─ 1.868 <listcomp> loopy/symbolic.py:1000
│ │ └─ 1.853 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ └─ 1.560 <lambda> loopy/symbolic.py:1002
│ │ └─ 1.552 __call__ loopy/symbolic.py:981
│ │ └─ 1.442 __call__ pymbolic/mapper/__init__.py:114
│ │ [161 frames hidden] pymbolic
│ └─ 1.687 finish_kernel loopy/symbolic.py:899
│ └─ 1.687 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ └─ 1.687 <listcomp> loopy/symbolic.py:792
│ └─ 1.673 with_transformed_expressions loopy/kernel/instruction.py:872
│ └─ 1.375 __call__ pymbolic/mapper/__init__.py:114
│ [173 frames hidden] pymbolic
└─ 8.944 add_and_infer_dtypes loopy/kernel/tools.py:106
└─ 8.937 infer_unknown_types loopy/type_inference.py:485
├─ 4.969 <dictcomp> loopy/type_inference.py:527
│ └─ 4.954 <setcomp> loopy/type_inference.py:528
│ └─ 4.709 [self]
└─ 3.164 _infer_var_type loopy/type_inference.py:407
└─ 1.823 __call__ loopy/type_inference.py:60
└─ 1.812 __call__ pymbolic/mapper/__init__.py:114
[2 frames hidden] pymbolic
1.734 map_sum loopy/type_inference.py:170
└─ 1.523 __call__ pymbolic/mapper/__init__.py:114
[2 frames hidden] pymbolic
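Since the same callee shows up under several branches of the tree (align_two, for instance), it can help to total its time across branches when hunting for hot spots. Here is a rough sketch, assuming each frame line ends in `<seconds> <function> <file>:<line>` as in the output above. `total_time_by_function` is a hypothetical helper, not part of pyinstrument, and the sample lines are copied from the tree above.

```python
import re
from collections import defaultdict

# Matches the trailing "<seconds> <function> <file>:<line>" part of a line,
# ignoring any tree-drawing characters in front of it.
FRAME_RE = re.compile(r"(\d+\.\d+)\s+(\S+)\s+(\S+):(\d+)\s*$")


def total_time_by_function(lines):
    """Sum the reported seconds per (function, file) over all tree branches.

    Caveat: a function nested inside itself (e.g. the recursive
    build_insn_group frames) would be counted more than once, so this is
    only meaningful for callees that do not appear under themselves.
    """
    totals = defaultdict(float)
    for line in lines:
        match = FRAME_RE.search(line)
        if match is None:
            continue  # "[N frames hidden]" and "[self]" rows carry no file:line
        seconds, func, filename, _lineno = match.groups()
        totals[(func, filename)] += float(seconds)
    return dict(totals)


sample = [
    "│ │ │ │ ├─ 6.939 align_two islpy/__init__.py:1224",
    "│ │ │ │ │ [220 frames hidden] islpy",
    "│ │ │ └─ 3.875 align_two islpy/__init__.py:1224",
    "│ │ │ [221 frames hidden] islpy",
]
totals = total_time_by_function(sample)
for (func, filename), seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{seconds:8.3f}  {func}  {filename}")
```

Feeding the whole tree through the helper gives a flat ranking, which is easier to scan than the nested view when a callee is spread over many branches.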
After a couple of improvements to loopy and sumpy (derivtaker branch), the total drops from 136.566 s to 84.866 s, and the pyinstrument output is now:
84.866 <module> loopy_reproduce.py:1
├─ 39.403 generate_code_v2 loopy/codegen/__init__.py:404
│ ├─ 16.501 generate_host_or_device_program loopy/codegen/result.py:286
│ │ └─ 16.494 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 16.419 build_insn_group loopy/codegen/control.py:330
│ │ └─ 16.419 gen_code loopy/codegen/control.py:456
│ │ └─ 16.418 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 16.401 generate_host_or_device_program loopy/codegen/result.py:286
│ │ └─ 15.954 set_up_hw_parallel_loops loopy/codegen/loop.py:231
│ │ └─ 15.923 set_up_hw_parallel_loops loopy/codegen/loop.py:231
│ │ └─ 15.916 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 15.869 build_insn_group loopy/codegen/control.py:330
│ │ └─ 15.859 gen_code loopy/codegen/control.py:456
│ │ └─ 15.859 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 15.859 generate_sequential_loop_dim_code loopy/codegen/loop.py:347
│ │ └─ 15.841 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 15.797 build_insn_group loopy/codegen/control.py:330
│ │ └─ 15.188 build_insn_group loopy/codegen/control.py:330
│ │ └─ 15.155 build_insn_group loopy/codegen/control.py:330
│ │ └─ 14.651 gen_code loopy/codegen/control.py:456
│ │ └─ 14.651 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 14.650 generate_sequential_loop_dim_code loopy/codegen/loop.py:347
│ │ └─ 14.629 build_loop_nest loopy/codegen/control.py:218
│ │ └─ 14.391 build_insn_group loopy/codegen/control.py:330
│ │ ├─ 8.649 build_insn_group loopy/codegen/control.py:330
│ │ │ └─ 8.607 build_insn_group loopy/codegen/control.py:330
│ │ │ └─ 8.563 build_insn_group loopy/codegen/control.py:330
│ │ │ └─ 8.561 gen_code loopy/codegen/control.py:456
│ │ │ └─ 8.559 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ │ └─ 8.538 try_vectorized loopy/codegen/__init__.py:336
│ │ │ └─ 8.537 <lambda> loopy/codegen/control.py:170
│ │ │ └─ 8.537 generate_instruction_code loopy/codegen/instruction.py:74
│ │ │ ├─ 5.304 to_codegen_result loopy/codegen/instruction.py:34
│ │ │ │ ├─ 3.620 align_two islpy/__init__.py:1224
│ │ │ │ │ [218 frames hidden] islpy
│ │ │ │ └─ 1.243 wrapper islpy/__init__.py:911
│ │ │ │ [63 frames hidden] islpy
│ │ │ │ 1.194 gist islpy/_isl.py:59605
│ │ │ │ └─ 1.140 Lib.isl_set_gist <built-in>:0
│ │ │ └─ 3.211 generate_assignment_instruction_code loopy/codegen/instruction.py:102
│ │ │ └─ 3.165 emit_assignment loopy/target/c/__init__.py:868
│ │ │ └─ 3.120 __call__ loopy/target/c/codegen/expression.py:118
│ │ │ └─ 3.103 rec loopy/target/c/codegen/expression.py:110
│ │ │ ├─ 1.642 infer_type loopy/target/c/codegen/expression.py:78
│ │ │ │ └─ 1.633 __call__ loopy/type_inference.py:60
│ │ │ │ └─ 1.627 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [2 frames hidden] pymbolic
│ │ │ │ 1.622 map_sum loopy/type_inference.py:170
│ │ │ │ └─ 1.511 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [3 frames hidden] pymbolic
│ │ │ │ 1.432 map_sum loopy/type_inference.py:170
│ │ │ └─ 1.458 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.228 map_sum loopy/target/c/codegen/expression.py:561
│ │ │ └─ 1.226 base_impl loopy/target/c/codegen/expression.py:562
│ │ │ └─ 1.226 map_sum pymbolic/mapper/__init__.py:398
│ │ │ [17 frames hidden] pymbolic
│ │ │ 1.143 <genexpr> pymbolic/mapper/__init__.py:401
│ │ │ └─ 1.126 rec loopy/target/c/codegen/expression.py:110
│ │ │ └─ 1.111 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [2 frames hidden] pymbolic
│ │ │ 1.072 map_product loopy/target/c/codegen/expression.py:610
│ │ │ └─ 1.058 base_impl loopy/target/c/codegen/expression.py:611
│ │ │ └─ 1.045 map_product pymbolic/mapper/__init__.py:403
│ │ │ [32 frames hidden] pymbolic
│ │ └─ 5.729 gen_code loopy/codegen/control.py:456
│ │ └─ 5.725 generate_code_for_sched_index loopy/codegen/control.py:67
│ │ └─ 5.707 try_vectorized loopy/codegen/__init__.py:336
│ │ └─ 5.707 <lambda> loopy/codegen/control.py:170
│ │ └─ 5.707 generate_instruction_code loopy/codegen/instruction.py:74
│ │ └─ 5.170 to_codegen_result loopy/codegen/instruction.py:34
│ │ ├─ 3.086 align_two islpy/__init__.py:1224
│ │ │ [219 frames hidden] islpy
│ │ └─ 1.756 wrapper islpy/__init__.py:911
│ │ [50 frames hidden] islpy
│ │ 1.716 gist islpy/_isl.py:59605
│ │ └─ 1.686 Lib.isl_set_gist <built-in>:0
│ ├─ 11.540 get_one_scheduled_kernel loopy/schedule/__init__.py:2134
│ │ └─ 11.540 get_one_linearized_kernel loopy/schedule/__init__.py:2143
│ │ └─ 11.539 _get_one_scheduled_kernel_inner loopy/schedule/__init__.py:2121
│ │ └─ 11.503 generate_loop_schedules loopy/schedule/__init__.py:1929
│ │ └─ 11.503 generate_loop_schedules_inner loopy/schedule/__init__.py:1945
│ │ ├─ 9.040 pre_schedule_checks loopy/check.py:799
│ │ │ ├─ 5.438 check_variable_access_ordered loopy/check.py:762
│ │ │ │ └─ 5.438 _check_variable_access_ordered_inner loopy/check.py:604
│ │ │ │ ├─ 3.505 do_access_ranges_overlap_conservative loopy/symbolic.py:2194
│ │ │ │ │ ├─ 2.047 _get_access_range_for_var loopy/symbolic.py:2179
│ │ │ │ │ │ └─ 1.882 wrapper pytools/__init__.py:675
│ │ │ │ │ │ └─ 1.824 _get_access_ranges loopy/symbolic.py:2154
│ │ │ │ │ │ └─ 1.725 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ │ [4 frames hidden] pymbolic
│ │ │ │ │ │ 1.723 map_subscript loopy/symbolic.py:2049
│ │ │ │ │ │ └─ 1.695 get_access_map loopy/symbolic.py:1906
│ │ │ │ │ │ └─ 0.969 guarded_aff_from_expr loopy/symbolic.py:1514
│ │ │ │ │ │ └─ 0.965 with_aff_conversion_guard loopy/symbolic.py:1492
│ │ │ │ │ │ └─ 0.891 aff_from_expr loopy/symbolic.py:1473
│ │ │ │ │ │ └─ 0.870 pwaff_from_expr loopy/symbolic.py:1488
│ │ │ │ │ └─ 1.257 obj_and islpy/__init__.py:295
│ │ │ │ │ [38 frames hidden] islpy
│ │ │ │ └─ 0.968 discard_dep_reqs_in_order loopy/check.py:663
│ │ │ ├─ 1.354 check_for_integer_subscript_indices loopy/check.py:114
│ │ │ │ └─ 1.321 __call__ loopy/type_inference.py:60
│ │ │ │ └─ 1.315 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [3 frames hidden] pymbolic
│ │ │ │ 1.289 map_sum loopy/type_inference.py:170
│ │ │ │ └─ 1.128 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [3 frames hidden] pymbolic
│ │ │ │ 1.038 map_sum loopy/type_inference.py:170
│ │ │ └─ 1.156 check_bounds loopy/check.py:460
│ │ └─ 1.758 insert_barriers loopy/schedule/__init__.py:1776
│ │ └─ 1.550 insert_barriers loopy/schedule/__init__.py:1776
│ │ └─ 1.078 insert_barriers_at_outer_level loopy/schedule/__init__.py:1789
│ └─ 10.188 preprocess_kernel loopy/preprocess.py:2030
│ ├─ 6.725 wrapper loopy/transform/iname.py:1218
│ │ └─ 5.887 realize_reduction loopy/preprocess.py:881
│ │ ├─ 3.210 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [154 frames hidden] pymbolic
│ │ │ 2.462 map_reduction loopy/symbolic.py:1815
│ │ │ └─ 2.462 map_reduction loopy/preprocess.py:1690
│ │ │ └─ 2.445 map_reduction_seq loopy/preprocess.py:1004
│ │ │ └─ 2.427 wrapper pytools/__init__.py:675
│ │ │ └─ 2.426 find_most_recent_global_barrier loopy/kernel/tools.py:1655
│ │ │ └─ 2.171 <genexpr> loopy/kernel/tools.py:1670
│ │ │ └─ 1.903 _is_global_barrier loopy/kernel/tools.py:1583
│ │ └─ 2.525 replace_instruction_ids loopy/transform/instruction.py:172
│ │ └─ 1.875 [self]
│ ├─ 1.383 realize_ilp loopy/preprocess.py:1965
│ │ └─ 1.383 privatize_temporaries_with_inames loopy/transform/privatize.py:72
│ │ └─ 1.258 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ └─ 1.077 __call__ pymbolic/mapper/__init__.py:114
│ │ [136 frames hidden] pymbolic
│ └─ 0.937 check_reduction_iname_uniqueness loopy/preprocess.py:95
│ └─ 0.933 with_transformed_expressions loopy/kernel/instruction.py:872
├─ 35.939 get_optimized_kernel sumpy/e2e.py:127
│ ├─ 32.290 get_kernel sumpy/e2e.py:146
│ │ ├─ 18.293 make_kernel loopy/kernel/creation.py:1821
│ │ │ ├─ 4.981 duplicate_inames loopy/transform/iname.py:818
│ │ │ │ ├─ 2.807 map_kernel loopy/symbolic.py:995
│ │ │ │ │ └─ 2.803 <listcomp> loopy/symbolic.py:1000
│ │ │ │ │ └─ 2.763 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ │ └─ 2.414 <lambda> loopy/symbolic.py:1002
│ │ │ │ │ └─ 2.400 __call__ loopy/symbolic.py:981
│ │ │ │ │ └─ 2.252 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ [181 frames hidden] pymbolic
│ │ │ │ └─ 2.158 finish_kernel loopy/symbolic.py:899
│ │ │ │ └─ 2.157 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ │ │ │ └─ 2.157 <listcomp> loopy/symbolic.py:792
│ │ │ │ └─ 2.152 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.817 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [181 frames hidden] pymbolic
│ │ │ ├─ 3.111 fix_parameters loopy/transform/parameter.py:134
│ │ │ │ └─ 3.111 _fix_parameter loopy/transform/parameter.py:67
│ │ │ │ ├─ 1.871 map_kernel loopy/symbolic.py:995
│ │ │ │ │ └─ 1.707 <listcomp> loopy/symbolic.py:1000
│ │ │ │ │ └─ 1.703 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ │ └─ 1.503 <lambda> loopy/symbolic.py:1002
│ │ │ │ │ └─ 1.496 __call__ loopy/symbolic.py:981
│ │ │ │ │ └─ 1.422 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ │ [158 frames hidden] pymbolic
│ │ │ │ └─ 1.070 finish_kernel loopy/symbolic.py:899
│ │ │ │ └─ 1.070 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ │ │ │ └─ 1.070 <listcomp> loopy/symbolic.py:792
│ │ │ │ └─ 1.064 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 0.877 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [158 frames hidden] pymbolic
│ │ │ ├─ 1.760 determine_shapes_of_temporaries loopy/kernel/creation.py:1512
│ │ │ │ └─ 1.408 find_shapes_of_vars loopy/kernel/creation.py:1463
│ │ │ │ └─ 1.387 feed_all_expressions loopy/kernel/creation.py:1523
│ │ │ │ └─ 1.384 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.204 <lambda> loopy/kernel/creation.py:1526
│ │ │ │ └─ 1.203 run_through_armap loopy/kernel/creation.py:1469
│ │ │ │ └─ 1.196 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [195 frames hidden] pymbolic
│ │ │ ├─ 1.663 __init__ loopy/kernel/creation.py:1080
│ │ │ │ └─ 0.874 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [165 frames hidden] pymbolic
│ │ │ ├─ 1.254 guess_arg_shape_if_requested loopy/kernel/creation.py:1610
│ │ │ │ └─ 1.254 guess_var_shape loopy/kernel/tools.py:985
│ │ │ │ └─ 1.248 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.087 run_through_armap loopy/kernel/tools.py:992
│ │ │ ├─ 1.226 guess_kernel_args_if_requested loopy/kernel/creation.py:1170
│ │ │ │ └─ 1.215 make_new_arg loopy/kernel/creation.py:1132
│ │ │ │ └─ 1.215 find_index_rank loopy/kernel/creation.py:1116
│ │ │ │ └─ 1.212 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ │ │ └─ 1.059 run_irf loopy/kernel/creation.py:1119
│ │ │ │ └─ 1.045 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [202 frames hidden] pymbolic
│ │ │ └─ 1.146 expand_cses loopy/kernel/creation.py:1321
│ │ │ └─ 0.979 __call__ pymbolic/mapper/__init__.py:114
│ │ │ [154 frames hidden] pymbolic
│ │ └─ 13.927 get_translation_loopy_insns sumpy/e2e.py:91
│ │ ├─ 9.054 to_loopy_insns sumpy/codegen.py:672
│ │ │ ├─ 6.263 <listcomp> sumpy/codegen.py:724
│ │ │ │ └─ 5.732 convert_expr sumpy/codegen.py:705
│ │ │ │ └─ 5.685 __call__ pymbolic/mapper/__init__.py:114
│ │ │ │ [175 frames hidden] pymbolic
│ │ │ └─ 1.098 kill_trivial_assignments sumpy/codegen.py:154
│ │ │ └─ 1.074 substitute pymbolic/mapper/substitutor.py:72
│ │ │ [168 frames hidden] pymbolic
│ │ ├─ 3.728 run_global_cse sumpy/assignment_collection.py:177
│ │ │ └─ 3.720 cse sumpy/cse.py:550
│ │ │ └─ 2.980 opt_cse sumpy/cse.py:357
│ │ │ └─ 2.582 match_common_args sumpy/cse.py:266
│ │ │ └─ 0.898 get_subset_candidates sumpy/cse.py:218
│ │ └─ 1.122 translate_from sumpy/expansion/local.py:182
│ └─ 3.638 split_iname loopy/transform/iname.py:334
│ └─ 3.635 _split_iname_backend loopy/transform/iname.py:211
│ ├─ 1.407 map_kernel loopy/symbolic.py:995
│ │ └─ 1.192 <listcomp> loopy/symbolic.py:1000
│ │ └─ 1.189 with_transformed_expressions loopy/kernel/instruction.py:872
│ │ └─ 1.023 <lambda> loopy/symbolic.py:1002
│ │ └─ 1.019 __call__ loopy/symbolic.py:981
│ │ └─ 0.955 __call__ pymbolic/mapper/__init__.py:114
│ │ [148 frames hidden] pymbolic
│ └─ 1.227 finish_kernel loopy/symbolic.py:899
│ └─ 1.227 rename_subst_rules_in_instructions loopy/symbolic.py:788
│ └─ 1.227 <listcomp> loopy/symbolic.py:792
│ └─ 1.219 with_transformed_expressions loopy/kernel/instruction.py:872
│ └─ 1.046 __call__ pymbolic/mapper/__init__.py:114
│ [151 frames hidden] pymbolic
└─ 7.943 add_and_infer_dtypes loopy/kernel/tools.py:106
└─ 7.939 infer_unknown_types loopy/type_inference.py:485
├─ 5.006 <dictcomp> loopy/type_inference.py:527
│ └─ 4.999 <setcomp> loopy/type_inference.py:528
│ └─ 4.866 [self]
└─ 2.537 _infer_var_type loopy/type_inference.py:407
├─ 1.570 __call__ loopy/type_inference.py:60
│ └─ 1.561 __call__ pymbolic/mapper/__init__.py:114
│ [2 frames hidden] pymbolic
│ 1.289 map_sum loopy/type_inference.py:170
│ └─ 1.123 __call__ pymbolic/mapper/__init__.py:114
│ [2 frames hidden] pymbolic
│ 0.993 map_sum loopy/type_inference.py:170
└─ 0.853 __call__ pymbolic/mapper/__init__.py:114
[153 frames hidden] pymbolic
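To see where the improvements landed, the top-level entries of the two pyinstrument trees can be put side by side. This is just arithmetic on the numbers already posted above; the `timings` dict and `speedup` helper are made up for illustration.

```python
# Seconds copied from the two pyinstrument trees above: (before, after).
timings = {
    "<module> (total)": (136.566, 84.866),
    "generate_code_v2": (73.198, 39.403),
    "preprocess_kernel": (32.329, 10.188),
    "generate_host_or_device_program": (24.813, 16.501),
    "get_one_scheduled_kernel": (14.247, 11.540),
    "get_optimized_kernel": (52.686, 35.939),
    "make_kernel": (25.594, 18.293),
    "get_translation_loopy_insns": (21.709, 13.927),
    "add_and_infer_dtypes": (8.944, 7.943),
}


def speedup(before, after):
    """Ratio of the old time to the new time (values above 1 mean faster)."""
    return before / after


ratios = {name: speedup(*pair) for name, pair in timings.items()}

for name, (before, after) in timings.items():
    print(f"{name:34s} {before:8.3f} -> {after:8.3f}  ({ratios[name]:.2f}x)")
```

By this measure preprocess_kernel improved the most, while add_and_infer_dtypes and scheduling barely moved.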
@inducer, https://github.com/inducer/pymbolic/pull/37 didn't help. Any other suggestions?
The align_two call at https://github.com/inducer/loopy/blob/186f5095a54982b7eb2fda5e4b995d7c047fde1e/loopy/codegen/instruction.py#L43 takes a long time.
That's fixed by https://github.com/inducer/loopy/pull/280