Codegen slow due to excessive islpy calls
View the loopy kernel on Gist.
The profile log, as reported by cProfile, links 90% of the codegen time to islpy routines (a sketch of how such a profile can be collected follows the listing):
3024863 function calls (2382326 primitive calls) in 605.951 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
790411/302110 477.026 0.001 477.870 0.002 islpy/islpy/__init__.py:861(wrapper)
265 83.808 0.316 83.884 0.317 islpy/islpy/__init__.py:1014(obj_eliminate_except)
18792/17160 11.141 0.001 11.482 0.001 loopy/loopy/isl_helpers.py:527(_find_aff_dims)
68 8.268 0.122 19.432 0.286 loopy/loopy/symbolic.py:1884(get_access_range)
74 6.624 0.090 233.594 3.157 loopy/loopy/isl_helpers.py:614(dim_max_with_elimination)
74 6.144 0.083 233.079 3.150 loopy/loopy/isl_helpers.py:609(dim_min_with_elimination)
4608 2.248 0.000 8.858 0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
1070 1.455 0.001 2.364 0.002 loopy/loopy/symbolic.py:1590(__init__)
973/897 0.917 0.001 1.262 0.001 loopy/loopy/symbolic.py:1525(aff_to_expr)
35845 0.646 0.000 0.646 0.000 {built-in method islpy._isl.from_aff}
49830 0.596 0.000 1.861 0.000 islpy/islpy/__init__.py:1096(_set_dim_id)
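For context, here is a minimal sketch of how a profile like the one above can be collected. It assumes a kernel object `knl` built elsewhere with loopy; `lp.generate_code_v2` is loopy's code-generation entry point, and the dump filename is just an illustration.

```python
import cProfile
import pstats

import loopy as lp

# `knl` is assumed to be a loopy kernel built elsewhere (e.g. via lp.make_kernel).
profiler = cProfile.Profile()
profiler.enable()
lp.generate_code_v2(knl)          # run code generation under the profiler
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("tottime")       # matches "Ordered by: internal time" above
stats.dump_stats("codegen.prof")  # keep the raw data around for later inspection
stats.print_stats(20)
```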
Thanks for timing this! A few thoughts:
- At least in a way, this is good news, because it means our choice of implementation language isn't slowing us down much. :)
- I'll further assume that it's time genuinely spent in isl and not in the wrapper. (Do you have any indication otherwise?)
- On the other hand, it means that we have only two ways to act on this:
  - Call isl less (maybe through caching? see the sketch after this list)
  - Make isl itself faster (one possible way to do so would be by activating "small-integer optimization", which (unfortunately) is a build-time option in isl)
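To make the caching idea concrete, here is a minimal sketch of memoizing a hot isl query. The helper name and the keying strategy (the set's textual form plus the dimension index) are purely illustrative, not loopy's actual implementation.

```python
import islpy as isl

# Illustrative cache for a hot isl query; keyed on the set's textual form,
# which is assumed here to be a faithful identity for the sets we pass in.
_dim_max_cache = {}

def cached_dim_max(set_, pos):
    """Memoized wrapper around isl's dim_max (hypothetical helper)."""
    key = (str(set_), pos)
    if key not in _dim_max_cache:
        _dim_max_cache[key] = set_.dim_max(pos)
    return _dim_max_cache[key]

s = isl.Set("{ [i, j] : 0 <= i < 16 and 0 <= j < i }")
print(cached_dim_max(s, 0))  # computed once ...
print(cached_dim_max(s, 0))  # ... served from the cache on repeat calls
```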
> I'll further assume that it's time genuinely spent in isl and not in the wrapper. (Do you have any indication otherwise?)
~~Yes: https://github.com/inducer/islpy/issues/28~~
From the discussion at inducer/islpy#28, it was established that ISL's performance depends on whether it was linked against GMP or imath (the latter is shipped with islpy). Running the same kernel with islpy linked against GMP gives a 3x speed-up! :)
Here's the corresponding log:
2897406 function calls (2299915 primitive calls) in 220.275 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
729855/286691 164.491 0.000 165.192 0.001 islpy/islpy/__init__.py:861(wrapper)
265 25.103 0.095 25.174 0.095 islpy/islpy/__init__.py:1014(obj_eliminate_except)
17736/16152 10.725 0.001 11.068 0.001 loopy/loopy/isl_helpers.py:527(_find_aff_dims)
68 3.318 0.049 8.032 0.118 loopy/loopy/symbolic.py:1884(get_access_range)
74 2.561 0.035 82.331 1.113 loopy/loopy/isl_helpers.py:614(dim_max_with_elimination)
74 2.357 0.032 82.178 1.111 loopy/loopy/isl_helpers.py:609(dim_min_with_elimination)
4632 1.929 0.000 7.909 0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
1070 1.403 0.001 2.342 0.002 loopy/loopy/symbolic.py:1590(__init__)
973/897 0.937 0.001 1.296 0.001 loopy/loopy/symbolic.py:1525(aff_to_expr)
33697 0.732 0.000 0.732 0.000 {built-in method islpy._isl.from_aff}
46890 0.593 0.000 1.867 0.000 islpy/islpy/__init__.py:1096(_set_dim_id)
27169 0.508 0.000 0.871 0.000 islpy/islpy/__init__.py:850(wrapper)
3347 0.500 0.000 0.834 0.000 islpy/islpy/__init__.py:690(_number_to_expr_like)
22345 0.419 0.000 0.419 0.000 islpy/islpy/__init__.py:548(set_get_basic_sets)
10367 0.380 0.000 2.568 0.000 islpy/islpy/__init__.py:315(space_get_var_dict)
148 0.353 0.002 11.606 0.078 loopy/loopy/isl_helpers.py:550(_find_noninteracting_dims)
21703 0.311 0.000 0.311 0.000 islpy/islpy/__init__.py:805(val_to_python)
There's clearly more to be fixed. (I'm exploring whether we are unnecessarily invoking ISL routines anywhere.)
cc @wence- (who may be interested in ways to speed up loopy, but see the license concerns at https://github.com/inducer/islpy/issues/28)
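One way to do that exploration, sketched here under the assumption that the profile data was dumped to a file such as `codegen.prof` (the filename is illustrative), is to ask pstats which call sites funnel into the islpy wrappers:

```python
import pstats

# Group the profiled calls by caller to see which loopy routines are
# responsible for most of the trips into the islpy wrappers.
stats = pstats.Stats("codegen.prof")   # assumed dump from an earlier profiling run
stats.sort_stats("ncalls")
stats.print_callers("islpy")           # restrict the caller report to islpy frames
```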
After #149 and #150, and with the suggestions from inducer/islpy#28 applied, the codegen profile looks better:
3501511 function calls (2873484 primitive calls) in 71.862 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
30039 26.195 0.001 26.610 0.001 islpy/islpy/__init__.py:850(wrapper)
735171/248679 19.323 0.000 20.050 0.000 islpy/islpy/__init__.py:861(wrapper)
48 7.116 0.148 7.129 0.149 islpy/islpy/__init__.py:1014(obj_eliminate_except)
94 4.831 0.051 11.872 0.126 loopy/loopy/symbolic.py:1652(get_access_range)
21/2 2.394 0.114 54.000 27.000 loopy/loopy/codegen/loop.py:347(generate_sequential_loop_dim_code)
5073 2.009 0.000 8.444 0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
1200 1.560 0.001 2.617 0.002 loopy/loopy/symbolic.py:1354(__init__)
1005/925 1.071 0.001 1.460 0.002 loopy/loopy/symbolic.py:1289(aff_to_expr)
We are still doing around 0.3M ISL calls, which is not ideal, but I'm fine with closing this issue.
Let's leave it open for a bit. I think there's more we can do.
Profile-guided optimization when compiling isl might modestly improve performance.