Codegen slow due to excessive islpy calls
View the loopy kernel on Gist.
The profile log, as reported by cProfile, links 90% of the codegen time to islpy routines (a sketch of how such a profile can be collected follows the listing):
3024863 function calls (2382326 primitive calls) in 605.951 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
790411/302110 477.026 0.001 477.870 0.002 islpy/islpy/__init__.py:861(wrapper)
265 83.808 0.316 83.884 0.317 islpy/islpy/__init__.py:1014(obj_eliminate_except)
18792/17160 11.141 0.001 11.482 0.001 loopy/loopy/isl_helpers.py:527(_find_aff_dims)
68 8.268 0.122 19.432 0.286 loopy/loopy/symbolic.py:1884(get_access_range)
74 6.624 0.090 233.594 3.157 loopy/loopy/isl_helpers.py:614(dim_max_with_elimination)
74 6.144 0.083 233.079 3.150 loopy/loopy/isl_helpers.py:609(dim_min_with_elimination)
4608 2.248 0.000 8.858 0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
1070 1.455 0.001 2.364 0.002 loopy/loopy/symbolic.py:1590(__init__)
973/897 0.917 0.001 1.262 0.001 loopy/loopy/symbolic.py:1525(aff_to_expr)
35845 0.646 0.000 0.646 0.000 {built-in method islpy._isl.from_aff}
49830 0.596 0.000 1.861 0.000 islpy/islpy/__init__.py:1096(_set_dim_id)
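For context, here is a minimal sketch of how a profile like the one above can be collected. It assumes a kernel object `knl` built elsewhere with loopy; `lp.generate_code_v2` is loopy's code-generation entry point, and the dump filename is just an illustration.

```python
import cProfile
import pstats

import loopy as lp

# `knl` is assumed to be a loopy kernel built elsewhere (e.g. via lp.make_kernel).
profiler = cProfile.Profile()
profiler.enable()
lp.generate_code_v2(knl)          # run code generation under the profiler
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("tottime")       # matches "Ordered by: internal time" above
stats.dump_stats("codegen.prof")  # keep the raw data around for later inspection
stats.print_stats(20)
```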
Thanks for timing this! A few thoughts:
- At least in a way, this is good news, because it means our choice of implementation language isn't slowing us down much. :)
- I'll further assume that it's time genuinely spent in isl and not in the wrapper. (Do you have any indication otherwise?)
- On the other hand, it means that we have only two ways to act on this:
  - Call isl less (maybe through caching? see the sketch after this list)
  - Make isl itself faster (one possible way to do so would be by activating "small-integer optimization", which (unfortunately) is a build-time option in isl)
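To make the caching idea concrete, here is a minimal sketch of memoizing a hot isl query. The helper name and the keying strategy (the set's textual form plus the dimension index) are purely illustrative, not loopy's actual implementation.

```python
import islpy as isl

# Illustrative cache for a hot isl query; keyed on the set's textual form,
# which is assumed here to be a faithful identity for the sets we pass in.
_dim_max_cache = {}

def cached_dim_max(set_, pos):
    """Memoized wrapper around isl's dim_max (hypothetical helper)."""
    key = (str(set_), pos)
    if key not in _dim_max_cache:
        _dim_max_cache[key] = set_.dim_max(pos)
    return _dim_max_cache[key]

s = isl.Set("{ [i, j] : 0 <= i < 16 and 0 <= j < i }")
print(cached_dim_max(s, 0))  # computed once ...
print(cached_dim_max(s, 0))  # ... served from the cache on repeat calls
```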
> I'll further assume that it's time genuinely spent in isl and not in the wrapper. (Do you have any indication otherwise?)
~~Yes: https://github.com/inducer/islpy/issues/28~~
From the discussion at inducer/islpy#28, it was established that ISL's performance depends on whether it was linked against GMP or imath (the latter is shipped with islpy). Running the same kernel with islpy linked against GMP gives a 3x speed-up! :)
Here's the corresponding log:
2897406 function calls (2299915 primitive calls) in 220.275 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
729855/286691 164.491 0.000 165.192 0.001 islpy/islpy/__init__.py:861(wrapper)
265 25.103 0.095 25.174 0.095 islpy/islpy/__init__.py:1014(obj_eliminate_except)
17736/16152 10.725 0.001 11.068 0.001 loopy/loopy/isl_helpers.py:527(_find_aff_dims)
68 3.318 0.049 8.032 0.118 loopy/loopy/symbolic.py:1884(get_access_range)
74 2.561 0.035 82.331 1.113 loopy/loopy/isl_helpers.py:614(dim_max_with_elimination)
74 2.357 0.032 82.178 1.111 loopy/loopy/isl_helpers.py:609(dim_min_with_elimination)
4632 1.929 0.000 7.909 0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
1070 1.403 0.001 2.342 0.002 loopy/loopy/symbolic.py:1590(__init__)
973/897 0.937 0.001 1.296 0.001 loopy/loopy/symbolic.py:1525(aff_to_expr)
33697 0.732 0.000 0.732 0.000 {built-in method islpy._isl.from_aff}
46890 0.593 0.000 1.867 0.000 islpy/islpy/__init__.py:1096(_set_dim_id)
27169 0.508 0.000 0.871 0.000 islpy/islpy/__init__.py:850(wrapper)
3347 0.500 0.000 0.834 0.000 islpy/islpy/__init__.py:690(_number_to_expr_like)
22345 0.419 0.000 0.419 0.000 islpy/islpy/__init__.py:548(set_get_basic_sets)
10367 0.380 0.000 2.568 0.000 islpy/islpy/__init__.py:315(space_get_var_dict)
148 0.353 0.002 11.606 0.078 loopy/loopy/isl_helpers.py:550(_find_noninteracting_dims)
21703 0.311 0.000 0.311 0.000 islpy/islpy/__init__.py:805(val_to_python)
There's clearly more to be fixed. (I'm exploring whether we are unnecessarily invoking ISL routines anywhere.)
cc @wence- (who may be interested in ways to speed up loopy, but see the license concerns at https://github.com/inducer/islpy/issues/28)
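One way to do that exploration, sketched here under the assumption that the profile data was dumped to a file such as `codegen.prof` (the filename is illustrative), is to ask pstats which call sites funnel into the islpy wrappers:

```python
import pstats

# Group the profiled calls by caller to see which loopy routines are
# responsible for most of the trips into the islpy wrappers.
stats = pstats.Stats("codegen.prof")   # assumed dump from an earlier profiling run
stats.sort_stats("ncalls")
stats.print_callers("islpy")           # restrict the caller report to islpy frames
```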
After #149 and #150, and with the suggestions from inducer/islpy#28 applied, the codegen profile looks better:
3501511 function calls (2873484 primitive calls) in 71.862 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
30039 26.195 0.001 26.610 0.001 islpy/islpy/__init__.py:850(wrapper)
735171/248679 19.323 0.000 20.050 0.000 islpy/islpy/__init__.py:861(wrapper)
48 7.116 0.148 7.129 0.149 islpy/islpy/__init__.py:1014(obj_eliminate_except)
94 4.831 0.051 11.872 0.126 loopy/loopy/symbolic.py:1652(get_access_range)
21/2 2.394 0.114 54.000 27.000 loopy/loopy/codegen/loop.py:347(generate_sequential_loop_dim_code)
5073 2.009 0.000 8.444 0.002 islpy/islpy/__init__.py:1100(_align_dim_type)
1200 1.560 0.001 2.617 0.002 loopy/loopy/symbolic.py:1354(__init__)
1005/925 1.071 0.001 1.460 0.002 loopy/loopy/symbolic.py:1289(aff_to_expr)
We are still doing around 0.3M ISL calls, which is not ideal, but I'm fine with closing this issue.
Let's leave it open for a bit. I think there's more we can do.
Profile-guided optimization when compiling isl might modestly improve performance.