clifford
Remove workarounds now that sparse-0.9.1 has numba support
xref https://github.com/pydata/sparse/pull/307
This looks exciting, really nice, @eric-wieser! Is there any impact on performance?
Who knows? We don't really have any canonical benchmarks.
Mac CI is failing because conda is pinned to a super old version, but I'm tempted to deal with that later.
Hmm, seems ~1.4x slower to start up:
With this patch:
In [1]: import clifford
In [2]: %timeit clifford.Cl(5)
813 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit clifford.Cl(5)
872 ms ± 47.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Without this patch:
In [1]: import clifford
In [2]: %timeit clifford.Cl(5)
596 ms ± 9.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit clifford.Cl(5)
658 ms ± 40.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
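For reference, a comparison like the one above can be reproduced outside IPython with the stdlib timeit module. This is a minimal sketch with a hypothetical stand-in workload; to benchmark the real thing, replace the body of workload with clifford.Cl(5):

```python
# Sketch of a %timeit-style measurement using the stdlib timeit module.
# `workload` is a hypothetical stand-in; the real benchmark would call
# clifford.Cl(5) here instead.
import statistics
import timeit

def workload():
    sum(i * i for i in range(10_000))  # stand-in for clifford.Cl(5)

# 7 runs of 1 loop each, mirroring the %timeit output above
runs = timeit.repeat(workload, repeat=7, number=1)
print(f"{statistics.mean(runs) * 1e3:.1f} ms ± "
      f"{statistics.stdev(runs) * 1e3:.1f} ms per loop "
      f"(mean ± std. dev. of 7 runs, 1 loop each)")
```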
@stuartarchibald, can you think of why this would be the case?
Multiplication hasn't really changed. After:
In [7]: %timeit e1 * e234
7.69 µs ± 175 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [8]: %timeit e1 * e234
7.48 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Before:
In [9]: %timeit e1 * e234
8.61 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [10]: %timeit e1 * e234
7.91 µs ± 322 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@eric-wieser perhaps this: https://github.com/numba/numba/issues/4927 ?
The before and after measurements are done with the same dependency versions, and are taken after numba and sparse have already been imported; it seems that the compilation time is where the cost comes from.
Ah, could well be. There's some degree of caching available via cache=True in the jit decorators; there's also compilation pass timing metadata available if need be.
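A minimal sketch of what opting into that on-disk cache looks like. The function here is a hypothetical stand-in, and the fallback decorator exists only so the snippet runs without numba installed:

```python
# Sketch: enabling numba's on-disk compilation cache via cache=True.
# With the cache enabled, compiled machine code is written to disk on the
# first run and reloaded on later runs, skipping recompilation.
try:
    from numba import njit
except ImportError:
    # Fallback no-op decorator so this sketch runs without numba installed.
    def njit(**kwargs):
        def wrap(func):
            return func
        return wrap

@njit(cache=True)
def gmt_sketch(a, b):  # hypothetical stand-in for a compiled operator function
    return a * b + a

print(gmt_sketch(2.0, 3.0))  # → 8.0
```

Note that cache=True only helps across processes; within one process, numba already caches compiled overloads in memory.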
Before this patch:
>>> next(iter(Cl(5)[0].gmt_func.overloads.values())).metadata
{'parfor_diagnostics': ParforDiagnostics,
'pipeline_times': {'nopython': OrderedDict([('0_translate_bytecode',
pass_timings(init=5.699999746866524e-06, run=0.009091500003705733, finalize=6.000002031214535e-06)),
('1_fixup_args',
pass_timings(init=3.7999998312443495e-06, run=7.299997378140688e-06, finalize=3.000008291564882e-06)),
('2_ir_processing',
pass_timings(init=2.4000037228688598e-06, run=0.003318200004287064, finalize=4.799992893822491e-06)),
('3_with_lifting',
pass_timings(init=4.1000021155923605e-06, run=0.004175599999143742, finalize=5.699999746866524e-06)),
('4_rewrite_semantic_constants',
pass_timings(init=4.699992132373154e-06, run=0.0002558999985922128, finalize=3.1999952625483274e-06)),
('5_dead_branch_prune',
pass_timings(init=2.4999899324029684e-06, run=0.0006873999955132604, finalize=4.8000074457377195e-06)),
('6_generic_rewrites',
pass_timings(init=2.4000037228688598e-06, run=0.01509299999452196, finalize=3.000008291564882e-06)),
('7_inline_closure_likes',
pass_timings(init=1.7999991541728377e-06, run=0.0041402999922866, finalize=5.299996701069176e-06)),
('8_make_function_op_code_to_jit_function',
pass_timings(init=5.1000097300857306e-06, run=5.8599995099939406e-05, finalize=2.1000014385208488e-06)),
('9_inline_inlinables',
pass_timings(init=2.0000006770715117e-06, run=9.730001329444349e-05, finalize=1.5999976312741637e-06)),
('10_dead_branch_prune',
pass_timings(init=1.3999961083754897e-06, run=0.0006452999950852245, finalize=2.5000044843181968e-06)),
('11_find_literally',
pass_timings(init=2.2000021999701858e-06, run=9.169999975711107e-05, finalize=1.600012183189392e-06)),
('12_literal_unroll',
pass_timings(init=1.5999976312741637e-06, run=5.629999213851988e-05, finalize=2.4000037228688598e-06)),
('13_nopython_type_inference',
pass_timings(init=2.2000021999701858e-06, run=0.027430600006482564, finalize=4.799992893822491e-06)),
('14_annotate_types',
pass_timings(init=2.4999899324029684e-06, run=2.750000567175448e-05, finalize=2.5000044843181968e-06)),
('15_inline_overloads',
pass_timings(init=2.0000006770715117e-06, run=0.00013780000153928995, finalize=1.6999983927235007e-06)),
('16_nopython_rewrites',
pass_timings(init=2.6000052457675338e-06, run=0.003319099996588193, finalize=4.2000028770416975e-06)),
('17_ir_legalization',
pass_timings(init=2.3000029614195228e-06, run=0.002335299999685958, finalize=3.1999952625483274e-06)),
('18_nopython_backend',
pass_timings(init=2.0000006770715117e-06, run=0.22158709999348503, finalize=4.1000021155923605e-06)),
('19_dump_parfor_diagnostics',
pass_timings(init=2.2000021999701858e-06, run=1.68999977177009e-05, finalize=6.100002792663872e-06))])}}
After this patch:
{'parfor_diagnostics': ParforDiagnostics,
'pipeline_times': {'nopython': OrderedDict([('0_translate_bytecode',
pass_timings(init=3.599999985226532e-06, run=0.005174699999997756, finalize=3.7000000077114237e-06)),
('1_fixup_args',
pass_timings(init=1.6999999843392288e-06, run=3.599999985226532e-06, finalize=9.999999974752427e-07)),
('2_ir_processing',
pass_timings(init=1.1999999856016075e-06, run=0.0019801000000256863, finalize=2.2000000114985596e-06)),
('3_with_lifting',
pass_timings(init=1.8000000068241206e-06, run=0.0027746000000092863, finalize=3.2000000089738023e-06)),
('4_rewrite_semantic_constants',
pass_timings(init=1.499999996212864e-06, run=0.0001085999999759224, finalize=1.499999996212864e-06)),
('5_dead_branch_prune',
pass_timings(init=1.4000000021496817e-06, run=0.0003125000000068212, finalize=2.099999989013668e-06)),
('6_generic_rewrites',
pass_timings(init=1.7000000127609383e-06, run=0.020567699999986644, finalize=2.3999999996249244e-06)),
('7_inline_closure_likes',
pass_timings(init=1.7000000127609383e-06, run=0.003134299999999257, finalize=3.8999999958377884e-06)),
('8_make_function_op_code_to_jit_function',
pass_timings(init=1.499999996212864e-06, run=4.7899999998435305e-05, finalize=1.099999991538425e-06)),
('9_inline_inlinables',
pass_timings(init=1.4000000021496817e-06, run=8.489999999028441e-05, finalize=1.499999996212864e-06)),
('10_dead_branch_prune',
pass_timings(init=1.1999999856016075e-06, run=0.0003067000000100961, finalize=1.5999999902760464e-06)),
('11_find_literally',
pass_timings(init=1.3000000080864993e-06, run=6.790000000478358e-05, finalize=1.099999991538425e-06)),
('12_literal_unroll',
pass_timings(init=1.099999991538425e-06, run=4.330000001573353e-05, finalize=1.3000000080864993e-06)),
('13_nopython_type_inference',
pass_timings(init=1.499999996212864e-06, run=0.026779699999991635, finalize=3.399999997100167e-06)),
('14_annotate_types',
pass_timings(init=1.9999999949504854e-06, run=2.7200000005223046e-05, finalize=1.499999996212864e-06)),
('15_inline_overloads',
pass_timings(init=1.4000000021496817e-06, run=0.0002345999999988635, finalize=1.5999999902760464e-06)),
('16_nopython_rewrites',
pass_timings(init=1.3000000080864993e-06, run=0.0052894000000094366, finalize=4.5999999827017746e-06)),
('17_ir_legalization',
pass_timings(init=3.10000001491062e-06, run=0.004680199999995693, finalize=4.500000017060302e-06)),
('18_nopython_backend',
pass_timings(init=2.8000000042993634e-06, run=0.3401664000000153, finalize=3.399999997100167e-06)),
('19_dump_parfor_diagnostics',
pass_timings(init=1.5999999902760464e-06, run=8.699999995087637e-06, finalize=1.1999999856016075e-06))])}}
@eric-wieser So it actually comes out faster?
No - note the 18_nopython_backend section, which is ~0.12 s longer. This is just for one of our compiled operator functions, and we have maybe around 5 of them. That was something @stuartarchibald asked for, so I figured I'd post it here.
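The pipeline_times dumps above can be compared programmatically rather than by eye. A sketch, assuming the pass_timings namedtuple shape shown; the two entries below are excerpted from the "before" dump rather than fetched live:

```python
# Sketch: totalling per-pass compilation times from numba's metadata.
# In practice the dict comes from
#   next(iter(fn.overloads.values())).metadata['pipeline_times']
# Here it is reconstructed with two entries excerpted from the dump above.
from collections import OrderedDict, namedtuple

pass_timings = namedtuple("pass_timings", ["init", "run", "finalize"])

pipeline_times = {"nopython": OrderedDict([
    ("13_nopython_type_inference",
     pass_timings(2.2e-06, 0.027430600006482564, 4.8e-06)),
    ("18_nopython_backend",
     pass_timings(2.0e-06, 0.22158709999348503, 4.1e-06)),
])}

passes = pipeline_times["nopython"]
total_run = sum(t.run for t in passes.values())
slowest_name, slowest = max(passes.items(), key=lambda kv: kv[1].run)
print(f"total run time: {total_run:.3f} s; "
      f"slowest pass: {slowest_name} ({slowest.run:.3f} s)")
```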