h3 icon indicating copy to clipboard operation
h3 copied to clipboard

[WIP] Fast isValidCell

Open ajfriend opened this issue 4 years ago • 7 comments

In working on a new compactCells implementation, isValidCell was showing up as a bottleneck in some of the benchmarks, so I started playing around with optimizing it.

This is still experimental (e.g., I'm still using 7 instead of the INVALID_DIGIT macro just because I find that easier for prototyping).

Putting this up now as a draft PR to get some early feedback, to take notes on what I've tried so far, and to discuss different implementation options.

Current benchmarks have this new implementation taking about 60% of the time of the current implementation. Note that the benchmarks can depend on the resolution of the cells; low resolution cells tend to go faster because the last INVALID_DIGIT check can avoid more looping (I think...).

ajfriend avatar Jul 05 '21 22:07 ajfriend

Coverage Status

Coverage decreased (-2.5%) to 96.201% when pulling 5a707476e77f2378b73088d5ff7de500bf19f2f0 on ajfriend:fast_isValidCell into 42f56e35a9f568d82fce294dd3f57ea2dc6b2e18 on uber:master.

coveralls avatar Jul 05 '21 22:07 coveralls

Some benchmark runs. New algo seems to take roughly 65% or 53% of the time of the current, depending on the input cells.

new algo

build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1141.561000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 785.043000 microseconds per iteration (1000 iterations)
build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1138.550000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 783.216000 microseconds per iteration (1000 iterations)
build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1141.815000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 782.925000 microseconds per iteration (1000 iterations)

current algo

build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1788.558000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 1462.292000 microseconds per iteration (1000 iterations)
build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1750.815000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 1502.663000 microseconds per iteration (1000 iterations)
build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1773.345000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 1467.033000 microseconds per iteration (1000 iterations)

Here are the ratios:

new = [
    [1141.561000, 785.043000],
    [1138.550000, 783.216000],
    [1141.815000, 782.925000],
]
old = [
    [1788.558000, 1462.292000],
    [1750.815000, 1502.663000],
    [1773.345000, 1467.033000],
]

np.array(new)/np.array(old)

gives

array([[0.63825775, 0.53685789],
       [0.65029715, 0.52121866],
       [0.6438764 , 0.5336792 ]])

ajfriend avatar Jul 05 '21 23:07 ajfriend

todo: maybe add a benchmark that runs through all the cells in a resolution.

ajfriend avatar Jul 15 '21 23:07 ajfriend

@slaperche-zenly It looks like your fast version of the pentagon branch would depend on a C equivalent of leading_zeros(): https://gist.github.com/slaperche-zenly/204e9b8e305cfb8ce2eec49fbe7b9396#file-lib-rs-L62

Anyone have an idea if there is such a C equivalent?

ajfriend avatar Feb 13 '22 21:02 ajfriend

@slaperche-zenly It looks like your fast version of the pentagon branch would depend on a C equivalent of leading_zeros(): https://gist.github.com/slaperche-zenly/204e9b8e305cfb8ce2eec49fbe7b9396#file-lib-rs-L62

Anyone have an idea if there is such a C equivalent?

C compilers usually expose this as a builtin or intrinsic (IIRC it's __builtin_clz for GCC), but naming may differ across compilers and would require some conditional compilation to select the right one, plus a fallback on a handwritten version for unsupported compiler/architecture (this page lists some way to implement it, they call it "Find the log base 2 of an integer")

slaperche-zenly avatar Feb 13 '22 21:02 slaperche-zenly

C compilers usually expose this as a builtin or intrinsic (IIRC it's __builtin_clz for GCC), but naming may differ across compilers and would require some conditional compilation to select the right one, plus a fallback on a handwritten version for unsupported compiler/architecture (this page lists some way to implement it, they call it "Find the log base 2 of an integer")

That page is cool! I was thinking of the trick converting the int to a float and extracting the exponent, but not sure how portable that code would be. @isaacbrodsky @nrabinowitz @dfellis, thoughts?

ajfriend avatar Feb 13 '22 21:02 ajfriend

I was thinking of the trick converting the int to a float and extracting the exponent, but not sure how portable that code would be

As far as portability goes, you should be fine as long as you have IEEE-754 compliant floating point numbers, which should be everywhere (bar a few embedded or old/exotic systems).

I would be more worried about the type punning that may or may not run afoul of the strict aliasing rules (IIRC, the standard is not super clear on what’s allowed and what summon nasal daemons...) Cf. here for a quick overview of the mess 😅

The approach based on DeBruijn sequence, adapted to 64-bit, sounds safer to me.

slaperche-zenly avatar Feb 13 '22 23:02 slaperche-zenly