Speed ups from compiling with specific arch
We should discuss how best to deal with the fact that compilers are getting smarter, but only if you tell them which arch you are targeting. For example, https://godbolt.org/g/8EyZEJ counts the number of set bits, which on Haswell (any not-very-old Intel CPU) or newer compiles to a single instruction made specifically for this (popcnt). Remove the -march=haswell to see the long form.
On my desktop, compiling khmer with -march=skylake gives a speedup of a few percent.
Not sure what the recommended arch is for binaries distributed via PyPI but I'd bet it isn't -march=haswell. So we can't just put it into setup.py.
Credit for making me think about this goes to https://www.youtube.com/watch?v=bSkpMdDe4g4, which also mentions various other tricks.
Figuring out the arch automatically is surprisingly difficult to do. import platform; platform.platform() can tell you, if your system feels like it; parsing /proc/cpuinfo can as well, once again, if your system feels like it. On my lab desktop, /proc/cpuinfo will tell me that I have an i7-3820, but not that it's a Sandy Bridge chip. On my MacBook, /usr/sbin/sysctl -e machdep.cpu will give me a bunch of numerical codes for model, family, etc., which can probably be translated, but aren't informative on their own.
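One partial workaround on Linux is to skip the microarchitecture name and look at the feature flags instead, since /proc/cpuinfo does list those on x86. A minimal sketch, assuming a Linux x86 layout where the line starts with "flags" (other platforms use different keys or have no such file); the function names are made up for illustration:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags found in /proc/cpuinfo-style text.

    Assumes the x86 Linux layout, where one line looks like
    'flags : fpu vme ... popcnt ...'. Returns an empty set otherwise.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def read_cpu_flags(path="/proc/cpuinfo"):
    """Read flags from the given file; empty set if it cannot be read."""
    try:
        with open(path) as handle:
            return cpu_flags(handle.read())
    except OSError:
        return set()
```

Something like "popcnt" in read_cpu_flags() would then say whether the instruction is available, without knowing what the chip is called. It does nothing for macOS.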
Best option is probably to let users pass in their own -march, but I don't know if that can be done with pip either...
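One way this could work with pip is an environment variable, since the environment is inherited by the setup.py process that pip runs for source builds. A rough sketch, assuming a variable named KHMER_MARCH (hypothetical, khmer defines no such variable) and a helper that setup.py would call:

```python
import os


def user_march_flags(environ=None):
    """Return ['-march=<value>'] if the user requested an arch, else [].

    KHMER_MARCH is a hypothetical variable name used only for this sketch.
    """
    if environ is None:
        environ = os.environ
    march = environ.get("KHMER_MARCH", "").strip()
    return ["-march=" + march] if march else []
```

The returned list could be appended to the extra_compile_args of the Extension in setup.py. This only matters for source installs; a prebuilt wheel is already compiled.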
-- Camille Scott
Graduate Group for Computer Science Lab for Data Intensive Biology University of California, Davis
-march=native seems to do the right thing when testing on my laptop (super old, not Haswell) and my Linux desktop.
Doesn't solve the question of what arch we should use when building binaries for others to use.
You can try to follow what is being done in this lib: https://github.com/kimwalisch/libpopcnt (they detect at runtime what is available). Not sure how scalable the solution is for more instructions, and not even sure if it is a good idea (since we want to let the compiler take care of it), but I thought it was worth throwing this here.
Can setup.py execute some minimal code at compile time to detect the architecture and adjust options accordingly?
This is getting into the realm of hairy, limited-shelf-life solutions, admittedly.
Right now I think -march=native would be good enough for most people (and the speedups seem to be small anyway, so not worth adding too much magic?). Maybe with some if statements in setup.py to detect when we are building wheels/binaries for distribution, in which case we turn it off or set it to whatever the recommended arch is.
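The if statement could look roughly like this. It is only a sketch: KHMER_DIST_BUILD is a hypothetical variable that the release script would have to set. It checks an environment variable rather than looking for bdist_wheel in sys.argv because pip can also build a wheel locally when installing from source, so the command name alone does not say the result is meant for other machines.

```python
import os


def default_march_flags(environ=None):
    """Return ['-march=native'] for local builds, [] for distribution builds.

    KHMER_DIST_BUILD is a hypothetical marker set by whoever builds the
    binaries that get uploaded; it is not an existing khmer or pip setting.
    """
    if environ is None:
        environ = os.environ
    if environ.get("KHMER_DIST_BUILD"):
        # Binaries for other people's machines: leave the compiler default.
        return []
    return ["-march=native"]
```

In setup.py the result would go into the Extension's extra_compile_args. Returning an empty list for distribution builds is the "turn it off" option; a recommended arch could be substituted there once we know what it is.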