numexpr
NE3: Try defining BLOCK_SIZE in bytes rather than elements
In NE2, loops were unrolled, which created a strong preference for defining BLOCK_SIZE in elements rather than bytes. As a result, BLOCK_SIZE was generally tuned to the L1 cache for float64. Because NE3 now relies on vectorization, we see no performance difference between fixed-length and variable-length loops, so the unrolling has already been commented out (which reduces compilation time significantly). It may therefore make sense to redefine BLOCK_SIZE in terms of bytes.
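To make the difference between the two conventions concrete, here is a small C sketch. The constants BLOCK_ELEMENTS and BLOCK_BYTES and both helper functions are hypothetical illustrations, not numexpr's actual values or code. BLOCK_BYTES is simply chosen to equal 1024 float64 elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element-based block size, standing in for an NE2-style
   BLOCK_SIZE tuned for float64 (the value is an assumption). */
#define BLOCK_ELEMENTS 1024

/* Hypothetical byte budget per block: the same 8 KiB that
   BLOCK_ELEMENTS float64 values occupy (1024 * 8). */
#define BLOCK_BYTES 8192

/* Bytes touched by one operand when the block is sized in elements:
   the footprint grows or shrinks with the dtype's item size. */
static size_t bytes_per_block_elementwise(size_t itemsize)
{
    assert(itemsize > 0);
    return (size_t)BLOCK_ELEMENTS * itemsize;
}

/* Elements per block when the block is sized in bytes: the footprint
   stays fixed and the element count adapts to the dtype instead. */
static size_t elements_per_block_bytewise(size_t itemsize)
{
    assert(itemsize > 0);
    return (size_t)BLOCK_BYTES / itemsize;
}
```

With these assumed values, an element-based block touches 8192 bytes per operand for float64 but only 4096 bytes for float32 and 1024 bytes for bool. A byte-based block keeps the footprint at 8192 bytes and instead yields 1024, 2048, or 8192 elements respectively.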
Steps to complete:
1.) Ideally, the item size could be embedded in the function itself by the code_generator. Or,
2.) Insert a section of macros above the #include interp_body.cpp that calculates the appropriate block size in elements from CACHE_SIZE.
3.) Ideally, CACHE_SIZE should be configurable through an argument to setup.py.
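Option 2, combined with step 3, could look roughly like the following C sketch. CACHE_SIZE, CACHE_SHARE, BLOCK_SIZE_FOR, ITEMSIZE, block_size_for and the block_size_* constants are hypothetical names. The 32 KiB default is an assumed L1 size, and the static constants stand in for the spot where #include interp_body.cpp would sit. None of this is taken from the numexpr sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache budget in bytes (assumed 32 KiB L1). The #ifndef
   guard lets the build override it, e.g. with -DCACHE_SIZE=65536. */
#ifndef CACHE_SIZE
#define CACHE_SIZE 32768
#endif

/* Hypothetical share of the cache given to one operand block, since an
   expression streams several inputs plus an output through L1. */
#ifndef CACHE_SHARE
#define CACHE_SHARE 4
#endif

/* Block size in elements, derived from the byte budget and item size. */
#define BLOCK_SIZE_FOR(itemsize) (CACHE_SIZE / CACHE_SHARE / (itemsize))

/* float64 section: define the sizes, then the body would be included. */
#define ITEMSIZE sizeof(double)
#define BLOCK_SIZE BLOCK_SIZE_FOR(ITEMSIZE)
/* #include interp_body.cpp would go here */
static const size_t block_size_float64 = BLOCK_SIZE;
#undef BLOCK_SIZE
#undef ITEMSIZE

/* float32 section: same pattern, different item size. */
#define ITEMSIZE sizeof(float)
#define BLOCK_SIZE BLOCK_SIZE_FOR(ITEMSIZE)
/* #include interp_body.cpp would go here */
static const size_t block_size_float32 = BLOCK_SIZE;
#undef BLOCK_SIZE
#undef ITEMSIZE

/* Runtime equivalent for item sizes known only at run time. */
static size_t block_size_for(size_t itemsize)
{
    assert(itemsize > 0 && itemsize <= (size_t)(CACHE_SIZE / CACHE_SHARE));
    return BLOCK_SIZE_FOR(itemsize);
}
```

Because the defaults are wrapped in #ifndef, a build could pass a different CACHE_SIZE as a compiler define; how setup.py would expose that as an option is left open here.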