anonlink Popcount function is under performing

In branch hlaw-fix-issue56 the raw popcount function performance is reported to be about 12 GiB/s (on my machine), however (essentially) the same code performs at 18 GiB/s when called from a basic wrapper function thus:

// -*- compile-command: "g++ -ggdb -Wall -Wextra -std=c++11 -O3 -mssse3 -mpopcnt -fvisibility=hidden -o bench bench.cc dice_one_against_many.cpp" -*-

#include <iostream>
#include <stdint.h>
using namespace std;

double popcount_1024_array(const char *many, int n, uint32_t *counts_many);

int main(int argc, char *argv[]) {
    static constexpr int KEYWORDS = 16;
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    uint64_t *buf = new uint64_t[KEYWORDS * n];
    uint32_t *counts = new uint32_t[n];
    long nbytes = n*KEYWORDS*sizeof(uint64_t);
    double t = popcount_1024_array(reinterpret_cast<const char *>(buf), n, counts);
    cout << "Time: " << t << "ms => " << (n/1e6) * (1e3/t) << " 1e6 popc/s => "
         <<  (nbytes / (double)(1 << 20)) * (1e3/t) << " MiB/s" << endl;
    delete[] counts;
    delete[] buf;
    return 0;
}

Aha! Link: https://csiro.aha.io/features/ANONLINK-71

Feb 08 '18 23:02 unzvfu

Some resources that have come up while investigating this:

https://stackoverflow.com/questions/48981887/how-can-i-obtain-consistently-high-throughput-in-this-loop
https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug
https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/480004
http://0x80.pl/articles/sse-popcount.html
https://stackoverflow.com/questions/45376006/large-0-1-matrix-multiplication-using-bitwise-and-and-popcount-instead-of-actu/45392171

Mar 02 '18 01:03 unzvfu

So, it turns out that at least part of the disparity between the observed throughput of (about 12-14 GiB/s) and the actual (expected) bandwidth (about 20 GiB/s on a 2.4 GHz CPU) of popcounting is explained by the difference between doing a "lots of popcounts on small arrays" (the first case) and doing "fewer popcounts on larger arrays" (the second case). This is presumably because the "output pipeline" is stalling/stumbling at the point where the popcount is calculated and written back to memory, which happens much more often in the first case compared to the second.

There is no obvious fix to this. Further investigation would be necessary to find a solution to this (which would almost certainly be a fiddly and low-level one).

Mar 09 '18 03:03 unzvfu

I'm having a look at switching to using https://github.com/kimwalisch/libpopcnt instead of hand rolling our own - this should also help us support other platforms and take advantage of newer/different instruction sets too.

https://github.com/data61/anonlink/tree/windows-support

Oct 24 '19 01:10 hardbyte