parlaylib
parlaylib copied to clipboard
`parlay::sequence` constructor much slower than reserve + fill, and both slower than `std::vector`
The following code measures
- Time for creating
parlay::sequencevia reserve + fill and the number of distinct buffers allocated in 100000 iterations of a parallel for loop. - Time for creating
parlay::sequencevia its constructor and the number of distinct buffers allocated in 100000 iterations of a parallel for loop. - Time for creating
std::vectorvia its constructor and the number of distinct buffers allocated in 100000 iterations of a parallel for loop.
#include <iostream>
#include <vector>
#include <chrono>
// #include "infra/infra.hpp"
#include <parlay>
int main() {
int ntab = 100000;
auto now = std::chrono::steady_clock::now();
auto ptrs1 = parlay::tabulate(ntab, [&](int i)->auto {
parlay::sequence<double> a; a.reserve(20000); std::fill(a.begin(), a.end(), 0);
parlay::sequence<double> b; b.reserve(20000); std::fill(b.begin(), b.end(), 0);
return std::pair(a.data(), b.data());
});
std::cout << "Reserve and fill, time cost = " << std::chrono::duration_cast<
std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
std::cout << "\nNumber of distinct vectors allocated = " <<
parlay::remove_duplicates(parlay::slice(
&ptrs1.front().first, &ptrs1.back().second + 1)).size() << "\n\n";
decltype(ptrs1)().swap(ptrs1);
now = std::chrono::steady_clock::now();
auto ptrs2 = parlay::tabulate(ntab, [&](int i)->auto {
parlay::sequence<double> a(20000, 0);
parlay::sequence<double> b(20000, 0);
return std::pair(a.data(), b.data());
});
std::cout << "Plain initialization, time cost = " << std::chrono::duration_cast<
std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
std::cout << "\nNumber of distinct vectors allocated = " <<
parlay::remove_duplicates(parlay::slice(
&ptrs2.front().first, &ptrs2.back().second + 1)).size() << "\n\n";
decltype(ptrs2)().swap(ptrs2);
now = std::chrono::steady_clock::now();
auto ptrs3 = parlay::tabulate(ntab, [&](int i)->auto {
std::vector<double> a(20000, 0);
std::vector<double> b(20000, 0);
return std::pair(a.data(), b.data());
});
std::cout << "Use std::vector, time cost = " << std::chrono::duration_cast<
std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
std::cout << "\nNumber of distinct vectors allocated = " <<
parlay::remove_duplicates(parlay::slice(
&ptrs3.front().first, &ptrs3.back().second + 1)).size() << "\n\n";
decltype(ptrs3)().swap(ptrs3);
}
Compiling with g++ -std=c++20 -O2 -lpthread -march=native code/cpp/parlayTest.cpp -o test, a sample printout can be:
Reserve and fill, time cost = 59335
Number of distinct vectors allocated = 28
Plain initialization, time cost = 154726
Number of distinct vectors allocated = 49
Use std::vector, time cost = 50010
Number of distinct vectors allocated = 34
The numbers 28 and 49 will change in different runs.
What surprised me a little bit is that (i) parlay::sequence() is substantially slower than reserve + fill and triggers more new allocations, (ii) using parlay::sequence is slower than using std::vector.
I have long believed in the superiority of parlay's memory allocation, but this example says otherwise. Is there a simple explanation?
What's the rule of thumb for allocating sequences within a parallel for loop using parlay?
Thanks a lot!