`parlay::sequence` constructor much slower than reserve + fill, and both slower than `std::vector`

Open WhateverLiu opened this issue 7 months ago • 4 comments

The following code measures

Time for creating parlay::sequence via reserve + fill and the number of distinct buffers allocated in 100000 iterations of a parallel for loop.
Time for creating parlay::sequence via its constructor and the number of distinct buffers allocated in 100000 iterations of a parallel for loop.
Time for creating std::vector via its constructor and the number of distinct buffers allocated in 100000 iterations of a parallel for loop.

#include <iostream>
#include <vector>
#include <chrono>
// #include "infra/infra.hpp"
#include <parlay>

int main() {
  int ntab = 100000;
  auto now = std::chrono::steady_clock::now();
  auto ptrs1 = parlay::tabulate(ntab, [&](int i)->auto {
    parlay::sequence<double> a; a.reserve(20000); std::fill(a.begin(), a.end(), 0);
    parlay::sequence<double> b; b.reserve(20000); std::fill(b.begin(), b.end(), 0);
    return std::pair(a.data(), b.data());
  });
  std::cout << "Reserve and fill, time cost = " << std::chrono::duration_cast<
    std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
  std::cout << "\nNumber of distinct vectors allocated = " << 
  parlay::remove_duplicates(parlay::slice(
      &ptrs1.front().first, &ptrs1.back().second + 1)).size() << "\n\n";
  decltype(ptrs1)().swap(ptrs1);
  
  
  now = std::chrono::steady_clock::now();
  auto ptrs2 = parlay::tabulate(ntab, [&](int i)->auto {
    parlay::sequence<double> a(20000, 0);
    parlay::sequence<double> b(20000, 0);
    return std::pair(a.data(), b.data());
  });
  std::cout << "Plain initialization, time cost = " << std::chrono::duration_cast<
    std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
  std::cout << "\nNumber of distinct vectors allocated = " << 
    parlay::remove_duplicates(parlay::slice(
        &ptrs2.front().first, &ptrs2.back().second + 1)).size() << "\n\n";
  decltype(ptrs2)().swap(ptrs2);
  
  
  now = std::chrono::steady_clock::now();
  auto ptrs3 = parlay::tabulate(ntab, [&](int i)->auto {
    std::vector<double> a(20000, 0);
    std::vector<double> b(20000, 0);
    return std::pair(a.data(), b.data());
  });
  std::cout << "Use std::vector, time cost = " << std::chrono::duration_cast<
    std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
  std::cout << "\nNumber of distinct vectors allocated = " << 
    parlay::remove_duplicates(parlay::slice(
        &ptrs3.front().first, &ptrs3.back().second + 1)).size() << "\n\n";
  decltype(ptrs3)().swap(ptrs3);
}

Compiling with g++ -std=c++20 -O2 -lpthread -march=native code/cpp/parlayTest.cpp -o test, a sample printout can be:

Reserve and fill, time cost = 59335
Number of distinct vectors allocated = 28

Plain initialization, time cost = 154726
Number of distinct vectors allocated = 49

Use std::vector, time cost = 50010
Number of distinct vectors allocated = 34

The numbers 28 and 49 will change in different runs.

What surprised me a little bit is that (i) parlay::sequence() is substantially slower than reserve + fill and triggers more new allocations, (ii) using parlay::sequence is slower than using std::vector.

I have long believed in the superiority of parlay's memory allocation, but this example says otherwise. Is there a simple explanation?

What's the rule of thumb for allocating sequences within a parallel for loop using parlay?

Thanks a lot!

May 17 '25 07:05 WhateverLiu