
[Alphabet] seqan3::concatenated_sequences benchmark

smehringer opened this issue · 7 comments

Description

seqan3::concatenated_sequences is a container that is supposed to be more efficient when working with "static" input data (many read operations, few to no write operations). In contrast to a std::vector<std::vector<...>>, where each inner vector does its own allocation, seqan3::concatenated_sequences stores the data in contiguous memory. In theory, this leads to fewer cache misses when accessing the data structure, and access should hence be faster.

But we have never tested this advantage.

https://github.com/seqan/seqan3/pull/2947 introduced member functions for efficient push_back. The efficiency of push_back should also be tested.
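For illustration, a minimal usage sketch of the two layouts being compared (contents, names, and the include path are placeholders):

#include <string>
#include <vector>

#include <seqan3/utility/container/concatenated_sequences.hpp> // include path may differ between versions

int main()
{
    std::vector<std::string> nested{"ACGT", "GAGGA", "TTTT"};   // each inner string manages its own buffer
    seqan3::concatenated_sequences<std::string> concatenated{}; // one shared character buffer plus offsets

    for (std::string const & sequence : nested)
        concatenated.push_back(sequence); // copies the characters into the shared buffer
}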

Tasks

General Considerations

Choose some appropriate size(s) (inspiration). If the whole data structure fits into the CPU cache, we cannot see how fast "actual memory" access is. Since the "big" benchmarks might take quite a while in Debug (Nightlies), you can set different sizes for Release (used for actual benchmarking) and Debug (Nightlies). For an example, see how we did it in the format_sam_benchmark.
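A minimal sketch of that mechanism (the constants are placeholders; Release builds define NDEBUG, Debug builds do not):

#include <cstddef>

#ifdef NDEBUG // Release: large sizes for the actual benchmark numbers
static constexpr std::size_t total_size = 8 * 1024 * 1024;
#else // Debug: small sizes so the Nightlies finish quickly
static constexpr std::size_t total_size = 1000;
#endif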

Benchmarking Access

For both std::vector<seqan3::dna4> and std::string (=underlying_container_t), write a benchmark that measures many (random-access) read operations (see the sketch after this list)

  • [ ] on a seqan3::concatenated_sequences<underlying_container_t>
  • [ ] on a std::vector<underlying_container_t>
  • [ ] on the SeqAn2 equivalent¹
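A hedged Google Benchmark sketch for the first two bullets (include path, sizes, and names are assumptions; the SeqAn2 variant is left out):

#include <cstddef>
#include <random>
#include <string>
#include <vector>

#include <benchmark/benchmark.h>

#include <seqan3/utility/container/concatenated_sequences.hpp> // include path may differ between versions

template <typename container_t>
void random_access(benchmark::State & state)
{
    std::size_t const sequence_count{1000};
    std::size_t const sequence_length{100 * 1024}; // 1000 x 100 KiB ~ 100 MiB, i.e. well beyond the caches

    container_t container{};
    for (std::size_t i = 0; i < sequence_count; ++i)
        container.push_back(std::string(sequence_length, 'A')); // set-up, not timed

    std::mt19937_64 generator{42};
    std::uniform_int_distribution<std::size_t> distribution{0, sequence_count - 1};

    for (auto _ : state) // timed: one random read per iteration
        benchmark::DoNotOptimize(container[distribution(generator)][0]);
}

BENCHMARK_TEMPLATE(random_access, seqan3::concatenated_sequences<std::string>);
BENCHMARK_TEMPLATE(random_access, std::vector<std::string>);

BENCHMARK_MAIN();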

Benchmarking Appending

For both std::vector<seqan3::dna4> and std::string (=underlying_container_t), write a benchmark that measures many push_back operations

  • [ ] on a seqan3::concatenated_sequences<underlying_container_t>² - the old way before #2947
  • [ ] on a seqan3::concatenated_sequences<underlying_container_t>³ - the new way after #2947
  • [ ] on a std::vector<underlying_container_t> - should accomplish the same task as shown in ² and ³ (see the sketch after this list)
  • [ ] #2977
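A hedged sketch of the std::vector variant from the third bullet, mirroring the loop bodies shown in ² and ³ (names are placeholders):

std::string const input{"This is some (possibly large) input."};
std::vector<std::string> output{};

for (/*benchmark loop*/)
{
    std::string buffer{};

    buffer += input;
    buffer += '|';
    buffer += input;
    buffer += '|';
    buffer += input;

    output.push_back(std::move(buffer));
}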

Footnotes

¹ Notes regarding SeqAn2 (see the sketch after these notes)
  • Known equivalent in SeqAn2: seqan::StringSet<TString, seqan::Owner<seqan::ConcatDirect>>
  • Check whether there are other SeqAn2 specialisations that are equivalent
  • std::string equivalent in SeqAn2 is seqan::CharString
  • std::vector<seqan3::dna4> equivalent in SeqAn2 is seqan::DnaString
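A hedged usage sketch of the presumed SeqAn2 counterpart (the exact SeqAn2 API should be double-checked against its documentation):

#include <seqan/sequence.h>

int main()
{
    // StringSet with ConcatDirect owner: all characters live in one concatenated string.
    seqan::StringSet<seqan::DnaString, seqan::Owner<seqan::ConcatDirect<>>> sequences;
    seqan::appendValue(sequences, seqan::DnaString{"ACGT"}); // roughly the counterpart of push_back
}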
² Example code for old way
std::string const input{"This is some (possibly large) input."};
seqan3::concatenated_sequences<std::string> output{};

for (/*benchmark loop*/)
{
    std::string buffer{};

    std::ranges::copy(input, std::back_inserter(buffer));
    buffer += '|';
    std::ranges::copy(input, std::back_inserter(buffer));
    buffer += '|';
    std::ranges::copy(input, std::back_inserter(buffer));

    output.push_back(buffer);
}
³ Example code for new way
std::string const input{"This is some (possibly large) input."};
seqan3::concatenated_sequences<std::string> output{};

for (/*benchmark loop*/)
{
    output.push_back();
    output.last_append(input);
    output.last_push_back('|');
    output.last_append(input);
    output.last_push_back('|');
    output.last_append(input);
}

smehringer · Mar 21 '22

Note that you need to get out of the realm of CPU caches for the benchmark to be meaningful!

h-2 · Mar 21 '22

@h-2 Is that something we have to do in every single benchmark (like this one), or is it some overall adaptation?

MitraDarja · Mar 21 '22

While you were commenting, I was editing the issue – I thought the same and already added a few sentences.

However, I'm not so knowledgeable as to what cache level I need to escape — surely L1 and L2; what about L3?

And what would be good sizes to securely achieve this? We should be good with 8 MiB for L2, right?

@h-2 do you know off the top of your head?

Otherwise, I would just consider it a research task for this issue :grin:

eseiler · Mar 21 '22

Is that something we have to do in every single benchmark (like this one), or is it some overall adaptation?

I am not sure. Maybe try with sizes similar to the other benchmarks first and see if that produces interesting results or not. I think there were plans for "macro-benchmarks" at some point and this could be part of one, but that maybe does not have to happen now.

h-2 · Mar 21 '22

Otherwise, I would just consider it a research task for this issue

The cache sizes vary a lot. I just wanted to point out that I have seen very significant speed improvements with the change. In the range of 50% speed-up for overall runtime of my program. And the concat_sequences can be 100s of MBs in my program, so I don't know if similar speed-ups are visible when everything fits in the cache.

edit: the reported 50% speed-gain is for switching from vector-of-vector to concat_seqs, so definitely not just because of the change.

h-2 · Mar 21 '22

Otherwise, I would just consider it a research task for this issue

The cache sizes vary a lot. I just wanted to point out that I have seen very significant speed improvements with the change. In the range of 50% speed-up for overall runtime of my program. And the concat_sequences can be 100s of MBs in my program, so I don't know if similar speed-ups are visible when everything fits in the cache.

I'm sure it's faster, but we need a benchmark at the very least for regression testing :D

I would just test a tiny size (let's say 1000) and a big size (8*1024*1024) and see if there are significant differences in the timing for a fixed number of random accesses. If yes, we probably avoided staying in cache constantly. After having verified this, we can use 1000 or so for Debug mode, and 8*1024*1024 or so for Release mode.
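A hedged, self-contained sketch of that sanity check; the plain char buffer and names are placeholders, the real benchmark would use the containers from the issue:

#include <cstddef>
#include <random>
#include <vector>

#include <benchmark/benchmark.h>

// One random read per iteration from a buffer whose size is the benchmark argument;
// comparing the per-iteration times of Arg(1000) and Arg(8 * 1024 * 1024) shows whether we left the caches.
void random_reads(benchmark::State & state)
{
    std::vector<char> buffer(static_cast<std::size_t>(state.range(0)), 'A');

    std::mt19937_64 generator{42};
    std::uniform_int_distribution<std::size_t> distribution{0, buffer.size() - 1};

    for (auto _ : state)
        benchmark::DoNotOptimize(buffer[distribution(generator)]);
}

BENCHMARK(random_reads)->Arg(1000)->Arg(8 * 1024 * 1024);

BENCHMARK_MAIN();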

eseiler · Mar 21 '22

I would just test a tiny size (let's say 1000) and a big size (8*1024*1024) and see if there are significant differences in the timing for a fixed number of random accesses. If yes, we probably avoided staying in cache constantly.

If you are benchmarking random-access reads (unrelated to the changes from my PR), I would actually recommend creating a large sequence (say 100 MB total, made of 1000 subsequences of 100 KB each) and reading from that. Creating the sequence can be done before the benchmark loop.

h-2 · Mar 21 '22