
Improve block codecs

Open gustingonzalez opened this issue 4 years ago • 3 comments

Hi everyone! I'd like to bring up for discussion some possible improvements to the block codecs.

Interpolative

Starting here: https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/block_codecs.hpp#L151 Why not just use `in` instead of `inbuf`?

On the other hand, in the next line: https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/block_codecs.hpp#L152 Could there be an improvement in using `std::numeric_limits<uint32_t>::max()` instead of `uint32_t(-1)`?

Simple16

Given that 28 is the maximum batch size, I think the `buf` size could be reduced to `4 * block_size + block_size % 28` instead of `2 * 8 * block_size`: https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/simple16.hpp#L14 Although the buffer size is not a problem for current memory sizes, if you want to use a `block_size` greater than 128, the memory saving becomes more substantial. The Simple8b codec could be improved in a similar way.

Again on Simple16, in the decoding a `memcpy` could be used instead of a raw loop: https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/simple16.hpp#L30 Alternatively, `std::copy` could be used, though it may be slower than `memcpy`, as discussed in https://lemire.me/blog/2020/01/20/filling-large-arrays-with-zeroes-quickly-in-c/.

MaskedVByte

Although 4 bytes is the maximum size of a raw integer, the MaskedVByte codec requires a continuation bit in each byte, so the maximum size of an integer encoded with it is 5 bytes. Therefore, `buf` could be redefined as `5 * block_size`: https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/maskedvbyte.hpp#L21

In the same way, the buffers of the VarintGB and VarintG8IU codecs could be resized to 4 + 1 bytes per integer, assuming (for simplicity) a 1-byte descriptor for each integer. Currently, the buffer size for these codecs is also `2 * 4 * block_size`.

gustingonzalez avatar Jun 04 '20 19:06 gustingonzalez

On another subtopic, after some quick tests I've also found that using `thread_local` adds a slight speed penalty — could that be the case? If so, is its use essential when decoding?

gustingonzalez avatar Jun 04 '20 20:06 gustingonzalez

I haven't looked into all of them yet, but a few points:

  1. Since the block size is always 128, I'd replace `thread_local std::vector<uint8_t>` with `constexpr std::array<uint8_t, size>`, which will also require declaring `static constexpr uint64_t block_size = 128;`.
  2. I don't see the reason for `inbuf` either.
  3. I totally agree about `numeric_limits`.
  4. I agree with removing the raw loop, but we should just use `std::copy`. We should measure just in case, but I'm sure it will be as fast as it can be. There are a few problems with that post, which some people also point out in the comments. In short, the compilation was done with -O2, and there was also an implicit type cast that, at the time, prevented gcc from optimizing it away (you can see how that can happen: if you try to copy, say, 16-bit integers into an array of 32-bit integers, you can't just memset as-is). From what I see in the comments, even that has since been resolved. Anyway, we should have absolutely no issues copying between two arrays of the same type with `std::copy`. That said, we should test whether there's any difference.

elshize avatar Jun 04 '20 20:06 elshize

@amallia can you weigh in on these?

elshize avatar Jun 08 '20 12:06 elshize

> On another subtopic, after some quick tests I've also found that using `thread_local` adds a slight speed penalty — could that be the case? If so, is its use essential when decoding?

@gustingonzalez I can think of two ways it could affect performance. (1) Because we are using dynamic memory, it is possible that the compiler implements it with lazy initialization, which would add a check on each access. (2) Even without dynamic memory, a thread-local array would still live somewhere farther from the top of the execution stack, which may affect caching. On both counts, it's hard to say without running some experiments.

I just submitted a PR: https://github.com/pisa-engine/pisa/pull/492 As far as I can tell, there's nothing there that could slow anything down; however, I did change vector to array in the Simple codecs. This could potentially make it faster due to (1) from above.

I wasn't able to run any benchmarks at the moment. If you could check performance, that would help a lot.

There is nothing about the algorithm that makes thread-local storage necessary. The idea, as far as I can tell (I didn't write this code myself), is that we don't want to repeatedly allocate memory. If we change it to an array, we won't have to allocate, so we might be able to remove it. But I don't know how that would affect caching and inlining. We would have to measure.

To be honest, I don't see how just removing `thread_local` in the current implementation (with vector) could improve performance, but maybe I'm missing something...

FYI @JMMackenzie @amallia

elshize avatar Nov 20 '22 02:11 elshize

I performed some experiments on CW12B. I used a CIFF query-only index, exported it to PISA, then ran BMW with and without the changes in the related PR. I used Web Track queries from 12-14. For both the baseline (master) and the changed version (improve-block-codecs), I ran all queries 20 times and recorded the results. I only ran Simple16 at this time, but the changes I made for Simple8b are almost identical.

It seems that the changes I made do in fact improve performance. I'm guessing it's because `thread_local` on a heap-allocated variable was doing some sort of lazy initialization, which could have slowed it down, plus it was on the heap as opposed to on the stack, as it is now.

Here is a box-and-whiskers plot for different quantiles (and average):

[figure: box-and-whiskers plot of query latencies for master vs. improve-block-codecs]

And zoomed in on the average:

[figure: same plot zoomed in on the average]

I did this on my desktop, but I have to say that, given my reasoning and these results, I am fairly convinced that this change at least won't slow us down.

I would suggest using fixed-length `std::array` buffers for the rest of the codecs mentioned in this thread as well. I will want to run the same test for those eventually.

@amallia @JMMackenzie let me know what you think about all this.

elshize avatar Nov 20 '22 20:11 elshize