pisa
Improve block codecs
Hi everyone! I bring up for discussion some possible improvements that could be made to the block codecs.
Interpolative
Starting here:
https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/block_codecs.hpp#L151
Why not just use `in` instead of `inbuf`?
On the other hand, in the next line:
https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/block_codecs.hpp#L152
Could this be improved by using `std::numeric_limits<uint32_t>::max()` instead of `uint32_t(-1)`?
Simple16
Given that 28 is the maximum batch size, I think that the `buf` size could be reduced to `4 * block_size + block_size % 28` instead of `2 * 8 * block_size`:
https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/simple16.hpp#L14
Although the buffer size is not a problem for current memory sizes, if you want to use a `block_size` greater than 128, the memory saving becomes more substantial. The `Simple8b` codec could be improved in a similar way.
Again on `Simple16`: in the decoding, a `memcpy` call could be used instead of a raw loop:
https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/simple16.hpp#L30
Alternatively, the `std::copy` function could be used, but it may be slower than `memcpy`, as in https://lemire.me/blog/2020/01/20/filling-large-arrays-with-zeroes-quickly-in-c/.
MaskedVByte
Although 4 bytes is the maximum size of a raw integer, the `MaskedVByte` codec requires a continuation bit for each byte. Therefore, the maximum size of an integer encoded with it is 5 bytes, and `buf` could be redefined as `5 * block_size`:
https://github.com/pisa-engine/pisa/blob/35dd03b73506d2089686a4a211e38898671b425c/include/pisa/codec/maskedvbyte.hpp#L21
In the same way, the buffers of the `VarintGB` and `VarintG8IU` codecs could be resized to `4 + 1` bytes per integer, assuming (for simplicity) a 1-byte descriptor for each integer. Currently, the buffer size for these codecs is also `2 * 4 * block_size`.
On another subtopic: after some quick tests, I've also found that using `thread_local` adds a slight speed penalty. Could that be the case? If so, is its use essential when decoding?
I haven't looked into all of them yet, but a few points:

- Since the block size is always 128, I'd replace `thread_local std::vector<uint8_t>` with `constexpr std::array<uint8_t, size>`, which will also require doing `static constexpr uint64_t block_size = 128;`.
- I don't see the reason for `inbuf` either.
- I totally agree about `numeric_limits`.
- I agree with removing the raw loop, but we should just use `std::copy`. We should measure just in case, but I'm sure it will be as fast as can be. There are a few problems with that post, which some people also point out in the comments. In short, the compilation was done with `-O2`, and there was an implicit type cast that at the time prevented GCC from optimizing it away (you can see how that can happen: if you try to copy, say, 16-bit integers into an array of 32-bit integers, then you can't just `memset` as is). From what I see in the comments, even that has already been resolved. Anyway, we should have absolutely no issues copying between two arrays of the same type with `std::copy`. That said, we should test if there's any difference.
@amallia can you weigh in on these?
> On another subtopic: after some quick tests, I've also found that using `thread_local` adds a slight speed penalty. Could that be the case? If so, is its use essential when decoding?
@gustingonzalez I can think of two ways it could affect performance. (1) Because we are using dynamic memory, it is possible that the compiler implements it as a lazy initialization, which would add a check each time. (2) Even if it were not dynamic, a thread-local array would still be somewhere farther from the top of the execution stack, which may affect caching. On both accounts, it's hard to say without running some experiments.
I just submitted a PR: https://github.com/pisa-engine/pisa/pull/492 As far as I can tell, there's nothing there that could slow anything down; however, I did change `vector` to `array` in the Simple codecs. This could potentially make it faster due to (1) from above.
I wasn't able to run any benchmarks at the moment. If you could check performance, that would help a lot.
There is nothing about the algorithm that makes `thread_local` necessary. The idea, as far as I can tell (I didn't write this code myself), is that we don't want to repeatedly allocate memory. If we change it to `array`, we won't have to allocate, so we might be able to remove it. But I don't know how that would affect caching and inlining. We would have to measure.
To be honest, I don't see how just removing `thread_local` in the current implementation (with `vector`) can improve performance, but maybe I'm missing something...
FYI @JMMackenzie @amallia
I performed some experiments on CW12B. I used a CIFF query-only index exported to PISA, then ran BMW with and without the changes in the related PR. I used Web Track queries from 12-14. For both the baseline (`master`) and the changed version (`improve-block-codecs`), I ran all queries 20 times and recorded the results. I only ran Simple16 at this time, but the changes I made for Simple8b are almost identical.
It seems that the changes I made do in fact improve performance. I'm guessing it's because `thread_local` on a heap-allocated variable was doing some sort of lazy initialization, which could have slowed it down, plus it was on the heap as opposed to on the stack as it is now.
Here is a box-and-whiskers plot for different quantiles (and average):
And zoomed in on the average:
I did this on my desktop, but I have to say that given my reasoning and those results, I am fairly convinced that this change at least won't slow us down.
I would suggest implementing fixed-length `std::array` buffers for the rest of the codecs mentioned in this thread. I will want to run the same test for those as well eventually.
@amallia @JMMackenzie let me know what you think about all this.