pisa
pisa copied to clipboard
Disk-streamed indexing can result in non-deterministic indexes if there is insufficient /tmp storage
Describe the bug
As the title suggests; I had some issues with segfaults upon mmapping an index. It turned out that the index was being built (but incorrectly) because there was insufficient storage space on /tmp
. No errors were thrown though.
Running the indexing command with the --check
flag threw an error (demonstrating that the index built was wrong).
I will look into this in more detail soon, but am just marking it here for now.
+1, just had the same error for a while, thanks @JMMackenzie for pointing this out. Going for a machine with more /tmp space solved the problem but without that I would have no idea that was the problem.
@elshize I think we'd need to run a check somewhere here: https://github.com/pisa-engine/pisa/blob/196a6e6e0b2312fad5befa58e613be4485294aa6/include/pisa/block_freq_index.hpp#L154
I can never remember the correct way to do it, but I think we need to flush and then check for .bad()
or something like that?
@JMMackenzie yeah, something like this. Didn't have time to have a look today. I'll have to refresh my memory. There are many places like this where we should do those kind of checks tbh. We weren't very good about this. In our defense, this is way more complicated to do than it should be in C++. But it is what it is, we need to address that.
@JMMackenzie I think this should be actually resolved, because the issue is about the disk-streamed indexing, which was addressed in #484
There are more places where checks should be done, but I don't see much sense in tracking all of them here. Let's close, and we'll open specific ones (or one tracking issue) instead.