dione
dione copied to clipboard
No need for caching in sorted-iterator
Summary
In #71 we changed the block order. while checking the full iteration on Object Store (S3), we saw it is amazingly slow.
What happened is that it switched to Random IO mode:
.. 15:42:50,084 INFO fs.s3a.S3AInputStream - Switching to Random IO seek policy
Detailed Description
After debugging it looks like it happened because we try to seek back compared to the "natural" readahead buffer.
Luckily, now that we changed the block order we can just read the file sequentially without hopping at all. So just removing the seeking part works!
How was it tested?
regular unit tests
What... How is that possible? When you reach a leaf block, you have to go back and seek elsewhere, no?
I was also surprised that it just worked, but what happens is just that the parent blocks are saved in memory, and every time you need to read the next block from the file it is always the "correct" block as we saved it pre-order.
Wow
Are the parents explicitly cached by us? Do we make sure to uncache them? Just making sure there is no memory leak or something
good point. I actually thought about that and forgot to check. but it looks like we're good. the sorted-iterator only has one pointer to Node next and the Node object only has one pointer to parent. So we will maximum log(n) nodes in memory.
BTW, I also need to verify this "Random IO" in the "search" iterator (get()
).
ok.. so after looking at get(key)
, I saw that we can hop backwards there if we were calling get() multiple times.
So I added assertion to allow only bigger keys, and added functionality to reuse the current state.
Is this ready for release? Is it good for all three of our versions?