cppitertools
cppitertools copied to clipboard
New iterator released: shuffled
Allows iteration over a sequence in shuffled order. Randomization released through Linear Feedback Shift Register. Additional convinient feature - ability to restore iterator state with zero cost (not present in README - see tests).
Typo in README code example - sorted not replaced with shuffled.
I like the idea but the implementation seems pretty intense and not the most obvious so I'm having a bit of trouble figuring out why. What are the advantages of all this over creating a sequence of iterators into the container in the ShuffledView constructor, and then std::shuffle()ing it?
It's zero-cost by memory. Every step is just some simple asm operations (mask, shift). If you use std::shuffle - you must have enough memory to store all shuffled set. In the presented case we can shuffle, for example, very big files line-by-line, or rows in big SQL tables (...if we don't care about speed of random seeking through it. If we do - we can use Mixed Product - it's sequential). Off course if all shuffled data already are in the memory - we have random-access iterators (through dumb_advance), so shuffling will be as fast as std::shuffle. I use this method, for example, to shuffle through whole IPv4 address space (with ability to resume) on the machine with 512MB of memory.
Well that does sound pretty cool. Can I see the IPv4 shuffling code?
In this test I use IPv4 pseudo container and shuffle through it: https://github.com/hoxnox/rators/blob/devel/tests/test_shufipv4.hpp
rators project will be overwritten to use cppitertools::suffled and cppitertools::mixed_product soon
IPv4 shuffling with help of cppitertools and https://github.com/hoxnox/iptools with zero memory cost:
for (auto i : iter::shuffled(cidr_v4("0.0.0.0/0")))
void(0);
I'm sorry this is taking me so long to get through. I'm working on just language stuff for the most part at this time. You have a point in your test file to make sure "shuffled not store container inside" but it needs to in the case of an rvalue
auto s = iter::shuffle(std::vector<int>{1, 2, 3});
// okay now s has an iterator into that vector
// but the vector temporary has already been destroyed
auto it = std::begin(s);
*it; // use-after-free
Why did you make a point of this? I've made the change on my checkout on the branch but I want to make sure I'm not missing something.
Why do we need the check at the top of operator++ to see if *this == owner->end()?
What is being gained by having Distance be a template when we're aiming to deduce in the function calls?
- I've copied tests from
sorted. The testmoves rvalues and binds to lvalueshadn't passed "as is" becauseshuffleddoesn't store container inside. So I made a note about it. We can just delete this test. - We need to stay at
end()whenoperator++applies toend(). Yes, technically we can shift LSFR with state 0 and it will stay at 0, so we doesn't need to check if*this == owner->end(). - We need to store the size of the container. So we need to know the type. We can use
std::iterator_traits::difference_type, but it will impose additional requirements to iterators. I haven't enough experience to make decision how to do it better.
- I'm saying it needs to store the container because it is currently callable with a temporary. Try running the example under valgrind to see why. This was an issue with itertools a couple of years ago and has since been fixed.
- The check isn't necessary in that case because it's UB to increment an iterator passed the
end. Nothing else has this check. - I get that, but why template it if it's always
size_tanyway?
You convinced me. =)
I tested the code in 2 projects. It successfully runs in production under Gentoo-amd64, Debian-jessie-x64 and Debian-jessie-i386.