cppitertools icon indicating copy to clipboard operation
cppitertools copied to clipboard

New iterator released: shuffled

Open hoxnox opened this issue 9 years ago • 11 comments

Allows iteration over a sequence in shuffled order. Randomization released through Linear Feedback Shift Register. Additional convinient feature - ability to restore iterator state with zero cost (not present in README - see tests).

hoxnox avatar May 11 '16 12:05 hoxnox

Typo in README code example - sorted not replaced with shuffled.

hoxnox avatar May 11 '16 15:05 hoxnox

I like the idea but the implementation seems pretty intense and not the most obvious so I'm having a bit of trouble figuring out why. What are the advantages of all this over creating a sequence of iterators into the container in the ShuffledView constructor, and then std::shuffle()ing it?

ryanhaining avatar May 15 '16 19:05 ryanhaining

It's zero-cost by memory. Every step is just some simple asm operations (mask, shift). If you use std::shuffle - you must have enough memory to store all shuffled set. In the presented case we can shuffle, for example, very big files line-by-line, or rows in big SQL tables (...if we don't care about speed of random seeking through it. If we do - we can use Mixed Product - it's sequential). Off course if all shuffled data already are in the memory - we have random-access iterators (through dumb_advance), so shuffling will be as fast as std::shuffle. I use this method, for example, to shuffle through whole IPv4 address space (with ability to resume) on the machine with 512MB of memory.

hoxnox avatar May 15 '16 20:05 hoxnox

Well that does sound pretty cool. Can I see the IPv4 shuffling code?

ryanhaining avatar May 16 '16 06:05 ryanhaining

In this test I use IPv4 pseudo container and shuffle through it: https://github.com/hoxnox/rators/blob/devel/tests/test_shufipv4.hpp

rators project will be overwritten to use cppitertools::suffled and cppitertools::mixed_product soon

hoxnox avatar May 16 '16 06:05 hoxnox

IPv4 shuffling with help of cppitertools and https://github.com/hoxnox/iptools with zero memory cost:

    for (auto i : iter::shuffled(cidr_v4("0.0.0.0/0")))
        void(0);

hoxnox avatar May 18 '16 11:05 hoxnox

I'm sorry this is taking me so long to get through. I'm working on just language stuff for the most part at this time. You have a point in your test file to make sure "shuffled not store container inside" but it needs to in the case of an rvalue

auto s = iter::shuffle(std::vector<int>{1, 2, 3});
// okay now s has an iterator into that vector
// but the vector temporary has already been destroyed
auto it = std::begin(s);
*it; // use-after-free

Why did you make a point of this? I've made the change on my checkout on the branch but I want to make sure I'm not missing something.

Why do we need the check at the top of operator++ to see if *this == owner->end()?

What is being gained by having Distance be a template when we're aiming to deduce in the function calls?

ryanhaining avatar May 21 '16 23:05 ryanhaining

  1. I've copied tests from sorted. The test moves rvalues and binds to lvalues hadn't passed "as is" because shuffled doesn't store container inside. So I made a note about it. We can just delete this test.
  2. We need to stay at end() when operator++ applies to end(). Yes, technically we can shift LSFR with state 0 and it will stay at 0, so we doesn't need to check if *this == owner->end().
  3. We need to store the size of the container. So we need to know the type. We can use std::iterator_traits::difference_type, but it will impose additional requirements to iterators. I haven't enough experience to make decision how to do it better.

hoxnox avatar May 22 '16 08:05 hoxnox

  1. I'm saying it needs to store the container because it is currently callable with a temporary. Try running the example under valgrind to see why. This was an issue with itertools a couple of years ago and has since been fixed.
  2. The check isn't necessary in that case because it's UB to increment an iterator passed the end. Nothing else has this check.
  3. I get that, but why template it if it's always size_t anyway?

ryanhaining avatar May 22 '16 18:05 ryanhaining

You convinced me. =)

hoxnox avatar May 23 '16 04:05 hoxnox

I tested the code in 2 projects. It successfully runs in production under Gentoo-amd64, Debian-jessie-x64 and Debian-jessie-i386.

hoxnox avatar Jun 01 '16 12:06 hoxnox