Seqlock icon indicating copy to clipboard operation
Seqlock copied to clipboard

Why no fetch_add ?

Open X-Ryl669 opened this issue 5 years ago • 12 comments

Instead of 2 stores, why don't you fetch_add(1) twice ? Also, the initial code would break if 2 writers were to write at the same time since you don't check the result of the first load. Typically, to be more safe, the (pseudo) code should be

write(const T & value)
{
    int seq = counter.load(relaxed);
    while (seq.CASweak(seq, seq+1, acq_rel));
    if (seq & 1)
    {  // Need to wait with a CAS until it's even again }
    copy
    counter.fetch_add(1, release);
}

But doing so make it likely much slower.

X-Ryl669 avatar Sep 13 '18 16:09 X-Ryl669

Actually c++11 atomic is an overkill, fetch_add is an overkill, and even normal compiler memory fence is an overkill, if you read Chinese you can check this post for the reason.

So IMO, the most concise and efficient Seqlock implementation could look like this:

template<typename T>
class Seqlock {
public:
    T load() const {
        T copy;
        std::size_t seq0, seq1;
        do {
            asm volatile("" : "=m"(seq_) : :); // force read memory
            seq0 = seq_;
            asm volatile("" : "=m"(seq_), "=m"(value_) : :); // memory fence
            copy = value_;
            asm volatile("" : "=m"(seq_), "=m"(value_) : :); // memory fence
            seq1 = seq_;
        } while(seq0 != seq1 || seq0 & 1);
        return copy;
    }

    void store(const T& desired) {
        seq_++;
        asm volatile("" : : "m"(seq_), "m"(value_) :); // memory fence
        value_ = desired;
        asm volatile("" : : "m"(seq_), "m"(value_) :); // memory fence
        seq_++;
        asm volatile("" : : "m"(seq_)); // force write memory
    }

private:
    // actually there's no need to align with cacheline size as seq and value are always updated together
    std::size_t seq_ = 0;
    T value_;
}

MengRao avatar Sep 25 '18 04:09 MengRao

@X-Ryl669

  1. That will work, but it as a stronger operation than needed and will be slower.
  2. Only a single writer is supported.

rigtorp avatar Oct 10 '18 03:10 rigtorp

@MengRao

You're solution will work on TSO memory model, it will not work for ARM. Even for amd64 your solution and mine will not work if the assignment operator uses non-temporal stores or string instructions. See Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide section 8.2.2

I think the generic solution is to require T to be trivially copyable and use a version of memcpy that is guaranteed to not use non-temporal stores.

In any case it is currently impossible to implement seqlocks under C++11 memory model in a way that is portable to all CPU architectures.

The goal of this project was to try and express this concurrency primitive in portable C++11, conclusion is that it's not possible.

rigtorp avatar Oct 10 '18 03:10 rigtorp

@MengRao It's also needed to align and pad the seqlock to avoid false sharing. If you have array/vector of seqlocks with multiple writer threads for different seqlocks you need to avoid false sharing.

rigtorp avatar Oct 10 '18 04:10 rigtorp

@MengRao It's also needed to align and pad the seqlock to avoid false sharing. If you have array/vector of seqlocks with multiple writer threads for different seqlocks you need to avoid false sharing.

You're right, so putting an "alignas(64/128)" before the first member variable is enough, and the padding is automatic(you can check sizeof(Seqlock)).

MengRao avatar Oct 10 '18 04:10 MengRao

@rigtorp It's portable, as long as you're using atomic increment, CAS or, at least, swap. But doing so, you'll see that there's no need to increment anything twice, and you can do as any other spinlock:

struct Lock
{
    atomic<bool> l;
   void acquire()  { while (l.swap(true)); }
    void release() { l.store(false, release);  }
};

Since reading is not wait-free, the only advantage of seqlock is to have wait free writing. Provided that only a single writer is allowed, you can replace this with a spinlock that's only taken by the writer and another one that's taken by both (R/W). Typically the scheme is: For reading:

  1. Increment reader atomic count
  2. If writer lock is free 2.1 read the structure 2.2 Decrement reader atomic count 2.3 Done
  3. Else, spin on the lock (via CAS), goto 2

For writer:

  1. Take the writer lock
  2. Check reader counter (atomic load)
  3. If non 0, spin on the the reader counter until it's 0
  4. Modify protected object
  5. Release writer lock

Point 3 is not wait-free, but fortunately, it's short so it should be ok.

X-Ryl669 avatar Oct 10 '18 08:10 X-Ryl669

@MengRao It's also needed to align and pad the seqlock to avoid false sharing. If you have array/vector of seqlocks with multiple writer threads for different seqlocks you need to avoid false sharing.

You're right, so putting an "alignas(64/128)" before the first member variable is enough, and the padding is automatic(you can check sizeof(Seqlock)).

Yes but support for alignas with over alignment (>8 bytes on x86) is a very recent addition to compilers (required by C++17 i think). On old compilers it will just be ignored.

rigtorp avatar Oct 11 '18 04:10 rigtorp

@X-Ryl669 Yes it's possible to implement other types of locks, but then it's no longer a seqlock. The whole point here was to implement seqlocks.

rigtorp avatar Oct 11 '18 05:10 rigtorp

And the advantage of a seqlock over e.g. a reader/writer lock or the lock design above is that with the seqlock, readers don't have to modify some shared location which would otherwise create a scalability bottleneck. This is a very useful property when you have shared data structures that are often read (by many threads) and seldom written. Networking has many such data structures.

WonderfulVoid avatar Jan 18 '19 19:01 WonderfulVoid

Even for amd64 your solution and mine will not work if the assignment operator uses non-temporal stores or string instructions.

will this implementation work on x86_64 or only on x86 ??

asdzxcv avatar Jan 17 '21 16:01 asdzxcv

will this implementation work on x86_64 or only on x86 ?? Both

rigtorp avatar Feb 02 '21 20:02 rigtorp

@rigtorp

In any case it is currently impossible to implement seqlocks under C++11 memory model in a way that is portable to all CPU architectures.

Just to be clear you said that, meaning, "... (without having to submit a stronger fence like seq_cst)", because that'd emit a stronger (and suboptimal) fence in x86 like lock prefix, right?

Deanseo avatar Apr 09 '24 11:04 Deanseo