go-threads Found an interesting bug related to synchronisation of threads

Lately we have been experiencing an interesting bug in go-threads. The problem occurs when we have a lot of records and we are trying to synchornize them.

In our application happens as follows:

Device A has thread with 0 records and it tries to get a lot of records from Device B regarding Log B (for example 1000)
At the same time Device A requests Head of Log B through GetRecord effectively getting it through Bitswap. As a result of this the Head of Log B is added to Blockstore.
When Device B returns 1000 records to Device A, Device A tries to put them using putRecords. So it checks them using isKnown which looks into blockstore and sees that we already have the Head of Log B (which we got through bitswap), so it decides not to proceed with saving the records.
So the Head of Log B in Device A stays as Undef as we haven't updated it.
So ExchangeEdges will see that we have different heads and will try to get records again. Thus we will have an infinite loop of unsuccesfull record updates.

I tried to reproduce this isKnown problem on testground:

First instance creates a thread and adds 200 records
Second instance starts after (so it doesn't receive anything through pubsub or push) and pulls the thread thus calling get records.
At the same time Second instance tries to get the head through bitswap using GetRecord.
Because Second instance receives head earlier than the batch of 200 records it will effectively skip adding them and thus will not update its head.

The test is not of super good quality, but at least it sometimes works :-) Feel free to ping me in case I missed something.

https://github.com/mcrakhman/go-threads/tree/sync-bug-demonstration

To run the test type testground run composition --wait -f sync-bug-exec.toml.

When the test stops running you will see some prints in the console like: got log 12D3KooWAhDczANHAuX8JgqDofkmhZsdftzfS9Q7GTt3kvqDECbB with head b (see GetHeads method)

You can observe different behaviour in case there will be not 200 but 3 records. In this case the log of First instance will be in sync with log of Second instance.

Jun 20 '21 21:06 mcrakhman

Does this sound familiar, given your recent tests using test ground @merlinran ?

Jun 21 '21 12:06 carsonfarmer

At the same time Device A requests Head of Log B through GetRecord effectively getting it through Bitswap. As a result of this the Head of Log B is added to Blockstore.

IIUC what triggers this bug is that Device A gets the log head via a 3rd party channel and calls GetRecord before the records being synced (I tried the same when drafting the first Testground test, that's why SharedInfo.LogHead exists but was not used!). It's probably not an initially envisioned use case of go-threads, but definitely a legit one.

What if there's a new API call so Device A can explicitly tell go-threads that this is the log head so the data structure gets set properly? Far from an elegant solution but it introduces no performance penalty.

Jun 21 '21 14:06 merlinran

What if there's a new API call so Device A can explicitly tell go-threads that this is the log head so the data structure gets set properly? Far from an elegant solution but it introduces no performance penalty.

Can you elaborate please?

What I initially thought that as far as I understand SetHead preserves the invariant that before the head there are no gaps in the records. Therefore we can introduce some kind of counter just for the head. Obviously because we set new head just after the previous one we can always maintain the counter just by incrementing it every time we set new head.

Thus if every log has a counter for its head we can easily compare them (at the time we get records) to see if we need to accept more records and how many records should be between the heads. It seems it would be easy to implement such a solution if I am not mistaken.

Jun 21 '21 15:06 mcrakhman

Sorry my understanding of the codebase is still far from complete, and my proposal simply breaks the invariant of calling SetHead which would cause more headaches.

Having the counter for log depth sounds like a brilliant idea! Naive question: how can a peer get the expected value of the counter for a log? Does it mean that each record brings the counter?

Jun 21 '21 16:06 merlinran

Yes, the log depth counter is what we had in mind for our vector clock proposal exactly. I'm a big fan of this simple feature, because it actually buys us a lot of features in terms of peer synchronization beyond just this bug here. It would require an update to the head data structure, which, while minimal, would likely require some thought. Now is a good time to do this though, as we already have some breaking changes in the pipeline.

Jun 21 '21 16:06 carsonfarmer

Since only one peer is able to write to a given log, if you get a head from a peer, you know that that head counter value is correct (as far as that head record is concerned). This also means you can compare your head counters with another peer's head counters to get all the semantics that come with vector clocks.

Jun 21 '21 16:06 carsonfarmer

Do you think it makes sense for me to draft some proposal, so we would look at it on Thursday? And maybe you would already give me some comments

Jun 21 '21 17:06 mcrakhman

That would be fantastic!

Jun 21 '21 17:06 carsonfarmer

I reference the PR here (https://github.com/textileio/go-threads/pull/531), it is not finished and a lot of the stuff there is not final. The only thing which is implemented is the get records logic which fixes the bug. I didn't have time to check other things yet (including to properly test it, except for the testground test which is working for me), I will do that after our discussion on Thursday.

Jun 23 '21 19:06 mcrakhman

So here are some things to note:

I still need to update AddHeads method which should be easy. Though I see that we don't use multihead logs anywhere. Is that a real use case?
We should probably do a small migration to add head counters for the logs which currently don't have them (for example if a person had an older version). Haven't implemented that though.
Initially I added the counter to push record, for example if a somebody creates a record we obviously know the counter for that record and thus we can send it. In some cases it is impossible though (for example if we AddRecord), so we can just send a default value. Should we use this logic for a push record or should we omit it altogether?
Not sure if I did a proper backward compatibility checks everywhere (if counter value is zero we use old isKnown checks)

Jun 23 '21 20:06 mcrakhman