
1.5.0rc3 Pusher repeats same chunk over and over

Open ldeffenb opened this issue 2 years ago • 4 comments

Context

bee 1.5.0rc3 89021656 on testnet under Windows

Summary

I have a large push queue waiting for a series of uploads. After finally delivering a bunch of chunks successfully, the pusher is now retrying a set of chunks over and over again. The error is mostly "storing outside of the neighborhood".

Expected behavior

I would expect some number of retries and then just keep the chunk locally until the pullsyncer manages to get it distributed to an interested node.

Actual behavior

The chunk is repeated over and over. The log for one of the chunks can be viewed at: https://gateway.ethswarm.org/access/0adfe66805f6748745ebfb1fd2f3557b68b1571ab8f89e98bd48c1f48d279c27

Here's my stream-of-consciousness conversation from https://discord.com/channels/799027393297514537/834408095689343016/951487276096061500

rc3 is looking (mostly) good after running now for over 3 hours. It finished my pusher queue and has gone back to the start to pick up the error'd chunks.

And that's the reason for my "mostly". Apparently, my pusher node has a bunch of chunks that flatly refuse to push into the swarm. I don't know if it's a latent stamp issue (only 1.10% utilized with a TTL of 9134470710 or >289 years) or what. The "pusher: cannot push chunk" logs are saying "storing outside of the neighborhood", whatever that means. It seems to be primarily 0xfd... chunks logging that at this time. For reference, the pusher node is 0x7cd... and I have a 0xade... testnet node as well. There are a few 0x815... and 0xd0... chunks showing up sometimes, but most of them are 0xfd... And my pusher node has a depth of 2, if that enters into the consideration.

And of course, now that I typed that, I just got a big burst of "stream reset" errors from the pusher as well.

So, looking at pushsync.go, I see that ClosestPeer has returned ErrWantSelf, but IsWithinDepth says the chunk address is not something we should be keeping. This causes it to skip pushToNeighbourhood and, because an error is returned from pushToClosest, the pusher leaves the chunk in the push queue for later retry.
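
In rough terms, the flow looks like this (a simplified sketch of how I read it, not the actual pushsync code; the helper signatures and error values are stand-ins):

```go
// Simplified sketch of the pushToClosest decision path described above.
// The helper signatures and error values are stand-ins, not bee's actual API.
package pushsync

import "errors"

var (
	errWantSelf          = errors.New("want self")                           // topology: store the chunk yourself
	errOutOfDepthStoring = errors.New("storing outside of the neighborhood") // chunk not in our neighborhood
)

// pushChunk: when the topology answers "want self" but the chunk is outside
// our depth, we neither store it nor replicate to neighbours; the error
// propagates up and the pusher leaves the chunk in its queue for a later retry.
func pushChunk(
	addr []byte,
	closestPeer func(addr []byte) (peer []byte, err error),
	isWithinDepth func(addr []byte) bool,
	pushToPeer func(peer, addr []byte) error,
	pushToNeighbourhood func(addr []byte) error,
) error {
	peer, err := closestPeer(addr)
	if errors.Is(err, errWantSelf) {
		if !isWithinDepth(addr) {
			return errOutOfDepthStoring // chunk stays in the push queue
		}
		return pushToNeighbourhood(addr)
	}
	if err != nil {
		return err
	}
	return pushToPeer(peer, addr)
}
```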

IsWithinDepth simply asks if the proximity for the chunk to the local node is >= the node's current depth. For 0xfd chunks in a 0x7c node, the proximity is zero (leading bit doesn't match), which is certainly not >= the node's depth of 2. But apparently ClosestPeer thinks the chunk should be stored locally via the ErrWantSelf return. Checking /topology's bin 0 now.
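
To make that concrete, here is the proximity math in miniature (illustrative code only, not bee's implementation):

```go
// Illustrative proximity/depth check: "proximity" here is the number of
// leading bits two addresses share.
package main

import "fmt"

// proximity returns the length of the common bit prefix of a and b.
func proximity(a, b []byte) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		x := a[i] ^ b[i]
		if x == 0 {
			continue // whole byte matches, keep counting
		}
		for bit := 0; bit < 8; bit++ {
			if x&(0x80>>bit) != 0 {
				return i*8 + bit // index of the first differing bit
			}
		}
	}
	return 8 * len(a)
}

func main() {
	node := []byte{0x7c}  // pusher node 0x7c... = 0111_1100...
	chunk := []byte{0xfd} // chunk 0xfd...       = 1111_1101...
	depth := 2

	p := proximity(node, chunk)
	fmt.Printf("proximity=%d, within depth=%v\n", p, p >= depth)
	// prints: proximity=0, within depth=false
	// The very first bit already differs, so the proximity is 0 < depth 2,
	// i.e. IsWithinDepth says "no" even though ClosestPeer said ErrWantSelf.
}
```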

Connected to 17 peers in bin 0: 9f654... c9c6c... f320f... 81018... ba545... ae9dc... 91796... a3b78... d61a7... 8d2b1... bc5a5... ade6b... e6ca5... dfd86... 9b7e... a75cc... dc71f...

Hard to believe that it has tried all of those nodes and still didn't find a place for the chunks to land? Let me see if the pusher is actually stuck doing the same chunks over and over and over again, or something similarly stoopid.

A snapshot of my pusher node's topology can be found at https://gateway.ethswarm.org/access/aa0513f2b508d5bbae48920c2c411aa54c43fb570b57bd281a21c7277569ccdf

Steps to reproduce

I'm guessing it takes the right topology, with the right kind of nodes refusing to accept a given chunk, causing ClosestPeer to eventually return ErrWantSelf even though the chunk is demonstrably outside the pusher node's local neighborhood depth.

Possible solution

Maybe put some sort of retry limit in the pusher on ErrOutOfDepthStoring and finally just remove the chunk from the push queue when that error has been seen on a given chunk too many times. Something similar to what pushToClosest does with fullSkipList, but keyed by chunk rather than by peer?
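
Something along these lines, purely as a hypothetical sketch (the threshold and names are made up):

```go
// Hypothetical chunk-level retry limit for the pusher; the threshold and names
// are invented for illustration, analogous to pushToClosest's fullSkipList but
// keyed by chunk address instead of by peer.
package pusher

import (
	"errors"
	"sync"
)

var errOutOfDepthStoring = errors.New("storing outside of the neighborhood")

const maxOutOfDepthRetries = 8 // arbitrary threshold for illustration

// outOfDepthCounter counts how often a chunk failed with errOutOfDepthStoring.
type outOfDepthCounter struct {
	mu     sync.Mutex
	counts map[string]int // keyed by chunk address
}

func newOutOfDepthCounter() *outOfDepthCounter {
	return &outOfDepthCounter{counts: make(map[string]int)}
}

// record notes a push failure for addr and reports whether the pusher should
// give up and drop the chunk from the push queue instead of retrying forever.
func (c *outOfDepthCounter) record(addr string, err error) (giveUp bool) {
	if !errors.Is(err, errOutOfDepthStoring) {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[addr]++
	return c.counts[addr] >= maxOutOfDepthRetries
}
```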

ldeffenb commented Mar 10 '22 15:03

If the stamp was invalid, the chunk would actually be marked as synced. This was one of the issues we fixed, where chunks in the push queue were being tried again and again and failing due to invalid stamps.

From your logs, I do see 17 peers being tried for the push and all 17 of them failing. In pushsync the skiplist blocks a peer from being retried for 5 minutes. This is also working, as the retries for a particular peer are at 5-minute intervals:

time="2022-03-10T09:46:13-05:00" level=debug msg="pushsync: could not push to peer f320fb810fe20dc35777814750a428e67443202f80521951b4bd64f995c23728: chunk fd3fcb270a5d92f093f1744cd511d94dbceccd3f48eb384acf8b69a35fc5eb8c receive receipt from peer f320fb810fe20dc35777814750a428e67443202f80521951b4bd64f995c23728: stream reset"

time="2022-03-10T09:51:14-05:00" level=debug msg="pushsync: could not push to peer f320fb810fe20dc35777814750a428e67443202f80521951b4bd64f995c23728: chunk fd3fcb270a5d92f093f1744cd511d94dbceccd3f48eb384acf8b69a35fc5eb8c receive receipt from peer f320fb810fe20dc35777814750a428e67443202f80521951b4bd64f995c23728: stream reset"

Why the other nodes refuse to store the chunk is a different issue, but it seems like the code is working as expected. Adding a pusher-level cache for failed chunks would definitely prevent this, and we will discuss it internally. From the perspective of the release, though, this is a non-blocker. I believe we would see this even with v1.4.3; there are no changes around these packages in the current release. Do you agree?

I would still like to release and then we can work on this issue for v1.5.1 which is just around the corner!

aloknerurkar commented Mar 11 '22 07:03

I do agree that this can be considered a non-blocker, and I also suspect that the issue is not new. It's just that I've been working against the much larger/deeper mainnet with my uploads for so long, and I suspect this is far less likely to appear there.

Also, most of the time I run my connectivity-enhanced version for uploading, but I am running the stock rc3 with fewer connections for this testing. Not that there are many more nodes even available on the testnet to connect to...

But in the end, it is definitely a problem and needs to be fixed. This pusher node is STILL pushing the same chunks out. Can I also suggest that some method of communicating a push-rejection reason back to the client be considered? Right now it is very opaque and, because of that, very difficult to debug push rejections.

ldeffenb commented Mar 11 '22 10:03

@ldeffenb is this happening on the testnet? can you still reproduce the problem?

acud commented Apr 19 '22 19:04

nm I can reproduce this easily on the testnet... will look into this...

acud commented Apr 19 '22 23:04