
Migrated 1.5.0rc1 DELETE /pins/reference hangs

Open ldeffenb opened this issue 2 years ago • 4 comments

Context

An existing testnet node with 70,000+ pinned chunks was upgraded to 1.5.0 rc1. Attempting to delete an existing pin hangs.

Summary

I decided to remove all of my pins, so I did a GET /pins to get the list and then a DELETE /pins/{reference} to delete one of them, but the latter call never returns. Reproduced with curl as well.
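
For reference, the two calls boil down to something like the following minimal Go sketch. It assumes the node's API is on the default localhost:1633 and uses a placeholder reference; adjust both to your setup.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// A client timeout makes a "hanging" DELETE visible instead of blocking forever.
	client := &http.Client{Timeout: 30 * time.Second}

	// GET /pins returns the list of pinned root references.
	resp, err := client.Get("http://localhost:1633/pins")
	if err != nil {
		panic(err)
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	fmt.Println("pins:", string(body))

	// DELETE /pins/{reference} for one reference taken from the list above
	// (the reference below is a placeholder).
	ref := "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
	req, err := http.NewRequest(http.MethodDelete, "http://localhost:1633/pins/"+ref, nil)
	if err != nil {
		panic(err)
	}
	resp, err = client.Do(req)
	if err != nil {
		panic(err) // with the reported bug, this times out instead of returning
	}
	defer resp.Body.Close()
	fmt.Println("delete status:", resp.Status)
}
```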

Expected behavior

Expected the pin to be deleted and a response issued.

Actual behavior

No trace of any activity appears in the --verbosity 4 output log, and the DELETE /pins/{reference} request never gets a response.

Steps to reproduce

I don't know whether a newly pinned chunk can be deleted, but if you've upgraded a node that already has pinned chunks, I suspect you won't be able to delete them.

Possible solution

ldeffenb avatar Mar 02 '22 13:03 ldeffenb

@ldeffenb Thanks for reporting! How big is your statestore? Would it be possible to share it?

aloknerurkar avatar Mar 03 '22 15:03 aloknerurkar

My localstore is 434.8GB, 400.1GB of which is in sharky, and the statestore is 69.2MB. Which one do you need? Obviously the former would be hard to share.

ldeffenb avatar Mar 03 '22 19:03 ldeffenb

Yes, considering the number of chunks, sharing the sharky dir doesn't seem feasible... I want to check the statestore (~70MB) to see if I can figure anything out. Also, please send me the references you used. I have a statestore explorer; hopefully it still works! It may shed some light.

aloknerurkar avatar Mar 04 '22 06:03 aloknerurkar

Well, I restarted the node for other reasons, and then queried the pins from the restarted node. There are over 860,000 pins. But when I tried to delete the first few in the list, the behavior changed. Instead of hanging on the delete, I'm now getting a 500 Internal Server Error. The --verbosity 5 logs around that are as follows (scroll right to see the "traversal: unable to process bytes" root cause):

time="2022-03-04T08:23:18-05:00" level=debug msg="unpin root hash: deletion of pin for \"00002ff2e6923c0689e5225859a78570458ee48eb2bae6225f5b53b42435edd3\" failed: traversal of \"00002ff2e6923c0689e5225859a78570458ee48eb2bae6225f5b53b42435edd3\" failed: 1 error occurred:\n\t* traversal: unable to process bytes for \"00002ff2e6923c0689e5225859a78570458ee48eb2bae6225f5b53b42435edd3\": manifest iterate addresses: storage: not found\n\n"
time="2022-03-04T08:23:18-05:00" level=error msg="unpin root hash: deletion of pin for failed"
time="2022-03-04T08:23:18-05:00" level=info msg="api access" duration=8.2929485 ip=192.168.10.177 method=DELETE proto=HTTP/1.1 size=48 status=500 uri=/pins/00002ff2e6923c0689e5225859a78570458ee48eb2bae6225f5b53b42435edd3 user-agent=curl/7.79.1

And the same logs are shown for each of the pins I try to delete. I even tried the last one in the list and got the same messages. So it seems that at least these pins are manifest chunks with missing chunks somewhere within the manifest hierarchy, causing the traversal to fail and, consequently, the pin not to be deleted. I wish the traversal error would say which chunk was not found instead of simply logging the root chunk ID. (Note: I did a /chunks retrieval on the IDs and at least the root chunk is there, but it would have to be for the manifest traversal to be triggered at all.)
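
Roughly the kind of error wrapping I'm wishing for, as a sketch only (the names below are illustrative, not Bee's actual internals): if each traversal step wraps the failure with the address it was processing, the log ends up naming the missing child chunk as well as the root.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the underlying "storage: not found" error.
var errNotFound = errors.New("storage: not found")

// load simulates a store lookup that misses.
func load(addr string) error { return errNotFound }

// processChunk pretends to load a child chunk during manifest traversal and
// annotates the failure with the address that was actually missing.
func processChunk(addr string) error {
	if err := load(addr); err != nil {
		return fmt.Errorf("unable to process chunk %q: %w", addr, err)
	}
	return nil
}

func main() {
	root := "00002ff2e6923c0689e5225859a78570458ee48eb2bae6225f5b53b42435edd3"
	child := "ffff0000..." // hypothetical missing chunk somewhere inside the manifest

	if err := processChunk(child); err != nil {
		// Wrap again at the root level; both addresses survive in the message.
		err = fmt.Errorf("traversal of %q failed: %w", root, err)
		fmt.Println(err)
		fmt.Println("is not-found:", errors.Is(err, errNotFound))
	}
}
```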

So back to the original report, I'm now suspecting that the delete wasn't actually "hung" but was in fact traversing a very large manifest, trying to recursively unpin all of the chunks inside the manifest and its contents. This node has part of my OSM map tile set, so there are lots and lots of manifest nodes indexing lots of small PNG files. And each of the individual manifest nodes is actually individually and explicitly pinned by my uploader, so likely 50% or more of the pinned chunks are manifest nodes at some level of the hierarchy.
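
To make that concrete, here's a minimal sketch (again illustrative, not Bee's code) of what a recursive unpin has to walk before the DELETE can respond; with hundreds of thousands of reachable chunks, the walk alone can look like a hang.

```go
package main

import "fmt"

// dag maps a chunk address to the addresses it references, standing in for
// manifest and intermediate chunks; the names here are illustrative only.
type dag map[string][]string

// unpinAll visits every chunk reachable from root, which is what a recursive
// unpin has to do before the DELETE request can return.
func unpinAll(d dag, root string, seen map[string]bool) int {
	if seen[root] {
		return 0
	}
	seen[root] = true
	visited := 1 // "unpin" this chunk
	for _, child := range d[root] {
		visited += unpinAll(d, child, seen)
	}
	return visited
}

func main() {
	// Synthetic hierarchy: one root manifest referencing many leaf chunks.
	d := dag{}
	for i := 0; i < 100000; i++ {
		d["root"] = append(d["root"], fmt.Sprintf("leaf-%06d", i))
	}
	fmt.Println("chunks visited:", unpinAll(d, "root", map[string]bool{}))
}
```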

So, if you're still interested in the data, I can try to get it to you (likely via IPFS), but I believe "chunk rot" within the manifest nodes can explain both the originally reported "hang" behavior (there are no logs during the manifest traversal) and the current "internal server error" caused by the failed traversal.

While this is definitely not good behavior (inability to delete pins for incomplete manifest nodes), it's better than simply hanging with no logs while a large manifest unpin operation is attempted.

ldeffenb avatar Mar 04 '22 13:03 ldeffenb

@ldeffenb Is it safe to close this now? I think any new problems would be better tracked in new issues. Closing this for now.

aloknerurkar avatar Aug 21 '23 10:08 aloknerurkar