Code 500 during /bzz POST with pinning
Context
Running Bee 1.7.1, I'm posting a large file (~430 MB) via /bzz with pinning enabled. I had previously uploaded another large file of ~900 MB.
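For context, a minimal sketch of this kind of upload, assuming a local Bee API on port 1633; the file name and postage batch ID are placeholders. Swarm-Pin and Swarm-Postage-Batch-Id are the documented /bzz upload headers:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	f, err := os.Open("large-file.bin") // ~430 MB in this report
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	req, err := http.NewRequest(http.MethodPost, "http://localhost:1633/bzz?name=large-file.bin", f)
	if err != nil {
		panic(err)
	}
	req.ContentLength = fi.Size()
	req.Header.Set("Content-Type", "application/octet-stream")
	req.Header.Set("Swarm-Pin", "true")                    // pin while uploading
	req.Header.Set("Swarm-Postage-Batch-Id", "<batch-id>") // placeholder

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status) // 500 after ~10 minutes in this report
}
```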
Summary
I get this log from the node:
"time"="2023-08-05 22:25:02.359854" "level"="info" "logger"="node/api" "msg"="api access" "ip"="<myIP>" "method"="POST" "host"="<myHost>" "uri"="/bzz" "proto"="HTTP/1.1" "status"=500 "size"=47 "duration"="10m17.142678591s" "x-forwarded-for"="10.0.0.37, 10.0.8.31"
I don't know how to access debug logs.
Expected behavior
Upload should succeed.
Actual behavior
Upload fails with error 500.
Steps to reproduce
Not sure; try uploading large files while pinning them.
I'm now receiving error 500 with any file, even with brand-new files of small size (~2 MB). Attaching logs: logs.zip
Why are you attempting to pin a zero reference? That will never succeed.
"time"="2023-11-15 12:03:19.567727" "level"="error" "logger"="node/api/post_pin" "msg"="pin collection failed" "error"="traversal: failed to get root chunk 0000000000000000000000000000000000000000000000000000000000000000: invalid address queried"
"time"="2023-11-15 12:03:19.568165" "level"="info" "logger"="node/api" "msg"="api access" "ip"="10.0.8.111" "method"="POST" "host"="bee0:1633" "uri"="/pins/0000000000000000000000000000000000000000000000000000000000000000" "proto"="HTTP/1.1" "status"=500 "size"=48 "duration"="4.393158ms
> Why are you attempting to pin a zero reference? That will never succeed.
> /pins/0000000000000000000000000000000000000000000000000000000000000000
I know... I have to investigate it, but it's related to our automatic management system. Anyway, it shouldn't be a problem.
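As a stopgap, a hypothetical guard in the management system could reject the zero reference before ever calling POST /pins; the function below is illustrative only, assuming 64-hex-character references as in the log above:

```go
package pinning

// isZeroReference reports whether ref is the all-zero Swarm address,
// which the node rejects with "invalid address queried".
func isZeroReference(ref string) bool {
	if len(ref) != 64 {
		return false
	}
	for _, c := range ref {
		if c != '0' {
			return false
		}
	}
	return true
}
```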
Actually, it looks like this is still the underlying original error:
"time"="2023-11-15 12:08:18.833068" "level"="debug" "logger"="node/api/post_bzz" "msg"="store dir failed" "error"="read dir stream: content-length missing"
"time"="2023-11-15 12:08:18.833356" "level"="error" "logger"="node/api/post_bzz" "msg"="store dir failed"
"time"="2023-11-15 12:08:18.840913" "level"="info" "logger"="node/api" "msg"="api access" "ip"="10.0.8.83" "method"="POST" "host"="gateway.etherna.io" "uri"="/bzz" "proto"="HTTP/1.1" "status"=500 "size"=52 "duration"="10.547959ms" "x-forwarded-for"="149.88.27.220, 10.0.0.38, 10.0.8.80"
@tmm360 The attached file does not seem to capture the initial failure. It looks like something got stuck on the original error and could not move forward.
In order to assess this, can you reproduce it and attach the original failure logs please?
Hi @tmm360. As @nikipapadatou said, the error from the dir upload in the logs appears to be a different one: that response took ~10 milliseconds, while the response duration in the issue body is ~10 minutes, which points to a different type of problem.
The upload error that @ldeffenb found in the logs is about a missing HTTP request Content-Length header for a multipart-encoded upload. But that is an instant error from the initial check, in contrast to the 10-minute-long response of the initial failure.
Unfortunately I'm not able to reproduce it methodically, and I don't know whether it would reproduce on a brand-new node. This is from one of our production nodes, which has been through several Bee versions. I previously checked local storage for corrupted chunks and found many; that could be a starting point. I also know that one cause of corruption (concurrent uploads) has been fixed in recent versions. If the cause is not easy to find here, and if a chunk fixer will not be deployed soon (assuming corruption is the root cause), I could try migrating my pinned content to a fresh node and trying again.
Corrupted chunks in your localstore/sharky can definitely cause random errors in other APIs that need to access those chunk references. I WILL be putting together some method of removing the corrupted chunks as soon as I get some time. It may be a multi-step, manual process, but migrating my pinned contents to another node is not an option for me due to the amount of data.
My tentative plan is to restore a -X DELETE /chunks API that will do (at least) the following:
a) Remove the chunk from the retrieval index.
b) If there is no retrieval index entry for the actual hash returned from sharky, remove the reference from sharky.
c) Scan all other related indices and remove the chunk from them as well.
d) If the chunk is a primary pin reference, unpin the entire set.
e) If the chunk is a secondary pin reference, remove the entire parent pin set.
And of course, include extensive logging on all actions taken by this API!
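A rough sketch of that flow, in the order described above; every type and method here is hypothetical, since the real Bee localstore indices have different names and signatures:

```go
package chunkfix

import "log"

// Store is a stand-in for the localstore indices the plan touches.
// All of these methods are assumptions for illustration.
type Store interface {
	RemoveFromRetrievalIndex(addr []byte) error
	SharkyHash(addr []byte) ([]byte, bool) // actual hash stored in sharky, if any
	HasRetrievalEntry(hash []byte) bool
	ReleaseSharky(addr []byte) error
	RemoveFromSecondaryIndices(addr []byte) error
	IsPrimaryPin(addr []byte) bool
	IsSecondaryPin(addr []byte) (parent []byte, ok bool)
	UnpinCollection(root []byte) error
}

// deleteChunk mirrors steps a) through e) of the tentative plan,
// logging each action as suggested.
func deleteChunk(s Store, addr []byte) error {
	// a) remove the chunk from the retrieval index
	if err := s.RemoveFromRetrievalIndex(addr); err != nil {
		return err
	}
	log.Printf("removed %x from retrieval index", addr)

	// b) if no retrieval entry points at the sharky payload, free it
	if hash, ok := s.SharkyHash(addr); ok && !s.HasRetrievalEntry(hash) {
		if err := s.ReleaseSharky(addr); err != nil {
			return err
		}
		log.Printf("released sharky slot for %x", addr)
	}

	// c) scrub all other indices that reference the chunk
	if err := s.RemoveFromSecondaryIndices(addr); err != nil {
		return err
	}
	log.Printf("removed %x from secondary indices", addr)

	// d) primary pin reference: unpin the entire set
	if s.IsPrimaryPin(addr) {
		log.Printf("%x is a primary pin reference; unpinning collection", addr)
		return s.UnpinCollection(addr)
	}
	// e) secondary pin reference: remove the whole parent pin set
	if parent, ok := s.IsSecondaryPin(addr); ok {
		log.Printf("%x belongs to pin set %x; removing parent set", addr, parent)
		return s.UnpinCollection(parent)
	}
	return nil
}
```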
I think it should try to recover corrupted chunks of pinned content from the network before removing the pin from the content. I have a lot of pinned content too, and this could only be done with an automatic process.
You won't be able to retrieve the corrupted chunks into the node that has the corruption, but you should certainly be able to simply pin the references into another, non-corrupted node. Write a script that queries the pins from the corrupted node and creates the pins in the new node. Just make sure you have good logging so you know which new pins were successful and which failed. I've found that retrying the failed retrievals/pins (up to 10 times!) can sometimes get a successful operation.
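For example, a script along those lines might look like this; GET /pins and POST /pins/{reference} are documented Bee API calls, while the host names, retry delay, and retry count are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

const (
	oldNode = "http://old-node:1633" // placeholder hosts
	newNode = "http://new-node:1633"
	retries = 10 // retrying up to 10 times, per the comment above
)

func main() {
	resp, err := http.Get(oldNode + "/pins")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// GET /pins returns {"references": ["<hex>", ...]}
	var pins struct {
		References []string `json:"references"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&pins); err != nil {
		panic(err)
	}

	for _, ref := range pins.References {
		ok := false
		for i := 0; i < retries && !ok; i++ {
			r, err := http.Post(newNode+"/pins/"+ref, "", nil)
			if err == nil {
				ok = r.StatusCode == http.StatusCreated || r.StatusCode == http.StatusOK
				r.Body.Close()
			}
			if !ok {
				time.Sleep(5 * time.Second)
			}
		}
		// log success/failure so failed pins can be retried later
		fmt.Printf("pin %s ok=%v\n", ref, ok)
	}
}
```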
@tmm360 do you happen to have the localstore folder of the instance on which you were able to reproduce this bug last time?
@notanatol yes, I have the localstore folders of both nodes, but they are very big: 281 GB and 245 GB. If you write me at mirko(at)etherna.io we can arrange a way to get them to you.
@notanatol unfortunately I've also tried to upload on a fresh new node, without any pinning on it, but I still get this error. Maybe a tcpdump in front of the node could help identify the issue?
I doubt it; it's likely related to the node's internal state. What are you trying to upload? Can you share that file? Maybe I'll be able to reproduce it.
@notanatol ok I will send you a link to it by email
I've found the solution to my issue. I was trying to upload a file with multipart/form-data, but the HTTP client didn't include a Content-Length header for each file in the request payload. It isn't required by the HTTP standard (my client wasn't adding it), but Bee requires it. This was causing the "content-length missing" error exposed by the debug logs.
Perhaps we should change it to a 400 and provide details in the payload.