
Code 500 during /bzz POST with pinning

Open tmm360 opened this issue 1 year ago • 12 comments

Context

Running Bee 1.7.1, I'm posting a large file (~430 MB) via /bzz with pinning enabled. I had uploaded another large file of ~900 MB before this one.

Summary

I get this log from node:

"time"="2023-08-05 22:25:02.359854" "level"="info" "logger"="node/api" "msg"="api access" "ip"="<myIP>" "method"="POST" "host"="<myHost>" "uri"="/bzz" "proto"="HTTP/1.1" "status"=500 "size"=47 "duration"="10m17.142678591s" "x-forwarded-for"="10.0.0.37, 10.0.8.31"

I don't know how to access debug logs.

Expected behavior

Upload should succeed.

Actual behavior

Upload fails with error 500.

Steps to reproduce

Not sure; try uploading large files with pinning enabled.

tmm360 avatar Aug 06 '23 00:08 tmm360

I'm now receiving error 500 with any file, even brand-new ones of small size (~2 MB). Attaching logs: logs.zip

tmm360 avatar Nov 15 '23 12:11 tmm360

Why are you attempting to pin a zero reference? That will never succeed.

"time"="2023-11-15 12:03:19.567727" "level"="error" "logger"="node/api/post_pin" "msg"="pin collection failed" "error"="traversal: failed to get root chunk 0000000000000000000000000000000000000000000000000000000000000000: invalid address queried"
"time"="2023-11-15 12:03:19.568165" "level"="info" "logger"="node/api" "msg"="api access" "ip"="10.0.8.111" "method"="POST" "host"="bee0:1633" "uri"="/pins/0000000000000000000000000000000000000000000000000000000000000000" "proto"="HTTP/1.1" "status"=500 "size"=48 "duration"="4.393158ms"

ldeffenb avatar Nov 15 '23 12:11 ldeffenb

Why are you attempting to pin a zero reference? That will never succeed.

/pins/0000000000000000000000000000000000000000000000000000000000000000

I know... I have to investigate it; it's related to our automatic management system. Anyway, it shouldn't be a problem.

tmm360 avatar Nov 15 '23 12:11 tmm360

Actually, it looks like this is still the underlying original error:

"time"="2023-11-15 12:08:18.833068" "level"="debug" "logger"="node/api/post_bzz" "msg"="store dir failed" "error"="read dir stream: content-length missing"
"time"="2023-11-15 12:08:18.833356" "level"="error" "logger"="node/api/post_bzz" "msg"="store dir failed"
"time"="2023-11-15 12:08:18.840913" "level"="info" "logger"="node/api" "msg"="api access" "ip"="10.0.8.83" "method"="POST" "host"="gateway.etherna.io" "uri"="/bzz" "proto"="HTTP/1.1" "status"=500 "size"=52 "duration"="10.547959ms" "x-forwarded-for"="149.88.27.220, 10.0.0.38, 10.0.8.80"

ldeffenb avatar Nov 15 '23 12:11 ldeffenb

@tmm360 It seems the attached file does not show the initial failure; something appears to have been stuck on the original error and could not move forward.

In order to assess this, can you reproduce it and attach the original failure logs please?

nikipapadatou avatar Nov 23 '23 10:11 nikipapadatou

Hi @tmm360. As @nikipapadatou said, the dir-upload error in the logs looks a bit different: that response took ~10 milliseconds, while the response duration in the issue body is ~10 minutes, which points to a different type of problem.

The upload error that @ldeffenb found in the logs is about a missing HTTP Content-Length header for a multipart-encoded upload. But that is an instant error from the initial check, in contrast to the 10-minute response of the initial failure.

janos avatar Nov 23 '23 11:11 janos

Unfortunately I'm not able to reproduce it methodically, and I don't know whether it would reproduce on a fresh node. This is from one of our production nodes, which has been through several Bee versions. I previously checked local storage for corrupted chunks and found many; that could be a starting point. I also know that one cause of corruption (concurrent uploads) has been fixed in recent versions. If the cause isn't simple to find here, and if a chunk fixer won't be deployed soon (assuming corruption is the root cause), I could try migrating my pinned content to a fresh node and trying again.

tmm360 avatar Nov 25 '23 17:11 tmm360

Corrupted chunks in your localstore/sharky can definitely cause random errors in other APIs that need to access those chunk references. I WILL be putting together some method of removing the corrupted chunks as soon as I get some time to put it together. It may be a multi-step and manual process, but migrating my pinned contents to another node is not an option for me due to the amount of data.

My tentative plan is to restore a -X DELETE /chunks API that will do (at least) the following:

a) Remove the chunk from the retrieval index.
b) If there is no retrieval index entry for the actual hash returned from sharky, then remove the reference from sharky.
c) Scan all other related indices and remove the chunk from those as well.
d) If the chunk is a primary pin reference, then unpin the entire set.
e) If the chunk is a secondary pin reference, then remove the entire parent pin set.

And of course, include extensive logging on all actions taken by this API!

ldeffenb avatar Nov 25 '23 17:11 ldeffenb

I think it should try to recover corrupted chunks of pinned content from the network before removing the pin from that content. I have a lot of pinned content too; this could only be done with an automatic process.

tmm360 avatar Nov 25 '23 18:11 tmm360

I think it should try to recover corrupted chunks of pinned content from the network before removing the pin from that content. I have a lot of pinned content too; this could only be done with an automatic process.

You won't be able to retrieve the corrupted chunks into the node that has the corruption, but you should certainly be able to simply pin the references into another, non-corrupted node. Write a script that queries the pins from the corrupted node and creates them in the new node. Just make sure you have good logging so you know which new pins succeeded and which failed. I've found that retrying the failed retrievals/pins (up to 10 times!) can sometimes yield a successful operation.

ldeffenb avatar Nov 26 '23 02:11 ldeffenb
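The re-pin approach described above could be sketched roughly as follows. This is a hypothetical sketch, not a tested tool: the node addresses and retry delay are assumptions, and the GET /pins and POST /pins/{reference} endpoints follow the Bee API as documented; verify them against your Bee version before using.

```python
# Hypothetical sketch of the re-pin script ldeffenb describes: query pins
# from the corrupted node, recreate them on a fresh node, retrying failed
# pins up to 10 times. Node addresses below are assumptions.
import json
import time
import urllib.request

OLD_NODE = "http://localhost:1633"  # assumed API address of corrupted node
NEW_NODE = "http://localhost:1733"  # assumed API address of fresh node
MAX_RETRIES = 10

def with_retries(fn, attempts=MAX_RETRIES, delay=1.0):
    """Call fn(), retrying on any exception up to `attempts` times."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

def list_pins(node):
    """GET /pins returns {"references": [...]} on the Bee API."""
    with urllib.request.urlopen(f"{node}/pins") as resp:
        return json.load(resp).get("references", [])

def create_pin(node, ref):
    """POST /pins/{reference} asks the node to traverse and pin it."""
    req = urllib.request.Request(f"{node}/pins/{ref}", method="POST")
    urllib.request.urlopen(req).close()

if __name__ == "__main__":
    # Log every outcome so failed references can be retried later.
    for ref in list_pins(OLD_NODE):
        try:
            with_retries(lambda: create_pin(NEW_NODE, ref))
            print(f"pinned {ref}")
        except Exception as exc:
            print(f"FAILED {ref}: {exc}")
```

The per-reference log output is the "good logging" mentioned above: a failed line can be fed back into the script for another round of retries.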

@tmm360 do you happen to have the localstore folder of the instance you were able to reproduce this bug last time?

notanatol avatar Jan 16 '24 16:01 notanatol

@notanatol yes, I have the localstore folders of both nodes, but they are very big: 281 GB and 245 GB. If you write me at mirko(at)etherna.io we can organize a way to pass them to you.

tmm360 avatar Jan 16 '24 16:01 tmm360

@notanatol unfortunately I've also tried uploading on a fresh new node, without any pinning on it, but I still get this error. Maybe a tcpdump in front of the node could help identify the issue?

tmm360 avatar Mar 01 '24 12:03 tmm360

@notanatol unfortunately I've also tried uploading on a fresh new node, without any pinning on it, but I still get this error. Maybe a tcpdump in front of the node could help identify the issue?

I doubt it; it's likely related to the node's internal state. What are you trying to upload? Can you share that file? Maybe I'll be able to reproduce it.

notanatol avatar Mar 01 '24 13:03 notanatol

@notanatol ok, I will send you a link to it by email

tmm360 avatar Mar 01 '24 13:03 tmm360

I've found the solution to my issue. I was uploading a file with multipart/form-data, but my HTTP client didn't include a Content-Length header for each file part in the request payload. The HTTP standard doesn't require it (which is why my client wasn't adding it), but Bee does. This caused the "content-length missing" error exposed by the debug logs.

tmm360 avatar Mar 03 '24 21:03 tmm360

I've found the solution to my issue. I was uploading a file with multipart/form-data, but my HTTP client didn't include a Content-Length header for each file part in the request payload. The HTTP standard doesn't require it (which is why my client wasn't adding it), but Bee does. This caused the "content-length missing" error exposed by the debug logs.

Perhaps we should change it to a 400 and provide details in the response payload.

notanatol avatar Mar 04 '24 01:03 notanatol