boxo icon indicating copy to clipboard operation
boxo copied to clipboard

chunker: infinite loop in the Rabin chunker

Open MichaelMure opened this issue 6 years ago • 1 comments

At Infura we got some of our nodes ending in an infinite loop in the Rabin chunker, with the CPU at 100% until something bad happen (the context doesn't reach there so loop keep spinning even though the request is long gone).

Here is a profiling: profile001

Original requests are add with the following parameters: [chunker=rabin-262144-524288-1048576,encoding=json,hash=sha2-256,inline-limit=32,pin=false,progress=true,stream-channels=true,]. Some also have trickle=true but that doesn't seem to matter.

Profiling lead to an infinite loop in https://github.com/whyrusleeping/chunker/blob/master/chunker.go#L209-L336 but it's unclear to me what is happening. if err == io.ErrUnexpectedEOF { err = nil } is suspicious but that kinda fly over my head.

MichaelMure avatar Dec 16 '19 18:12 MichaelMure

Thank you for submitting your first issue to this repository! A maintainer will be here shortly to triage and review. In the meantime, please double-check that you have provided all the necessary information to make this process easy! Any information that can help save additional round trips is useful! We currently aim to give initial feedback within two business days. If this does not happen, feel free to leave a comment. Please keep an eye on how this issue will be labeled, as labels give an overview of priorities, assignments and additional actions requested by the maintainers:

  • "Priority" labels will show how urgent this is for the team.
  • "Status" labels will show if this is ready to be worked on, blocked, or in progress.
  • "Need" labels will indicate if additional input or analysis is required.

Finally, remember to use https://discuss.ipfs.io if you just need general support.

welcome[bot] avatar Jun 16 '23 09:06 welcome[bot]

Btw, this is with go-ipfs 0.4.22 so v0.0.1 of this package, but that's the same github.com/whyrusleeping/chunker as master.

MichaelMure avatar Dec 16 '19 18:12 MichaelMure

@MichaelMure are you by chance able to share an input ( publicly or privately ) that reliably reproduces this? Or is the problem intermittent and the same input might hang a daemon once, and then pass through just fine on rerun?

ribasushi avatar Dec 17 '19 03:12 ribasushi

Unfortunately, that's all the information I have. It seems to be fairly rare, but to be fair, not many people play with the Rabin chunker.

MichaelMure avatar Dec 17 '19 09:12 MichaelMure

@MichaelMure understood. It definitely sounds like a missing underflow protection, but it will take me another couple days to get my test setup working well to pin this down. I should have something for you Thu-ish

ribasushi avatar Dec 17 '19 10:12 ribasushi

Note: our current rabin implementation is terrible. I wouldn't waste too much time trying to fix it.

The buzhash implementation in go-ipfs master should work much better.

Stebalien avatar Dec 17 '19 17:12 Stebalien

Will it get deprecated/dropped ? Because if someone can take out our nodes with a bad request, that's not so fun.

MichaelMure avatar Dec 17 '19 17:12 MichaelMure

Probably not? It's just a low-priority issue given that we support an additional chunker and are working on yet another chunker.

If you need it fixed now, I'm happy to accept a patch. However, the HTTP API is not designed to be a public API so this issue is significantly less critical for the IPFS team.

Stebalien avatar Dec 17 '19 17:12 Stebalien

I understand the low priority. My question is more if you plan to retire Rabin once Buzhash is there for good.

MichaelMure avatar Dec 17 '19 23:12 MichaelMure

@MichaelMure there's a sustained effort over the next ~month to better quantify what exactly do we want from a CDC, and where our actual bottlenecks are. Please track the main text of https://github.com/ipfs/specs/issues/227#issue-532891529 for further updates.

ribasushi avatar Dec 17 '19 23:12 ribasushi

@MichaelMure just a quick update on this ticket: this is almost certainly a ( but not 100% proven ) bug in the way the rabin chunker implementation is reading from the stream, resulting in an unexpected under-read and thus a loop for data that never comes.

Since this issue got opened originally, we have developed a different subsystem providing a unified read-interface for the rest of the ingestion pipeline, namely https://pkg.go.dev/github.com/ipfs/[email protected]?tab=doc#pkg-overview

This bug will be solved in the process of migrating to that system.

ribasushi avatar Apr 30 '20 13:04 ribasushi

Awesome thanks for the update :)

MichaelMure avatar Apr 30 '20 13:04 MichaelMure