
[WIP] Add limitation for RPC slow queries

Open · chenyukang opened this issue 1 year ago • 1 comment

What problem does this PR solve?

Currently, some indexer RPC APIs such as get_cells, get_transactions, and get_cells_capacity can consume a lot of computing resources. They may occupy all of the tokio tasks, leaving ckb unable to respond to other RPC calls.

What is changed and how it works?

This PR tries to fix the issue by limiting the computing resources available to slow queries.
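To illustrate the general idea, here is a minimal sketch of one way to bound the work a single slow query can do: cap the number of entries scanned per call and return a cursor so the client can resume. All names (`MAX_SCAN_PER_CALL`, `scan_limited`, `Page`) are illustrative assumptions, not the actual ckb implementation.

```rust
// Hypothetical per-call scan budget for a slow indexer query.
const MAX_SCAN_PER_CALL: usize = 10_000;

// One page of results plus a cursor for resuming a truncated scan.
struct Page<T> {
    items: Vec<T>,
    // Position of the next unscanned entry, if the scan was cut short.
    next_cursor: Option<usize>,
}

// Scan at most `wanted` entries starting at `start`, but never more
// than the server-side budget, so a single call cannot monopolize
// the node no matter what the client asks for.
fn scan_limited<T: Clone>(entries: &[T], start: usize, wanted: usize) -> Page<T> {
    let budget = wanted.min(MAX_SCAN_PER_CALL);
    let end = (start + budget).min(entries.len());
    let next_cursor = if end < entries.len() { Some(end) } else { None };
    Page { items: entries[start..end].to_vec(), next_cursor }
}

fn main() {
    let data: Vec<u64> = (0..25_000).collect();
    let page = scan_limited(&data, 0, usize::MAX);
    println!("{} {:?}", page.items.len(), page.next_cursor);
}
```

A client that receives a `next_cursor` simply issues the same query again from that position, so no single request ever holds the executor for long.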

What's Changed:

Related changes

  • PR to update owner/repo:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code ci-runs-only: [ quick_checks,linters ]

Side effects

  • Performance regression
  • Breaking backward compatibility

Release note

Title Only: Include only the PR title in the release note.

chenyukang avatar May 24 '24 01:05 chenyukang

I think there is something wrong with the modification of the iterator. When the user's requested value exceeds the default, it should be clamped to the upper limit and the result returned, instead of returning an error directly. This is also how the eth filter-related RPCs behave.

Just change the following

let limit = std::cmp::min(limit.value() as usize, iterator_next_limit);
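In context, the suggestion amounts to replacing the error path with a clamp. A small sketch of that behavior, where `iterator_next_limit` stands in for the server-side maximum and the function name is illustrative rather than taken from the ckb codebase:

```rust
// Clamp a user-supplied page limit to the server-side maximum instead
// of rejecting oversized requests, mirroring eth filter-style RPCs.
fn effective_limit(user_limit: usize, iterator_next_limit: usize) -> usize {
    // Instead of: if user_limit > iterator_next_limit { return Err(...) }
    std::cmp::min(user_limit, iterator_next_limit)
}

fn main() {
    let iterator_next_limit = 10_000;
    // An oversized request is silently capped rather than erroring.
    println!("{}", effective_limit(1_000_000, iterator_next_limit));
    // A request within bounds passes through unchanged.
    println!("{}", effective_limit(50, iterator_next_limit));
}
```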

driftluo avatar Aug 07 '24 06:08 driftluo

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Dec 06 '24 02:12 github-actions[bot]

This PR was closed because it has been stalled for 10 days with no activity.

github-actions[bot] avatar Dec 16 '24 02:12 github-actions[bot]

@doitian as you are aware, on Testnet I encountered an issue that renders a node unresponsive for a few minutes, which is closely related to this PR's underlying issue.

Is there any work being done or planned to address the underlying issue?

Phroi

P.S.: I don't mind adding a comment to this PR with the exact description of the issue I stumbled upon.

phroi avatar Nov 15 '25 13:11 phroi

@phroi yes, you're welcome to add more details.

chenyukang avatar Nov 16 '25 00:11 chenyukang

Hey @chenyukang!! One month ago I reported this RPC call as a DoS vector on Nervos Bounty, and no remediation has been started as a result:

{
    "id": 42,
    "jsonrpc": "2.0",
    "method": "get_cells_capacity",
    "params": [
        {
            "script": {
                "code_hash": "0x9bd7e06f3ecf4be0f2fcd2188b23f1b9fcc88e5d4b65a8637b17723bbda3cce8",
                "hash_type": "type",
                "args": "0x"
            },
            "script_type": "lock"
        }
    ]
}

This RPC call matches a large share of Testnet live cells, and it was randomly discovered by an LLM while writing tests, so it's pretty easy to stumble upon.

Repeated calls progressively increase node downtime, and attempting to terminate the RPC request does not stop the processing. This means that any node exposing this method can be made unusable.

The same issue doesn't appear on a Mainnet node, likely because far fewer live cells match that query. That said, as L1 state grows, it will eventually impact Mainnet as well.

I can't stress enough the risk for any internet-exposed RPC node: there is no fundamental mitigation deployed and no documentation, and the proposed workarounds are either impractical or out of date.

I would like to formally request that work on this PR be resumed, or at the very least that its DoS-able nature be clearly documented in the RPC docs, so that practitioners can understand the trade-offs involved.

Love & Peace, Phroi

phroi avatar Nov 18 '25 18:11 phroi

@phroi Thanks for the details.

As for limiting how many server resources an RPC call can consume, we already provide some config parameters from these PRs: https://github.com/nervosnetwork/ckb/pull/4576 https://github.com/nervosnetwork/ckb/pull/4529

You're right, it's still a potential issue in the long term, especially for RPC get_cells_capacity.

We will continue the work on this part to prevent DoS (Testnet needs to be fixed), add the related documentation, and provide an Nginx configuration template.

chenyukang avatar Nov 19 '25 11:11 chenyukang

Hey @chenyukang!! Yeah, the linked PRs are steps in the right direction, but they do not seem to have any effect on get_cells_capacity.

> You're right, it's still a potential issue in the long term, especially for RPC get_cells_capacity.
>
> We will continue the work on this part to prevent DoS (Testnet needs to be fixed), add the related documentation, and provide an Nginx configuration template.

That's exactly what I'm looking for, thank you so much!! 🥰

Additionally, I would evaluate whether it makes sense to deprecate that method, because get_cells_capacity results can be calculated from get_cells, which is already future-proofed.
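For the record, the calculation being described is just a page-by-page sum over get_cells results. A hedged sketch, where `fetch_page` is a hypothetical stand-in for the actual get_cells RPC call and `Cell` is reduced to its capacity field (in shannons):

```rust
// Minimal stand-in for a live cell; real get_cells responses carry
// much more, but only capacity matters here.
struct Cell { capacity: u64 }

// Hypothetical paginated source: returns up to `limit` cells starting
// at `cursor`, plus the next cursor when more cells remain. In practice
// this would be a get_cells RPC call with an after_cursor parameter.
fn fetch_page(all: &[Cell], cursor: usize, limit: usize) -> (&[Cell], Option<usize>) {
    let end = (cursor + limit).min(all.len());
    let next = if end < all.len() { Some(end) } else { None };
    (&all[cursor..end], next)
}

// Sum capacities page by page until the source reports no next cursor,
// reproducing what get_cells_capacity computes in one shot.
fn total_capacity(all: &[Cell]) -> u64 {
    let mut sum = 0u64;
    let mut cursor = Some(0);
    while let Some(c) = cursor {
        let (page, next) = fetch_page(all, c, 100);
        sum += page.iter().map(|cell| cell.capacity).sum::<u64>();
        cursor = next;
    }
    sum
}

fn main() {
    let cells: Vec<Cell> = (1..=250).map(|i| Cell { capacity: i }).collect();
    println!("{}", total_capacity(&cells));
}
```

Because each page is bounded, this client-side version spreads the cost across many small requests instead of one unbounded scan.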

BTW, please remember to tag me in the associated PRs; not that I'm qualified to review CKB node code, but I'd like to keep an eye on how/if this issue actually gets solved.

Love & Peace, Phroi

phroi avatar Nov 19 '25 22:11 phroi