bittensor-subnet-template icon indicating copy to clipboard operation
bittensor-subnet-template copied to clipboard

Remove abusable code from subnet template

Open cxmplex opened this issue 9 months ago • 6 comments

The default subnet code has been reused on major mainnet subnets leading to predictable issues with lack of checks towards stakeless validators, and a triggerable exception with a longer stacktrace that a normal blacklistexception.

Forcing an error in the blacklist_fn leads to an amplication of the normal blacklist response. For instance, a normal blacklist repsonse will state the short reason, i.e. not registered. The error blacklist response will include the full error message. Some error messages may be longer than others (and the ValueError one is), leading to an amplication in the normal response. The effect amplification is around 2-3x depending on the original return text of the blacklist_fn.

Utilizing these flaws, attackers were able to do two things:

Subnet 2 (missing stake check)

On subnet 2, a validators minimum stake was not being checked. As a result, a malicious miner was able to register a low stake validator, and prompt users with challenges. The purpose of these challenges was to slow miners average response time in comparison to their normal response time to a real validator. In addition, the attacker employed several sneaky methods such as monitoring other validator requests, in order to sync their request to a validators, forcing a concurrent request on the innocent miner.

As a result, I believe it is best practice to leave in a default stake blacklist as an example, and allow subnet devs to REMOVE this if they wish, not the other way around. In best practice, the safest solution should be offered as the example, not a barebones solution. If the goal was for subnet devs to inspect the blacklist function and tailor it to their use-case, this has been demonstrated to not happen on at least 6 subnets (likely more). In most subnets, a stakeless validator has no actual purpose of calling a miners synapse. In practice, this lack of a default example has been shown to cause issues in various subnets that are often not picked up on right away.

Subnet 31 (triggerable unhandled exception)

Other affected subnets with a similar issue to 31 included subnet 33, 28, 25, 16, 11, 5. Here's my leading theory on why this leads to exhaustion, rather than the normal blacklist behavior, which as @mjurbanski-reef has shown, already throws an exception normally. Let's compare both scenarios, specifically taking a look at the traceback that will be printed here:

https://github.com/opentensor/bittensor/blob/master/bittensor/axon.py#L959

Unhandled exception in blacklist_fn:

blacklist_fn -> blacklist -> dispatch

Normal handled exception stacktrace:

blacklist -> dispatch

As you can see, forcing an error in the blacklist_fn in the subnet (note not the blacklist fn in axon.py) leads to a larger stacktrace. This stacktrace would cost additional resources to gather and print, and would even possibly include stack trace information from the asyncio upstream, metagraph calls, etc. The stacktrace length should be roughly 3-5x the size due to the increased callstack.

While a normal exception thrown by blacklist with normal behavior (i.e. the person is blacklisted) leads to a much shorter stacktrace, only blacklist, then caught by dispatch.

It's my belief this is part one to why this leads to exhaustion. The second part, is the increased response size when throwing an error from the blacklist_fn. A combination of both means greater CPU/mem pressure, and greater networking pressure. When hundreds to thousands of requests are coming in, small inefficiencies like these can play a big part to exhausting your miner. From testing, removing the unhandled exception from blacklist_fn led to continued operation for the miner while being attacked, while not handling the exception and letting it pass downstream caused server reboots on various miners machines, and overall total DoS to the axon. Why this would happen? I can only guess differences in stacktrace/callstack length.

cxmplex avatar May 22 '24 08:05 cxmplex