Critical Need to Prioritize Substrate Upgrades to Mitigate Operational Risks
Describe the bug
Our blockchain infrastructure currently faces significant technical debt due to delayed framework updates. Specifically:
- Vulnerability Exposure: We encountered a critical bug that was only resolved in the Substrate polkadot-1.16.0 release. Remaining on our current version (1.1.0) leaves the chain exposed to multiple known vulnerabilities patched in later releases.
- Operational Risks: Failure to steadily upgrade risks catastrophic chain halts (e.g., finality stalls, block production failures). While we recovered from a recent multi-day outage, a recurrence in production could:
  - Damage the company’s reputation
  - Force costly emergency measures (chain reset/restart)
- As the sole maintainer of the TFChain project at the moment, I am facing significant resource constraints. Despite the critical nature of the ongoing work, I am continuously redirected to other priorities and projects, leaving essential TFChain development and maintenance tasks at risk of delay or neglect. I recommend allocating dedicated engineering resources to systematically upgrade Substrate. Postponing this work compounds the technical debt and increases operational risk with every release we fall behind.
Additional context
https://github.com/threefoldtech/tfchain/issues/1029
how much work to upgrade?
The required work will span several months (including iterations of development, deployment, and testing) and cannot be estimated precisely: there are approximately 15 releases to migrate through, each of which may introduce breaking changes to the Substrate core pallets we depend on.
- The upgrade roadmap is incremental: releases must be applied one after another and cannot be executed in parallel.
- Each upgrade typically includes migrations for core pallets, requiring careful execution and thorough testing before deployment.
- Every release must be deployed on development networks and rigorously tested to identify and resolve potential issues.
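
To illustrate why the upgrades have to be applied in order, here is a minimal, self-contained Rust sketch. It is a simplified model and not the actual `frame_support` API: `ChainState`, `Migration`, `UpgradeError`, and `upgrade_sequentially` are hypothetical names introduced only for this example. It assumes that each migration is gated on the storage version written by the previous one, which is a common pattern for pallet migrations.

```rust
/// Simplified stand-in for a pallet's on-chain storage version.
/// (Hypothetical type; real pallets track this through frame_support.)
#[derive(Debug, Clone, PartialEq)]
pub struct ChainState {
    pub storage_version: u16,
}

/// One upgrade step that migrates storage from `from` to `from + 1`.
#[derive(Debug, Clone, Copy)]
pub struct Migration {
    pub from: u16,
}

#[derive(Debug, PartialEq)]
pub enum UpgradeError {
    /// The migration expects a different storage version than the one on chain.
    VersionMismatch { expected: u16, found: u16 },
}

impl Migration {
    /// Run the migration only if the chain is at the expected version;
    /// otherwise leave the state untouched and report the mismatch.
    pub fn apply(&self, state: &mut ChainState) -> Result<(), UpgradeError> {
        if state.storage_version != self.from {
            return Err(UpgradeError::VersionMismatch {
                expected: self.from,
                found: state.storage_version,
            });
        }
        state.storage_version = self.from + 1;
        Ok(())
    }
}

/// Apply the steps strictly in the given order, stopping at the first failure.
/// Returns the storage version reached when every step succeeds.
pub fn upgrade_sequentially(
    state: &mut ChainState,
    steps: &[Migration],
) -> Result<u16, UpgradeError> {
    for step in steps {
        step.apply(state)?;
    }
    Ok(state.storage_version)
}
```

In this model, calling `upgrade_sequentially` with steps for versions 0, 1, and 2 on a chain at version 0 succeeds and ends at version 3, while applying `Migration { from: 2 }` directly to a chain at version 0 returns `VersionMismatch` and changes nothing. That is the reason a release cannot simply be skipped.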
I can provide more accurate timelines once I can dedicate focused effort to this, potentially after leading 1-2 upgrades myself.
BTW, this issue was known and flagged earlier (by Dylan before he left) but was deprioritized due to resource constraints. Since joining the TFChain project, one of my main focuses has been addressing the backlog of bug reports, particularly critical billing-related issues, all of which were resolved in the last milestone. Erwan was previously tasked with leading the upgrades but left before completing them.
Maybe it would be easier to just create a new TFChain 4.0 without billing or capacity tracking, which we won't need, and migrate the nodes to it.
would that be less work?
Possibly, yes. If there’s no plan to keep version 3 operational alongside version 4, then deprioritizing the Substrate upgrades is justified. I just felt it necessary to bring this issue to attention.
That said, I’m curious: what is the timeline for planning and developing TFChain 4? I’d appreciate being involved in the discussions early on so I can contribute.