Critical Need to Prioritize Substrate Upgrades to Mitigate Operational Risks

Open sameh-farouk opened this issue 10 months ago • 5 comments

Describe the bug

Our blockchain infrastructure currently faces significant technical debt due to delayed framework updates. Specifically:

  • Vulnerability Exposure: We encountered a critical bug that was only resolved in the Substrate polkadot-1.16.0 release. Remaining on our current version (1.1.0) leaves the chain exposed to multiple known vulnerabilities patched in later releases.

  • Operational Risks: Falling further behind on upgrades risks catastrophic chain halts (e.g., finality stalls, block production failures). While we recovered from a recent multi-day outage, a recurrence in production could:

    • Damage the company’s reputation

    • Force costly emergency measures (chain reset/restart)

As the sole maintainer of the TFChain project at the moment, I am facing significant resource constraints. Despite the critical nature of this work, I am continuously redirected to other priorities and projects, which leaves essential TFChain development and maintenance tasks at risk of delay or neglect. I recommend allocating dedicated engineering resources to upgrade Substrate systematically. Postponing this work compounds the technical debt, because every skipped release adds more breaking changes and migrations to work through later, and it raises operational risk.

Additional context

https://github.com/threefoldtech/tfchain/issues/1029

sameh-farouk avatar Feb 19 '25 09:02 sameh-farouk

how much work to upgrade?

despiegk avatar Feb 19 '25 09:02 despiegk

> how much work to upgrade?

The required work will span several months (including iterations of development, deployment, and testing) and cannot be precisely estimated, as there are approximately 15 releases to migrate through, each of which may introduce breaking changes to the Substrate core pallets we depend on.

  • The upgrade roadmap is incremental: releases must be applied one after another and cannot be executed in parallel.
  • Each upgrade typically includes migrations for core pallets, requiring careful execution and thorough testing before deployment.
  • Every release must be deployed on development networks and rigorously tested to identify and resolve potential issues.

sameh-farouk avatar Feb 19 '25 10:02 sameh-farouk

I can provide more accurate timelines once I can dedicate focused effort to this, potentially after leading 1-2 upgrades myself.

BTW, this issue was known and flagged earlier (by Dylan before he left) but was deprioritized due to resource constraints. Since joining the TFChain project, one of my main focuses has been addressing the backlog of bug reports, particularly critical billing-related issues, all of which were resolved in the last milestone. Erwan was previously tasked with leading the upgrades but left before completing them.

sameh-farouk avatar Feb 19 '25 10:02 sameh-farouk

maybe it will be easier to just create a new TFChain 4.0 without billing or capacity tracking, which we won't need, and migrate the nodes to it

would that be less work?

despiegk avatar Feb 19 '25 12:02 despiegk

Possibly, yes. If there’s no plan to keep version 3 operational alongside version 4, then deprioritizing it is justified. I just felt it necessary to bring this issue to attention.

That said, I’m curious: what’s the timeline for planning and developing TFChain 4? I’d appreciate being involved in the discussions early on so I can contribute.

sameh-farouk avatar Feb 20 '25 09:02 sameh-farouk