
[Bug] Panic and crashloop

Open paymog opened this issue 1 year ago • 8 comments

Bug report

One of our indexers suddenly started crashing with the following. We're not sure why.

Relevant log output

thread 'tokio-runtime-worker' panicked at 'failed to parse mappings: Bad magic number (at offset 0)', chain/ethereum/src/capabilities.rs:62:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Feb 22 16:50:34.867 INFO Data source count at start: 2, sgd: 42628, subgraph_id: QmQ4pbd8UFcipKr5z3cYukp6kCj6pYRDYmQv2JYrcLQDBo, component: SubgraphInstanceManager
Panic in tokio task, aborting!
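For context on the error message: a valid WASM module must begin with the 4-byte magic number `0x00 0x61 0x73 0x6D` (`"\0asm"`), so a parser handed anything else reports a bad magic number at offset 0. The sketch below is illustrative, not graph-node's actual code; `looks_like_wasm` is a hypothetical helper.

```rust
// Hypothetical helper showing the check that fails here: WASM modules
// start with the magic bytes "\0asm". An AssemblyScript .ts source file
// starts with ASCII text instead, so parsing fails at offset 0.
fn looks_like_wasm(bytes: &[u8]) -> bool {
    bytes.starts_with(b"\0asm")
}

fn main() {
    // Minimal valid WASM header: magic number + version 1.
    let wasm_header = [0x00, 0x61, 0x73, 0x6D, 0x01, 0x00, 0x00, 0x00];
    assert!(looks_like_wasm(&wasm_header));
    // A .ts mapping source begins with text, not the magic bytes:
    assert!(!looks_like_wasm(b"import { BigInt } from '@graphprotocol/graph-ts'"));
    println!("magic number check ok");
}
```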

IPFS hash

No response

Subgraph name or link to explorer

No response

Some information to help us out

  • [ ] Tick this box if this bug is caused by a regression found in the latest release.
  • [ ] Tick this box if this bug is specific to the hosted service.
  • [X] I have searched the issue tracker to make sure this issue is not a duplicate.

OS information

None

paymog avatar Feb 22 '24 16:02 paymog

Added RUST_BACKTRACE=1 and now I see the following (the logs are actually messier and I tried to clean them up)

tokio-runtime-worker' panicked at 'failed to parse mappings: Bad magic number (at offset 0), ', sgdchain/ethereum/src/capabilities.rs: :4262862, backtrace:

   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: graph_core::subgraph::instance_manager::SubgraphInstanceManager<S>::build_subgraph_runner::{{closure}}
   4: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
   5: tokio::runtime::task::core::Core<T, S>::poll
   6: tokio::runtime::task::raw::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::scheduler::multi_thread::worker::run
   9: tokio::runtime::task::raw::poll
  10: tokio::runtime::task::UnownedTask<S>::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Panic in tokio task, aborting!

paymog avatar Feb 22 '24 17:02 paymog

hey @paymog is this specific to this subgraph QmQ4pbd8UFcipKr5z3cYukp6kCj6pYRDYmQv2JYrcLQDBo or do you see this on multiple deployments?

azf20 avatar Feb 23 '24 09:02 azf20

Turns out this happened because an unbuilt subgraph was deployed into our infra. We mitigated by removing the invalid subgraph. It would be good if graph-node could kill just that subgraph's instance manager task instead of the whole process.

I can't quite remember which ipfs hash was causing the issue - it may have been that one or a different one.
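The isolation suggested above can be sketched with `std::panic::catch_unwind` at the boundary of a single runner, so one bad deployment fails without taking the process down. This is a minimal sketch, not graph-node's actual code; `run_subgraph_runner` and `supervise` are hypothetical names, and a real tokio-based version would instead inspect `JoinError::is_panic()` on the task's `JoinHandle`.

```rust
use std::panic;

// Hypothetical stand-in for one subgraph's runner; panics for a bad deployment.
fn run_subgraph_runner(deployment: &str) {
    if deployment == "invalid" {
        panic!("failed to parse mappings: Bad magic number (at offset 0)");
    }
}

// Catch the panic at the runner boundary and turn it into an error for
// just this deployment, instead of aborting the whole process.
fn supervise(deployment: &str) -> Result<(), String> {
    panic::catch_unwind(|| run_subgraph_runner(deployment))
        .map_err(|_| format!("runner for {deployment} panicked; stopping only this deployment"))
}

fn main() {
    assert!(supervise("valid").is_ok());
    // The panic is contained (its message still goes to stderr by default):
    assert!(supervise("invalid").is_err());
    println!("process still alive");
}
```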

paymog avatar Feb 23 '24 15:02 paymog

What does 'unbuilt subgraph' mean here? Does that mean AssemblyScript was deployed instead of WASM blobs?

lutter avatar Feb 23 '24 21:02 lutter

Yup! The subgraph was uploaded without first running `graph build`, so it was AssemblyScript (.ts files) and not WASM.

paymog avatar Feb 24 '24 15:02 paymog

Wild. Is there maybe a bug in graph-cli that deployed assembly script sources? Also, I don't recognize the sgdchain/ethereum/src/capabilities.rs file name in the graph-node sources. Is this crashloop happening in vanilla graph-node?

lutter avatar Feb 25 '24 20:02 lutter

The subgraph wasn't uploaded using the graph CLI; it was uploaded using a custom CLI tool. Whoops, I probably didn't clean up the logs perfectly — I think the `sgd` prefix on the path is incorrect, and the right path is chain/ethereum/src/capabilities.rs. Yup, the crash loop is happening in vanilla v0.34.0.

paymog avatar Feb 26 '24 14:02 paymog