cardano-graphql fatal errors don't fully fail the systemd service unit

Open johnalotoski opened this issue 2 years ago • 0 comments

Summary

When cardano-graphql runs as a NixOS service under systemd, at least some logged errors appear to be fatal: after the error is logged, the cardano-graphql process stops doing work and the unit shows none of its usual activity. The process itself does not exit, however, so systemd still considers the unit active and does not restart it. The service then stays non-functional until someone intervenes manually.

This might be due to an exception in cardano-graphql that is eventually caught here, after which cardano-graphql stops but node continues running, so systemd believes the service is still up.

If so, logging a message that the server is exiting due to an exception after logging the error would be helpful. Logging the request associated with the exception would also be helpful.
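A minimal sketch of what that could look like, assuming a top-level handler is acceptable. `handleFatal` and `installFatalHandlers` are hypothetical names, not existing cardano-graphql functions, and the numeric levels only mirror the 30/50 values visible in the log output in this report (60 is assumed to mean "fatal"). The logger and the exit function are parameters so the behaviour can be checked without killing the process.

```typescript
// Sketch only: handleFatal and installFatalHandlers are hypothetical names,
// not functions that exist in cardano-graphql.
type LogRecord = Record<string, unknown>;
type LogFn = (record: LogRecord) => void;
type ExitFn = (code: number) => void;

// Default logger writes one JSON object per line to stderr.
const defaultLog: LogFn = (record) => console.error(JSON.stringify(record));
const defaultExit: ExitFn = (code) => process.exit(code);

function handleFatal(
  error: unknown,
  request?: string,
  log: LogFn = defaultLog,
  exit: ExitFn = defaultExit
): void {
  const err = error instanceof Error ? error : new Error(String(error));
  // 1. Log the error together with the request that triggered it, if known.
  log({ level: 50, msg: err.message, stack: err.stack, request });
  // 2. State explicitly that the process is exiting because of that error.
  log({ level: 60, msg: "Exiting due to unrecoverable error" });
  // 3. Exit with a non-zero code so the supervisor sees a failure.
  exit(1);
}

// Route otherwise-unhandled failures through the same path.
function installFatalHandlers(): void {
  process.on("uncaughtException", (error) => handleFatal(error));
  process.on("unhandledRejection", (reason) => handleFatal(reason));
}
```

The existing catch block could call something like `handleFatal(error, query)` instead of only logging. Whether systemd then restarts the unit depends on the unit's `Restart=` setting.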

Steps to reproduce the bug

Run cardano-graphql in an explorer stack under load. The problem occurs at random. In the example below, the Hasura client initializes and then an error is thrown, after which no further activity happens in the process.

cardano-graphql-start[$PID]: {"name":"cardano-graphql","hostname":"explorer-a","pid":$PID,"level":30,"module":"HasuraClient","msg":"Initializing","time":"$TIMESTAMP","v":0}
cardano-graphql-start[$PID]: {"name":"cardano-graphql","hostname":"explorer-c","pid":$PID,"level":50,"msg":"database query error: {\"response\":{\"errors\":[{\"extensions\":{\"code\":\"unexpected\",\"path\":\"$\"},\"message\":\"database query error\"}],\"status\":200},\"request\":{\"query\":\"query {\\n          epochs (limit: 1, order_by: { number: desc }) {\\n              adaPots {\\n                  reserves\\n              }\\n          }\\n          rewards_aggregate {\\n              aggregate {\\n                  sum {\\n                      amount\\n                  }\\n              }\\n          }\\n          utxos_aggregate {\\n              aggregate {\\n                  sum {\\n                      value\\n                  }\\n              }\\n          }\\n          withdrawals_aggregate {\\n              aggregate {\\n                  sum {\\n                      amount\\n                  }\\n              }\\n          }\\n      }\"}}","time":"$TIMESTAMP","v":0}

Actual Result

  • Manual intervention is required to restart the non-functional cardano-graphql service.

Expected Result

  • A fatal error that renders the cardano-graphql process non-functional should make it exit completely with a failure code, so that systemd recognizes the unit failure and takes its configured action.

Environment

Cardano-graphql 7.0.X and newer unreleased test branches

Platform

  • [ ] Linux (Ubuntu)
  • [X] Linux (Other)
  • [ ] macOS
  • [ ] Windows

Platform version

NixOS 21.11

Runtime

  • [X] Node.js
  • [ ] Docker

Runtime version

v12.15.0

johnalotoski avatar Dec 19 '22 23:12 johnalotoski