elixir-omg

Childchain main supervisor stopped permanently after receiving unexpected result during block submission

Open unnawut opened this issue 4 years ago • 2 comments

2020-03-15 21:47:02.742 [error] module=gen_server function=error_info/7 ⋅GenServer OMG.ChildChain.BlockQueue.Server terminating
** (MatchError) no match of right hand side value: {:error, %{"code" => -32000, "message" => "execution aborted (timeout = 5s)"}}
(omg_child_chain) lib/omg_child_chain/block_queue.ex:166: OMG.ChildChain.BlockQueue.Server.handle_info/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:711: :gen_server.handle_msg/6
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: :check_ethereum_status⋅
2020-03-15 21:47:02.743 [info] module=OMG.ChildChain.BlockQueue.Server function=handle_continue/2 ⋅Starting Elixir.OMG.ChildChain.BlockQueue.Server service.⋅
2020-03-15 21:47:10.038 [error] module=OMG.Eth.RootChain function=contract_ready/1 ⋅The call to contract_ready failed with: %MatchError{term: {:error, %{"code" => -32000, "message" => "execution aborted (timeout = 5s)"}}}⋅
2020-03-15 21:47:10.039 [error] module=gen_server function=error_info/7 ⋅GenServer OMG.ChildChain.BlockQueue.Server terminating
** (MatchError) no match of right hand side value: {:error, :root_chain_contract_not_available}
(omg_child_chain) lib/omg_child_chain/block_queue.ex:85: OMG.ChildChain.BlockQueue.Server.handle_continue/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:388: :gen_server.loop/7
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:continue, :setup}⋅
2020-03-15 21:47:10.039 [info] module=OMG.ChildChain.BlockQueue.Server function=handle_continue/2 ⋅Starting Elixir.OMG.ChildChain.BlockQueue.Server service.⋅
2020-03-15 21:47:10.269 [error] module=gen_server function=error_info/7 ⋅GenServer OMG.ChildChain.BlockQueue.Server terminating
** (MatchError) no match of right hand side value: {:error, :geth_still_syncing}
(omg_child_chain) lib/omg_child_chain/block_queue.ex:82: OMG.ChildChain.BlockQueue.Server.handle_continue/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:388: :gen_server.loop/7
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:continue, :setup}⋅
2020-03-15 21:47:10.270 [info] module=OMG.ChildChain.BlockQueue.Server function=handle_continue/2 ⋅Starting Elixir.OMG.ChildChain.BlockQueue.Server service.⋅
2020-03-15 21:47:10.501 [error] module=gen_server function=error_info/7 ⋅GenServer OMG.ChildChain.BlockQueue.Server terminating
** (MatchError) no match of right hand side value: {:error, :geth_still_syncing}
(omg_child_chain) lib/omg_child_chain/block_queue.ex:82: OMG.ChildChain.BlockQueue.Server.handle_continue/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:388: :gen_server.loop/7
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:continue, :setup}⋅
2020-03-15 21:47:10.501 [info] module=OMG.ChildChain.BlockQueue.Server function=handle_continue/2 ⋅Starting Elixir.OMG.ChildChain.BlockQueue.Server service.⋅
2020-03-15 21:47:10.733 [error] module=gen_server function=error_info/7 ⋅GenServer OMG.ChildChain.BlockQueue.Server terminating
** (MatchError) no match of right hand side value: {:error, :geth_still_syncing}
(omg_child_chain) lib/omg_child_chain/block_queue.ex:82: OMG.ChildChain.BlockQueue.Server.handle_continue/2
(stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
(stdlib) gen_server.erl:388: :gen_server.loop/7
(stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:continue, :setup}⋅
2020-03-15 21:47:10.733 [error] module=OMG.ChildChain.Monitor function=handle_info/2 ⋅Child Chain supervisor crashed. Raising alarm. Reason :shutdown⋅
2020-03-15 21:47:10.733 [info] module=OMG.ChildChain.Monitor function=handle_event/2 ⋅Monitor got event: {:set_alarm, {:main_supervisor_halted, %{node: :"[email protected]", reporter: OMG.ChildChain.Monitor}}}. Ignoring.⋅
2020-03-15 21:47:10.734 [info] module=OMG.Eth.EthereumHeightMonitor.AlarmHandler function=handle_event/2 ⋅Elixir.OMG.Eth.EthereumHeightMonitor.AlarmHandler got event: {:set_alarm, {:main_supervisor_halted, %{node: :"[email protected]", reporter: OMG.ChildChain.Monitor}}}. Ignoring.⋅
2020-03-15 21:47:10.734 [info] module=OMG.Status.Metric.StatsdMonitor function=handle_event/2 ⋅inspect(Elixir.OMG.Status.Metric.StatsdMonitor) got event: {:set_alarm, {:main_supervisor_halted, %{node: :"[email protected]", reporter: OMG.ChildChain.Monitor}}}. Ignoring.⋅

unnawut, Mar 16 '20 12:03

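In OMG.ChildChain.BlockQueue, network and contract health are only checked at setup. From the log above, we can see that the contracts can also become unavailable later, during block submission. The resulting MatchError crashes the server, and the repeated crashes during restart stop the main supervisor.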

So we need to:

  • Move this check out so that it's done periodically
  • Set the alarm when the contracts cannot be called
  • Clear the alarm when the contract call is successful and restart the childchain's main supervisor
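The three steps above could be sketched roughly as follows. This is only an illustration, not the actual elixir-omg code: the module name `ContractHealthSketch`, the `:check` and `:on_change` options, and the 8-second default interval are all hypothetical. `check` stands in for whatever call verifies that the node is synced and the contract is reachable, and `on_change` for whatever sets or clears the alarm.

```elixir
defmodule ContractHealthSketch do
  use GenServer

  # All names and options here are hypothetical and exist only for illustration.
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      # Zero-arity function returning :ok or {:error, reason}.
      check: Keyword.fetch!(opts, :check),
      # One-arity function called with {:set_alarm, reason} or :clear_alarm.
      on_change: Keyword.fetch!(opts, :on_change),
      interval_ms: Keyword.get(opts, :interval_ms, 8_000),
      alarm_raised?: false
    }

    schedule(state.interval_ms)
    {:ok, state}
  end

  @impl true
  def handle_info(:check_health, state) do
    # Run the check periodically instead of only once at setup.
    new_state = transition(state, state.check.())
    schedule(state.interval_ms)
    {:noreply, new_state}
  end

  # Healthy -> unhealthy: raise the alarm once.
  def transition(%{alarm_raised?: false} = state, {:error, reason}) do
    state.on_change.({:set_alarm, reason})
    %{state | alarm_raised?: true}
  end

  # Unhealthy -> healthy: clear the alarm. A listener reacting to this
  # event could restart the main supervisor.
  def transition(%{alarm_raised?: true} = state, :ok) do
    state.on_change.(:clear_alarm)
    %{state | alarm_raised?: false}
  end

  # No change in health: nothing to report.
  def transition(state, _result), do: state

  defp schedule(interval_ms) do
    Process.send_after(self(), :check_health, interval_ms)
  end
end
```

Keeping `transition/2` free of timers and process state means the set/clear logic can be exercised without a running Ethereum node.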

unnawut, Mar 16 '20 12:03

I would like to see this check done only when things start to fail. Instead of checking periodically, you could also incorporate some sort of circuit breaker pattern into the retry logic (https://dzone.com/articles/understanding-retry-pattern-with-exponential-back)
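To make the idea concrete, here is a minimal sketch of a failure counter with exponential backoff that opens a circuit after repeated failures. It assumes Elixir 1.12+ for `Integer.pow/2`. The module name `BreakerSketch` and the thresholds are hypothetical, picked only for illustration. The health check would then run only while the circuit is open.

```elixir
defmodule BreakerSketch do
  # Hypothetical thresholds, chosen only for illustration.
  @max_failures 3
  @base_delay_ms 500
  @max_delay_ms 30_000

  def new(), do: %{failures: 0, status: :closed}

  # A success closes the circuit and resets the failure count.
  def record(_breaker, :ok), do: new()

  # Each failure increments the count; too many in a row opens the circuit.
  def record(%{failures: failures}, {:error, _reason}) do
    failures = failures + 1
    status = if failures >= @max_failures, do: :open, else: :closed
    %{failures: failures, status: status}
  end

  # The health check is only worth paying for once calls keep failing.
  def check_health?(%{status: status}), do: status == :open

  # Delay before the next attempt: doubles per consecutive failure, capped.
  def next_delay_ms(%{failures: 0}), do: 0

  def next_delay_ms(%{failures: failures}) do
    min(@base_delay_ms * Integer.pow(2, failures - 1), @max_delay_ms)
  end
end
```

With these values, consecutive failures wait 500 ms, 1000 ms, then 2000 ms, and the third failure opens the circuit.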

ETH mainnet going down seems like such a rare case that I would hate to pay this cost periodically, even if it is minuscule.

This also might not be possible given current limitations. Just thought I would put in my two cents.

iGetSchwifty, Mar 16 '20 12:03