[Windows] [vasil-testnet-v1] the node stops responding to `query tip` and the socket is not found anymore
Internal/External Internal if an IOHK staff member.
Area node syncing on windows
Summary
When using tag vasil-testnet-v1, only on Windows, in 3 out of 4 cases, the node stops responding to query tip and the socket is not found anymore.
Steps to reproduce Steps to reproduce the behavior:
- Start node on windows using tag
vasil-testnet-v1 - query the tip of the node each 10s until the node is fully synced (syncProgress = 100)
- https://github.com/input-output-hk/cardano-node-tests/runs/6544279877?check_suite_focus=true
- waiting for the node to sync
actual_era : byron - actual_epoch: 0 - actual_block: 309 - actual_slot : 335 - syncProgress: 0
actual_era : shelley - actual_epoch: 282 - actual_block: 101157 - actual_slot : 2022[49](https://github.com/input-output-hk/cardano-node-tests/runs/6544279877?check_suite_focus=true#step:5:50)9 - syncProgress: 3
actual_era : shelley - actual_epoch: 583 - actual_block: 206081 - actual_slot : 418[50](https://github.com/input-output-hk/cardano-node-tests/runs/6544279877?check_suite_focus=true#step:5:51)25 - syncProgress: 7
actual_era : shelley - actual_epoch: 897 - actual_block: 316433 - actual_slot : 6447080 - syncProgress: 11
actual_era : shelley - actual_epoch: 1225 - actual_block: 423988 - actual_slot : 8806800 - syncProgress: 15
actual_era : allegra - actual_epoch: 1539 - actual_block: 532758 - actual_slot : 11071320 - syncProgress: 19
actual_era : allegra - actual_epoch: 1907 - actual_block: 639587 - actual_slot : 13720235 - syncProgress: 24
actual_era : mary - actual_epoch: 2206 - actual_block: 746268 - actual_slot : 15871297 - syncProgress: 28
actual_era : mary - actual_epoch: 2556 - actual_block: 8[52](https://github.com/input-output-hk/cardano-node-tests/runs/6544279877?check_suite_focus=true#step:5:53)273 - actual_slot : 18391115 - syncProgress: 33
actual_era : mary - actual_epoch: 2858 - actual_block: 957501 - actual_slot : 20564587 - syncProgress: 37
actual_era : mary - actual_epoch: 3155 - actual_block: 10632[53](https://github.com/input-output-hk/cardano-node-tests/runs/6544279877?check_suite_focus=true#step:5:54) - actual_slot : 22707147 - syncProgress: 40
=== Waiting [60](https://github.com/input-output-hk/cardano-node-tests/runs/6544279877?check_suite_focus=true#step:5:61)s before retrying to get the tip again - 0
!!!ERROR: command cardano-cli.exe query tip --testnet-magic 3 return with error (code 1): b'MuxError (MuxIOException withIODataPtr (readHandle): resource vanished (The pipe has been ended.)) "(readHandle errored)"\r\n\r\ncardano-cli.exe: MuxIOException withIODataPtr (readHandle): resource vanished (The pipe has been ended.) "(readHandle errored)"'
=== Waiting 60s before retrying to get the tip again - 1
!!!ERROR: command cardano-cli.exe query tip --testnet-magic 3 return with error (code 1): b'\r\ncardano-cli.exe: CreateFile "\\\\\\\\.\\\\pipe\\\\cardano-node": does not exist (The system cannot find the file specified.)'
=== Waiting 60s before retrying to get the tip again - 2
!!!ERROR: command cardano-cli.exe query tip --testnet-magic 3 return with error (code 1): b'\r\ncardano-cli.exe: CreateFile "\\\\\\\\.\\\\pipe\\\\cardano-node": does not exist (The system cannot find the file specified.)'
Expected behavior
The node should successfully sync and query tip should not return errors.
System info (please complete the following information):
- OS Name: Windows
- Node version
vasil-testnet-v1 - CLI version
vasil-testnet-v1
Hi, I can see this is assigned to @coot in the JIRA ticket, can someone please assign to him on GH too - I dont have the permissions
@dorin100 does the node stops? Could make sure the following traces are turn on:
TraceLocalHandshakeTraceErrorPolicyTraceLocalErrorPolicy
if they are where can I find a the log file?
@coot the node does not seem to stop.
The GH jobs that ran these tests can be found here (both are using node tag vasil-testnet-v1):
- first run - 2 out of 2 windows jobs failed - https://github.com/input-output-hk/cardano-node-tests/actions/runs/2366734246
- second run - 1 out of 2 windows jobs failed - https://github.com/input-output-hk/cardano-node-tests/actions/runs/2367102172
The logs for each job can be found at the above links in the Artifacts section.
All the jobs are run as default with:
TraceLocalHandshake = false
TraceErrorPolicy = true
TraceLocalErrorPolicy = true
Here all windows jobs are ok but we are using tag 1.34.1 (last released one) - https://github.com/input-output-hk/cardano-node-tests/actions/runs/2367105131
FYI: the below flags were not set for any of those sync tests (but only the windows ones are failing)
"TestEnableDevelopmentNetworkProtocols": true,
"TestEnableDevelopmentHardForkEras": true,
~@dorin100 when I look at the recent logs node_logs_shelley_qa_windows-latest from this run. I can only see failing Plutus scripts~
I looked at the wrong logs.
@dorin100 my Windows node synced, and I wasn't able to reproduce the issue: every 10s a cardano-cli was able to query the tip of the node. Could you please enable the following tracers and configure them with the given minimal severities:
| Tracer | Severity |
|---|---|
TraceLocalMux |
Info |
TraceLocalHandshake |
Info |
TraceLocalErrorPolicy |
Debug |
TraceErrorPolicy |
Debug |
@coot - the Window job failed again but this time I used the above tracers. The results can be found here - https://github.com/input-output-hk/cardano-node-tests/actions/runs/2495501357 (look for the [node_logs_staging_windows-latest](https://github.com/input-output-hk/cardano-node-tests/suites/6926637872/artifacts/269779719))
this was reproducible in 1 out of 2 runs also with 1.35.0-rc3 --> job: https://github.com/input-output-hk/cardano-node-tests/actions/runs/2501998632
just for completeness sake: by reading the logs I didn't found anything wrong with the node. A few seconds before cardano-node shutdown the it accepted the last local connection, while cardano-cli was failing to connect very 60s. We don't have time stamps when cardano-cli was trying to create these connections, but it seems as it's trying to do that after cardano-node terminated.
We discovered:
DiffusionErrored (ExceptionInLinkedThread "ThreadId 34" <stderr>: hPutChar: resource vanished (Broken pipe)
which indicates that there's something wrong with the logging system.
The issue is not reproducing when we run the node with "TurnOnLogging": false
PS: the above error might indicate that the io error was propagated to the top level and the node shutdown (and it should not shut down).
The above error can be found in the logs here:
- GH pipeline - https://github.com/input-output-hk/cardano-node-tests/actions/runs/2536153894
- machine -
node_logs_staging_windows-latest
So, this is legacy tracing system.
We can fix this, by migrating Daedalus to the new tracing system -- which I believe is ready to accomodate Daedalus.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.