scylla-cluster-tests
`node_exporter` may hang on a DB node with the `error encoding and sending metric family: write tcp %IP%:9100` error
Issue description
- [ ] This issue is a regression.
- [x] It is unknown if this issue is a regression.
While setting up Scylla version 2023.1.11, one of the nodes hung with the following errors:
2024-09-13T11:33:57.764+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1 !INFO | scylla[15426]: \
[shard 0] stream_session - [Stream #10657cf0-71c4-11ef-830a-21e3b321ba22] Streaming plan for Bootstrap-system_distributed-index-10 succeeded, peers={10.142.0.14}, tx=0 KiB, 0.00 KiB/s, rx=0 KiB, 0.00 KiB/s
2024-09-13T11:34:01.006+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1 !INFO | node_exporter[14047]: \
ts=2024-09-13T11:34:00.709Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:60390: write: broken pipe"
2024-09-13T11:34:01.017+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1 !INFO | node_exporter[14047]: \
ts=2024-09-13T11:34:00.728Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:60390: write: broken pipe"
...
2024-09-13T12:31:19.282+00:00 rolling-upgrade-ltncy-rgrssn--ubunt-db-node-68892a87-0-1 !INFO | node_exporter[14047]: \
ts=2024-09-13T12:31:19.031Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp 10.142.0.10:9100" msg="->10.142.0.22:33436: write: broken pipe"
Later the CI job was aborted.
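For anyone triaging similar logs, here is a minimal sketch (not part of SCT) that counts these `broken pipe` errors per scraper address and reports the first and last timestamps. The function name `summarize_broken_pipes` and the regex are assumptions derived only from the log lines quoted above; it also assumes the `\` continuation lines have already been joined.

```python
import re
from collections import Counter

# Matches the node_exporter error shape seen in the log excerpt above.
# Hypothetical pattern: adjust it if the real log format differs.
BROKEN_PIPE_RE = re.compile(
    r'ts=(?P<ts>\S+) .*?write tcp (?P<local>[\d.]+:\d+)" '
    r'msg="->(?P<remote_ip>[\d.]+):(?P<remote_port>\d+): write: broken pipe"'
)


def summarize_broken_pipes(lines):
    """Return (counts, first_ts, last_ts) for broken-pipe errors.

    counts maps (local_endpoint, remote_ip) to the number of matching lines.
    first_ts / last_ts are the raw `ts=` values of the first and last match,
    in input order, or None if nothing matched.
    """
    counts = Counter()
    first_ts = None
    last_ts = None
    for line in lines:
        match = BROKEN_PIPE_RE.search(line)
        if match is None:
            continue
        counts[(match["local"], match["remote_ip"])] += 1
        if first_ts is None:
            first_ts = match["ts"]
        last_ts = match["ts"]
    return counts, first_ts, last_ts
```

A long span between the first and last timestamp for the same scraper address would suggest the exporter kept failing to answer scrapes for the whole period rather than hitting a one-off disconnect.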
Steps to Reproduce
- Set up a custom_d1 (with special disk config) 3-node DB cluster
- See error
- [and so on...]
Expected behavior: node exporter must always be working correctly.
Actual behavior: node exporter may randomly hang.
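To tell a hung exporter apart from one that is simply not listening, a time-bounded probe of the metrics endpoint may help. This is only a sketch under assumptions: `probe_node_exporter` is a hypothetical helper, not an SCT function, and it assumes the exporter serves plain HTTP at `/metrics` on port 9100, as the log lines suggest.

```python
import http.client
import urllib.request


def probe_node_exporter(host: str, port: int = 9100, timeout: float = 5.0) -> bool:
    """Return True if GET /metrics answers with HTTP 200, else False.

    `timeout` bounds each blocking socket operation (connect, read),
    not the total request time.
    """
    url = f"http://{host}:{port}/metrics"
    # Empty ProxyHandler: ignore proxy environment variables so the
    # request goes straight to the node.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            resp.read()  # a hung exporter may accept the connection but never finish the body
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP errors and malformed replies.
        return False
```

Calling `probe_node_exporter("10.142.0.10")` during node setup and logging a `False` result would at least timestamp the moment the exporter stopped answering.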
Impact
Setup of a DB node hangs, which spoils the test run.
How frequently does it reproduce?
~3/11 test runs. It is too frequent.
Installation details
SCT Version: master
Scylla version (or git commit hash): 2023.1.11-0.20240729.5a79e79a0320 with build-id 4daf2e1487b1ab784ff564a6c8fd75f9ddd8a9ac
Logs
- test_id: 68892a87-5b38-440a-87eb-dcaa3f7fd04c
- job log: scylla-staging/valerii/vp-rolling-upgrade-latency-regression#13
@vponomaryov
if it's the node_exporter on the DB node, I think it's something that needs to be reported on scylla core...
> @vponomaryov
> if it's the node_exporter on the DB node, I think it's something that needs to be reported on scylla core...
We have a lot of configuration code for it in SCT.
> > @vponomaryov if it's the node_exporter on the DB node, I think it's something that needs to be reported on scylla core...
>
> We have a lot of configuration code for it in SCT.
again, if it's a DB node, we are not configuring it, scylla core is.
if it's still visible or happening, please raise it with core