go-spacemesh
proposal building took too long
Description
I've seen 5-6 reports of this issue from community members: a smesher that is eligible to produce a proposal fails to do so with this error:
ERROR 46c52.proposalBuilder proposal building took too long
I've confirmed that they're running go-spacemesh v1.1.8 and that the state database is stored on SSD, so there's no immediate reason this should be slow. Also confirmed with users that their system clock is correct.
Some of these nodes successfully create one proposal then fail to create the next one, e.g.:
2023-09-26T09:55:00.424+0200 INFO 46c52.proposalBuilder proposal created {"node_id": "46c52d6764d315a58d93e904779a166d3cff5a984f72206e64018a08604c0cdd", "module": "proposalBuilder", "event": true, "sessionId": "afbdee96-59c9-4394-a931-56fbc21192f0", "layer_id": 21311, "proposal_id": "3fbbf89f79", "num txs": 0, "name": "proposalBuilder"}
2023-09-26T20:28:17.385+0200 INFO 46c52.proposalBuilder proposal created {"node_id": "46c52d6764d315a58d93e904779a166d3cff5a984f72206e64018a08604c0cdd", "module": "proposalBuilder", "event": true, "sessionId": "68596208-bcf4-4f80-b006-f9b861847d85", "layer_id": 21437, "proposal_id": "7172dc5fcd", "num txs": 0, "name": "proposalBuilder"}
2023-09-26T20:28:17.385+0200 ERROR 46c52.proposalBuilder proposal building took too long {"node_id": "46c52d6764d315a58d93e904779a166d3cff5a984f72206e64018a08604c0cdd", "module": "proposalBuilder", "sessionId": "68596208-bcf4-4f80-b006-f9b861847d85", "layer_id": 21437, "epoch_id": 5, "elapsed": "3m17.018414207s", "name": "proposalBuilder"}
Sometimes the proposal takes ~12-15 sec to create, sometimes (as in the above log) it's several minutes.
Steps to reproduce
Unknown
Actual Behavior
Proposal building sometimes succeeds, sometimes fails
Expected Behavior
Proposal building should not take so long
Environment
Please complete the following information:
- OS: win11
- Node Version: v1.1.8
Additional Resources
Log attached
Just a little more info to support this: Win11 (Ryzen 9), SSD, go-sm 1.1.9, 4 nodes (CPU avg at 20%), no 3rd-party monitoring or other apps.
Layer 21713, node 1: OK. Layer 21720, node 2: FAIL at 28s. output.zip
@xeliuqa thanks. SSD is Crucial MX500 right? monitoring is turned off right?
That's correct.
All 4 nodes are running on the same hardware?
I think to reproduce this, someone should take the latest state and try to build proposals with different parallelism factors.
All 4 on the same machine, yes. Further rewards went OK... so far.
Another user reported they're using a Samsung SSD 860 EVO and are not running metrics.
got another report of the same issue:
- Crucial MX500 as an external SSD (both state.sql and PoST data on the same disk). CrystalDiskMark shows good speeds, around 450 MB/s read/write.
- Intel NUC 1260P (64 GB RAM)
- Internet speedtest shows 80 Mbps download, 30 Mbps upload (low ping)
- running 3 nodes (the 4th is waiting for PoET), but I don't think that's an issue as resource usage is quite low
@lrettig regarding the report above: it's on a pre-1.2.1 version, so most likely the SSD gets overloaded from time to time.
- [x] provide better debug data (such as step-by-step latency) https://github.com/spacemeshos/go-spacemesh/pull/5130
- [ ] load active set when node boots up, and not on layer timer
- [ ] mesh hash resolution can use the tortoise's in-memory state instead of disk (especially for checking whether hare completed within hdist)
I reworked some inefficiencies in that area. Also, in v1.2.2 there will be a log like:
logger.go:130: 2023-10-18T09:46:21.554+0200 INFO proposal created {"signer": "ea734", "proposal_id": "41510aa9e1", "transactions": 2, "mesh_hash": "0000000000", "ballot_id": "831ec4935c", "layer_id": 15, "epoch_id": 3, "smesher": "ea7348fcf7ded45e8a6add6a6952724875985d705ad315a8e9e9a4597f8b39e4", "opinion hash": "0100000000", "base_ballot": "0000000000", "support": 0, "against": 0, "abstain": 0, "atx_id": "0200000000", "ref_ballot": "0000000000", "active set hash": "5a22d9a8ec", "beacon": "01000000", "votes": {"base": "0000000000", "support": [], "against": [], "abstain": []}, "latency": {"data": "1.978854ms", "tortoise": "173.229µs", "hash": "72.279µs", "txs": "3.353µs", "publish": "126.27µs", "total": "2.350632ms"}, "name": ""}
See the last fields: it tracks how long the "expensive" parts took. If people report slow proposals again, let's get this info log from them.
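If it helps when collecting those reports, here is a small, hypothetical helper (not part of go-spacemesh) that pulls the latency breakdown out of such a log line; the structured zap fields are the JSON object starting at the first `{`:

```go
// Hypothetical helper to extract the per-step latency breakdown
// from a "proposal created" log line.
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	line := `... INFO proposal created {"layer_id": 15, "latency": {"data": "1.978854ms", "tortoise": "173.229µs", "hash": "72.279µs", "txs": "3.353µs", "publish": "126.27µs", "total": "2.350632ms"}}`

	// Parse the JSON object that starts at the first '{'.
	var fields struct {
		LayerID uint32            `json:"layer_id"`
		Latency map[string]string `json:"latency"`
	}
	if err := json.Unmarshal([]byte(line[strings.Index(line, "{"):]), &fields); err != nil {
		panic(err)
	}
	fmt.Printf("layer %d: total=%s tortoise=%s publish=%s\n",
		fields.LayerID, fields.Latency["total"], fields.Latency["tortoise"], fields.Latency["publish"])
}
```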
One other thing we can do is prepare the activeset in advance. In the current implementation the activeset is selected in the first layer of the epoch, and the selection process becomes slower as the number of activations in the network grows: we have to scan all activations that target the epoch and check when they were received, and if/when malfeasance proofs were received for the identities that created those activations.
We don't have to wait until the first layer to initiate the activeset creation process; instead we can do that as soon as we know the activeset won't be modified anymore. Due to grading we won't select activations received less than 2h prior to the epoch start, so we can initiate the process around that time.
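A minimal sketch of that scheduling idea, assuming the 2h grading window and using illustrative names rather than the actual go-spacemesh API:

```go
package main

import (
	"context"
	"time"
)

// gradingWindow mirrors the rule described above: activations received less
// than 2h before epoch start are not selected, so the set is stable by then.
const gradingWindow = 2 * time.Hour

// scheduleActiveSetPrep waits until the grading cutoff for the coming epoch
// and then kicks off the expensive activeset scan in the background, instead
// of doing it in the first layer of the epoch.
func scheduleActiveSetPrep(ctx context.Context, epochStart time.Time, prepare func(context.Context)) {
	cutoff := epochStart.Add(-gradingWindow)
	select {
	case <-ctx.Done():
	case <-time.After(time.Until(cutoff)):
		// Scan ATXs targeting the epoch, check receipt times and malfeasance proofs.
		go prepare(ctx)
	}
}
```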
Another inefficiency is that we regenerate the activeset when the node is restarted, because we use the reference ballot to track whether we already prepared it. Instead we can save the prepared activeset to the database with a boolean marking which one we prepared.
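A rough sketch of that persistence idea, with an assumed table name and schema (the node has its own sql layer, so this is illustrative only):

```go
package activeset

import "database/sql"

// SavePrepared stores the activeset hash we built for an epoch and marks it
// as locally prepared, so a restart can reuse it instead of regenerating it.
// Assumes a table: prepared_activeset(epoch INTEGER PRIMARY KEY, hash BLOB, prepared INTEGER).
func SavePrepared(db *sql.DB, epoch uint32, hash []byte) error {
	_, err := db.Exec(
		`INSERT OR REPLACE INTO prepared_activeset (epoch, hash, prepared) VALUES (?, ?, 1)`,
		epoch, hash,
	)
	return err
}

// GetPrepared returns the stored hash for the epoch, or ok=false if we never
// prepared one (e.g. a fresh database or a different epoch).
func GetPrepared(db *sql.DB, epoch uint32) (hash []byte, ok bool, err error) {
	err = db.QueryRow(
		`SELECT hash FROM prepared_activeset WHERE epoch = ? AND prepared = 1`, epoch,
	).Scan(&hash)
	if err == sql.ErrNoRows {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	return hash, true, nil
}
```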