dagstore
initializeShard lacks concurrency control
If I understand correctly, concurrency control is missing here. https://github.com/filecoin-project/dagstore/blob/1de8e01fd7d9ad5b297ecfb405dca1c3bea86d6f/dagstore_control.go#L103
This can lead to the following situations:
- In a large storage cluster, the concurrency of initializeShard becomes very high. Sectors then have not finished indexing by the time FIN completes, so the unsealed file is still held open. When lotus-worker deletes the unsealed file, it ends up in the deleted state but its disk space is not released, and the output of the cluster slowly declines.
- For clusters whose data has already been archived, if the indexes were never built successfully or some were lost for other reasons, boost re-establishes them. If many indexes need to be restored, the concurrency of initializeShard is also huge, and at very high concurrency read performance drops sharply.
@dirkmc
I think the retrieval failure "key not found" has a lot to do with this.
MaxConcurrentReadyFetches
I thought this variable would control the concurrency of reading data. After actually using it, it has no effect, and the problem is back.
root@hostname:~# ps -ef |grep boost
root 28607 300608 0 17:09 pts/7 00:00:00 grep --color=auto boost
root 141437 1 99 11:07 pts/3 07:00:41 boostd -vv run
root@hostname:~# lsof -n -p 141437 |wc -l
24143
root@hostname:~# ss -antup |wc -l
205232
My boost and miner are deployed on the same machine. Because many indexes have not been built successfully, the number of TCP connections is now abnormally high, with an unusually large number of TIME-WAIT connections to lotus-worker. (I don't know whether these connections belong to lotus-miner or boost.)
MaxConcurrencyStorageCalls = 100
This variable also appears to have no effect.