
initializeShard lacks concurrency control

Open beck-8 opened this issue 1 year ago • 5 comments

If I understand correctly, concurrency control is missing here. https://github.com/filecoin-project/dagstore/blob/1de8e01fd7d9ad5b297ecfb405dca1c3bea86d6f/dagstore_control.go#L103

It can cause the following situations:

  1. If the storage cluster is large, the concurrency of initializeShard becomes very high, so a sector may not have finished indexing after FIN and the unsealed file is still held open. When lotus-worker then deletes the unsealed copy, the file is marked as deleted but its space is not released, and the output of the cluster slowly declines.
  2. For clusters that have been archived, where the indexes were never all built successfully or some indexes were lost for other reasons, boost will rebuild the indexes. If the number to be restored is large, the concurrency of initializeShard will also be huge, and at that level of concurrency read performance drops sharply.

@dirkmc

beck-8, Jul 30 '23 08:07

I think the retrieval failure "key not found" has a lot to do with this.

beck-8, Jul 30 '23 08:07

MaxConcurrentReadyFetches

beck-8, Jul 31 '23 02:07

MaxConcurrentReadyFetches

I thought this variable would control the concurrency of reading data. After actually using it, it has no effect, and the problem remains.

beck-8, Jul 31 '23 09:07

root@hostname:~# ps -ef |grep boost
root      28607 300608  0 17:09 pts/7    00:00:00 grep --color=auto boost
root     141437      1 99 11:07 pts/3    07:00:41 boostd -vv run
root@hostname:~# lsof -n -p 141437 |wc -l
24143
root@hostname:~# ss -antup |wc -l
205232

My boost and miner are deployed on one machine. Because many indexes have not been built successfully, the number of TCP connections is now abnormally high, with an unusually large number of TIME-WAIT connections to lotus-worker. (I don't know whether these connections belong to lotus-miner or boost.)

beck-8, Jul 31 '23 09:07

MaxConcurrencyStorageCalls = 100: this variable also appears to have no effect.

beck-8, Jul 31 '23 09:07