Sia
Sia host became very slow after unlocking wallet in v1.3.3
BUG REPORT
Stack Trace or error message
I have a host with a lot of data uploaded (several TB). I updated to v1.3.3 a few days ago.
Now, 10-30 minutes after I unlock the wallet, all siac commands become very slow (sometimes only siac host, sometimes both siac and siac host). They run for more than 10 minutes. SiaHub thinks that the host is down.
Dump of goroutines when it is stuck: https://gist.githubusercontent.com/starius/dcce9bf43197eab55f2a180d8c7b4d3c/raw/4e9275abdb15d7d7bc999bce40e8a627f450c8a2/gistfile1.txt
I think there is a deadlock somewhere in the code.
Is it safe to downgrade to the previous version?
Expected Behavior
Everything works without hanging.
How to reproduce it (as minimally and precisely as possible)
It happened only on one of my hosts. Others work well.
Environment
- Sia version: 1.3.3
- OS: Linux 64 bits
After a few hours it fixed itself somehow. Now only 283 goroutines are running.
In transactionpool.log I see a lot of lines like these:
accept.go:335: [DEBUG] Beginning broadcast of transaction set
accept.go:340: [DEBUG] Transaction set broadcast has failed: transaction set contains only duplicate transactions
I recorded a mutex profile using runtime.SetMutexProfileFraction(1) and got the following profile.
PS. It is worth enabling this mode in the profile/ directory as well.
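For anyone who wants to collect the same kind of data, here is a minimal sketch of recording a mutex profile in a Go program. It is not taken from siad; captureMutexProfile and contendedWorkload are hypothetical names used only for illustration. Only runtime.SetMutexProfileFraction and pprof.Lookup("mutex") are standard library calls.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// captureMutexProfile (hypothetical helper) turns on mutex profiling,
// runs the workload, and returns the profile in pprof's text format.
func captureMutexProfile(workload func()) (string, error) {
	// A fraction of 1 reports every contention event.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	workload()

	var buf bytes.Buffer
	// debug=1 asks for a human-readable dump instead of the binary format.
	if err := pprof.Lookup("mutex").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// contendedWorkload (hypothetical) makes several goroutines fight over one
// mutex so that the profile has something to report.
func contendedWorkload() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			time.Sleep(5 * time.Millisecond) // hold the lock for a while
			mu.Unlock()
		}()
	}
	wg.Wait()
}

func main() {
	profile, err := captureMutexProfile(contendedWorkload)
	if err != nil {
		panic(err)
	}
	fmt.Print(profile)
}
```

In a long-running daemon the profile would more likely be written to a file or served over HTTP; the buffer here just keeps the sketch self-contained.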
I see RPCRenewContract is involved in the problematic branch in the profile.
I think I found the root cause.
managedRPCRenewContract calls managedFinalizeContract, which calls managedAddStorageObligation, which locks h.mu and, while holding the mutex, calls AddSectorBatch, which goes through all sectors of the contract and updates their counts.
For a big contract it can take a while to update all the counts, and all other functions that need h.mu (basically everything related to the host) are blocked in the meantime. My suggestion is to avoid holding h.mu while the counts are updated. The count updates can be done in the background in an idempotent way (in case the server crashes). Alternatively, use a tracing GC instead of reference counts.
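To illustrate the suggestion, here is a minimal sketch, not Sia's actual code. The host type, its fields, and the methods addObligation, applySectorRefs, refCount and obligationCount are all hypothetical. The idea is to hold the main mutex only for the cheap bookkeeping and to update sector references under a separate lock, in a way that can be repeated safely. As a simplification, each sector stores the set of obligations that reference it, so a sector listed twice in one obligation counts once.

```go
package main

import (
	"fmt"
	"sync"
)

// host is a hypothetical stand-in for the real host object.
type host struct {
	mu          sync.Mutex          // plays the role of h.mu
	obligations map[string][]string // obligation ID -> sector IDs

	sectorMu sync.Mutex                     // guards refs only
	refs     map[string]map[string]struct{} // sector ID -> obligations referencing it
}

func newHost() *host {
	return &host{
		obligations: make(map[string][]string),
		refs:        make(map[string]map[string]struct{}),
	}
}

// addObligation records the obligation under mu, releases mu, and only then
// walks the sectors, so other users of mu are not blocked by a big contract.
func (h *host) addObligation(id string, sectors []string) {
	h.mu.Lock()
	h.obligations[id] = sectors
	h.mu.Unlock()

	h.applySectorRefs(id, sectors)
}

// applySectorRefs is idempotent: adding the same obligation to a sector's
// set twice leaves the set unchanged, so it can be re-run after a crash.
func (h *host) applySectorRefs(id string, sectors []string) {
	h.sectorMu.Lock()
	defer h.sectorMu.Unlock()
	for _, s := range sectors {
		set, ok := h.refs[s]
		if !ok {
			set = make(map[string]struct{})
			h.refs[s] = set
		}
		set[id] = struct{}{}
	}
}

// refCount returns how many obligations reference the sector.
func (h *host) refCount(sector string) int {
	h.sectorMu.Lock()
	defer h.sectorMu.Unlock()
	return len(h.refs[sector])
}

// obligationCount only needs mu, so it never waits on sector updates.
func (h *host) obligationCount() int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.obligations)
}

func main() {
	h := newHost()
	h.addObligation("c1", []string{"s1", "s2"})
	h.addObligation("c2", []string{"s2"})
	h.applySectorRefs("c2", []string{"s2"}) // replay: no double counting
	fmt.Println(h.refCount("s1"), h.refCount("s2"), h.obligationCount()) // prints "1 2 2"
}
```

Persistence and crash recovery are left out; the sketch only shows the narrower lock scope and the idempotent update.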
Yes, this is a known issue actually. There are a few potential fixes. The one I would really like to see involves redoing the way the host stores sectors so that you can renew them all in constant time. The other thing you can do is write a big WAL entry indicating which sectors need to be updated, and then update them later without blocking the whole time. It's still a big scalability issue, but at least it does not cause severe blocking.
Same issue here. I had to kill (SIGKILL) the siad daemon after 12+ hours of excessive I/O load. I've been delisted from siahub.info and wasn't able to accept new contracts. I've been forced to downgrade to 1.3.2. I hope this will be fixed soon, as 1.3.3 is completely unusable for me. Or is there some workaround to avoid those locks?
Did downgrade to 1.3.2 help?
-- Best regards, Boris Nagaev
I downgraded a short while ago; we will see. I have a similar setup to yours: a host running on 64-bit Debian Linux, hosting 2+ TB and about 900 contracts. Everything seems to be all right now, but I have to wait for new contract calls to be sure. I can provide more information if you want.
Well, downgrading didn't solve anything: same behavior with 1.3.2 after a few hours.
I think a workaround would be to reject a renewal if the contract size is more than X. But I think the same issue happens when a contract finishes as well.
After some time it returned to normal. I suspect there was some huge contract. I'm considering moving the Sia-generated files to an SSD to speed up the process in the future.
I am having the same problem hosting with version 1.3.3; increasing ulimit nofile to 10000+ does not help. First RPC slows down, then Sia-UI crashes because it cannot communicate over the RPC/API port 9980. During the slowdown, the wallet module still returns a result, while the host module returns nothing. I opened a ticket, issue #3141, after finding a similar nofile problem from the past.