Sia host became very slow after unlocking wallet in v1.3.3

Open starius opened this issue 6 years ago • 12 comments

BUG REPORT

Stack Trace or error message

I have a host with a lot of data uploaded (several TB). I updated to v1.3.3 a few days ago. Now, 10-30 minutes after I unlock the wallet, all siac commands become very slow (sometimes only siac host, sometimes both siac and siac host). They run for more than 10 minutes. SiaHub thinks that the host is down.

Dump of goroutines when it is stuck: https://gist.githubusercontent.com/starius/dcce9bf43197eab55f2a180d8c7b4d3c/raw/4e9275abdb15d7d7bc999bce40e8a627f450c8a2/gistfile1.txt

I think there is a deadlock somewhere in the code.

Is it safe to downgrade to the previous version?

Expected Behavior

Everything works without hanging.

How to reproduce it (as minimally and precisely as possible)

It happened only on one of my hosts. Others work well.

Environment

  • Sia version: 1.3.3
  • OS: Linux 64 bits

starius avatar Jun 17 '18 18:06 starius

After a few hours it fixed itself somehow. Now only 283 goroutines are running.

In transactionpool.log I see a lot of lines like these:

accept.go:335: [DEBUG] Beginning broadcast of transaction set
accept.go:340: [DEBUG] Transaction set broadcast has failed: transaction set contains only duplicate transactions

starius avatar Jun 17 '18 21:06 starius

I recorded mutex profiling using runtime.SetMutexProfileFraction(1) and got the following profile.

[image: profile001]

P.S. It is worth enabling this mode in the profile/ directory as well.
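For anyone who wants to record a similar profile, here is a minimal sketch using only the Go standard library. The helper names captureMutexProfile and contend are made up for illustration and are not part of Sia; the sketch only shows the runtime.SetMutexProfileFraction and runtime/pprof calls involved.

```go
package main

import (
	"bytes"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

// captureMutexProfile (hypothetical helper) turns on mutex profiling for
// every contention event, runs work, and returns the profile as text.
func captureMutexProfile(work func()) (string, error) {
	// A fraction of 1 reports every contention event; the previous
	// setting is restored when the function returns.
	prev := runtime.SetMutexProfileFraction(1)
	defer runtime.SetMutexProfileFraction(prev)

	work()

	var buf bytes.Buffer
	// debug=1 requests the human-readable text form of the profile.
	if err := pprof.Lookup("mutex").WriteTo(&buf, 1); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// contend (hypothetical) makes several goroutines fight over one mutex
// so that the profile has something to report.
func contend() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			time.Sleep(5 * time.Millisecond)
			mu.Unlock()
		}()
	}
	wg.Wait()
}
```

Calling captureMutexProfile(contend) from main and printing the result gives a text dump; writing with debug=0 instead produces the binary format that go tool pprof can render as a graph.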

starius avatar Jun 17 '18 23:06 starius

I see RPCRenewContract is involved in the problematic branch in the profile.

starius avatar Jun 17 '18 23:06 starius

I think I found the root cause.

managedRPCRenewContract calls managedFinalizeContract which calls managedAddStorageObligation which locks h.mu and under the mutex calls AddSectorBatch which goes through all sectors of the contract and updates counts.

For a big contract it can take a while to update all the counts, and all other functions involving h.mu (basically everything related to the host) are blocked in the meantime. My suggestion is to avoid holding h.mu while the counts are updated. The count updates can be done in the background in an idempotent way (in case the server crashes). Or just have a tracing GC instead of a refcount.
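To illustrate the idea of not holding h.mu during the count update, here is a rough sketch. The host type, its fields and addObligation below are simplified stand-ins invented for this example, not the real Sia structures, and the sketch ignores persistence and crash safety entirely.

```go
package main

import "sync"

// host is a simplified stand-in for the real host type.
type host struct {
	mu          sync.Mutex          // plays the role of h.mu
	obligations map[string][]string // obligation ID -> sector IDs

	countMu sync.Mutex     // guards only the sector counts
	counts  map[string]int // sector ID -> reference count
}

func newHost() *host {
	return &host{
		obligations: make(map[string][]string),
		counts:      make(map[string]int),
	}
}

// addObligation records the obligation under h.mu, releases it, and only
// then walks the sectors. The long loop blocks other count updates but
// not the callers that need h.mu.
func (h *host) addObligation(id string, sectors []string) {
	h.mu.Lock()
	h.obligations[id] = sectors
	h.mu.Unlock()

	h.countMu.Lock()
	defer h.countMu.Unlock()
	for _, s := range sectors {
		h.counts[s]++
	}
}
```

The trade-off is that the obligation is visible before its counts are fully updated, so real code would need to make sure nothing relies on the two being in sync.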

starius avatar Jun 17 '18 23:06 starius

Yes, this is a known issue actually. There are a few potential fixes. The one I would really like to see involves redoing the way the host stores sectors so that you can just renew them all in constant time. The other thing you can do is write a big WAL entry indicating which sectors need to be updated, and then update them later without blocking the whole time. It's still a big scalability issue, but at least it does not cause severe blocking.
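A rough sketch of the second option, recording what needs updating and applying it later in small steps, could look like the following. Everything here is hypothetical: sectorJournal is an in-memory stand-in for an on-disk WAL, and none of the names come from the Sia codebase.

```go
package main

import "sync"

// pendingUpdate is one journal entry: the sectors of one obligation
// whose counts still need to be bumped.
type pendingUpdate struct {
	obligationID string
	sectors      []string
}

// sectorJournal is an in-memory stand-in for a WAL of pending updates.
type sectorJournal struct {
	mu      sync.Mutex
	pending []pendingUpdate
	applied map[string]bool // obligation IDs already claimed
	counts  map[string]int  // sector ID -> reference count
}

func newSectorJournal() *sectorJournal {
	return &sectorJournal{applied: make(map[string]bool), counts: make(map[string]int)}
}

// record appends an entry; the lock is held only for the append.
func (j *sectorJournal) record(id string, sectors []string) {
	j.mu.Lock()
	j.pending = append(j.pending, pendingUpdate{id, sectors})
	j.mu.Unlock()
}

// applyPending drains the journal, updating counts chunk by chunk and
// releasing the lock between chunks. It returns the number of entries
// applied; entries whose ID was already claimed are skipped.
func (j *sectorJournal) applyPending(chunk int) int {
	if chunk < 1 {
		chunk = 1
	}
	j.mu.Lock()
	todo := j.pending
	j.pending = nil
	j.mu.Unlock()

	done := 0
	for _, u := range todo {
		j.mu.Lock()
		seen := j.applied[u.obligationID]
		j.applied[u.obligationID] = true // claim so duplicates apply once
		j.mu.Unlock()
		if seen {
			continue
		}
		for start := 0; start < len(u.sectors); start += chunk {
			end := start + chunk
			if end > len(u.sectors) {
				end = len(u.sectors)
			}
			j.mu.Lock()
			for _, s := range u.sectors[start:end] {
				j.counts[s]++
			}
			j.mu.Unlock()
		}
		done++
	}
	return done
}
```

In a real host the journal and the applied marker would have to be persisted so that replay after a crash is safe; the sketch only shows how the lock can be released between chunks.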

DavidVorick avatar Jun 18 '18 15:06 DavidVorick

Same issue here. I had to kill (SIGKILL) the siad daemon after 12+ hours of excessive I/O load. I've been delisted from siahub.info and wasn't able to accept new contracts. I've been forced to downgrade to 1.3.2. I hope this will be fixed soon, as 1.3.3 is completely unusable for me. Or is there some workaround to avoid those locks?

volvox-globator avatar Jun 20 '18 10:06 volvox-globator

Did downgrading to 1.3.2 help?

-- Best regards, Boris Nagaev

starius avatar Jun 20 '18 10:06 starius

I downgraded just a while ago, so we will see. I have a similar setup to yours: a host running on 64-bit Debian Linux, hosting 2+ TB and about 900 contracts. Everything seems to be alright now, but I have to wait for new contract calls to be sure. I can provide more information if you want.

volvox-globator avatar Jun 20 '18 11:06 volvox-globator

Well, downgrading didn't solve anything. Same behavior with 1.3.2 after a few hours.

volvox-globator avatar Jun 20 '18 13:06 volvox-globator

I think a workaround would be to reject a renewal if the contract size is more than X. But I think the same issue happens when a contract finishes as well.
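As a sketch of what such a check might look like (the function name, the error value and the limit are all hypothetical; the thread does not say what X should be or where in the renew RPC the check would go):

```go
package main

import (
	"errors"
	"fmt"
)

// errContractTooLarge is a hypothetical error for rejected renewals.
var errContractTooLarge = errors.New("contract too large to renew")

// checkRenewSize rejects a renewal when the contract holds more than
// maxBytes of data. maxBytes plays the role of X.
func checkRenewSize(contractBytes, maxBytes uint64) error {
	if contractBytes > maxBytes {
		return fmt.Errorf("%w: %d bytes exceeds limit of %d",
			errContractTooLarge, contractBytes, maxBytes)
	}
	return nil
}
```

The check would have to run before any host-wide lock is taken, otherwise it does not help with the blocking.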

starius avatar Jun 20 '18 13:06 starius

After some time it returned to normal. I suspect there was some huge contract. I'm considering moving the files generated by Sia to an SSD to speed up the process in the future.

volvox-globator avatar Jun 20 '18 16:06 volvox-globator

I am having the same error hosting with version 1.3.3. Increasing ulimit nofile to 10000+ does not help. First there is a slowdown on RPC, then Sia-UI crashes after not being able to communicate over the RPC/API port 9980. During the slowdown, the wallet module returns a result while the host module returns nothing. I opened a ticket, issue #3141, after finding a similar nofile problem from the past.

EvilRedHorse avatar Jul 04 '18 15:07 EvilRedHorse