How to reduce disk load

Open atomlab opened this issue 4 years ago • 17 comments

Hi. How can I reduce the disk read load for thor? Can I reduce the number of threads, for example? We use fast NVMe SSDs on the server.

/opt/vechain/bin/thor --network main --data-dir /opt/vechain/data --api-addr 0.0.0.0:8669 --verbosity 9 --max-peers 50 --cache 4096

/opt/vechain/bin/thor --version
Thor version 1.3.3-419e32e-release

thor process

# ps auxf| grep thor
vechain  19062  0.0 55.5 6645136 4338496 ?     Ssl  Apr22 7558:48 /opt/vechain/bin/thor 
--network main --data-dir /opt/vechain/data --api-addr 0.0.0.0:8669 --verbosity 9 
--max-peers 50 --cache 4096

217 threads (!)

# ps -e -T | grep 19062| wc -l
217
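
The same count can also be cross-checked from procfs for the thor PID:

# grep Threads /proc/19062/status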

iotop

[Screenshot: iotop, 2020-04-24 08:35:37]

iostat -x 1

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme0n1       29072.00   88.00 156716.00  21902.00     0.00   265.00   0.00  75.07    0.10    2.14   2.13     5.39   248.89   0.03  82.80
nvme1n1       24594.00   88.00 132324.00  21902.00     0.00   265.00   0.00  75.07    0.11    2.09   1.78     5.38   248.89   0.03  78.00
md2           53658.00  336.00 289096.00  21892.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     5.39    65.15   0.00   0.00
md0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md1              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.85    0.00   20.35    1.33    0.00   63.47

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme0n1       34277.00   13.00 190460.00     23.50     0.00     0.00   0.00   0.00    0.09    3.38   2.16     5.56     1.81   0.03  88.80
nvme1n1       29253.00   13.00 163220.00     23.50     0.00     0.00   0.00   0.00    0.09    3.38   1.66     5.58     1.81   0.03  87.20
md2           63534.00    4.00 353592.00      8.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     5.57     2.00   0.00   0.00
md0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md1              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          14.65    0.00   20.88    1.26    0.00   63.22

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme0n1       36900.00   12.00 202444.00     20.00     0.00     0.00   0.00   0.00    0.09    2.67   2.27     5.49     1.67   0.02  91.60
nvme1n1       31646.00   12.00 176776.00     20.00     0.00     0.00   0.00   0.00    0.09    2.67   1.89     5.59     1.67   0.03  88.40
md2           68545.00    4.00 379216.00      8.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     5.53     2.00   0.00   0.00
md0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md1              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.14    0.00   17.90    6.30    0.00   58.66

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
nvme0n1       28694.00  110.00 161896.00  21780.00     0.00   207.00   0.00  65.30    0.11    2.00   2.26     5.64   198.00   0.03  85.60
nvme1n1       24235.00  110.00 135280.00  21780.00     0.00   207.00   0.00  65.30    0.11    2.04   1.86     5.58   198.00   0.03  78.80
md2           52931.00  277.00 297208.00  21704.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     5.62    78.35   0.00   0.00
md0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
md1              0.00   14.00      0.00     56.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     4.00   0.00   0.00

atomlab avatar Apr 24 '20 05:04 atomlab

There are some scenarios that will cause high disk load (read operations, in this case):

  1. re-synchronization
  2. on-going pruning
  3. intensive block/tx/state queries via the RESTful API
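
To confirm that the reads really come from the thor process (and not something else on the host), one quick check is to sample the kernel's per-process I/O counters twice and compare, for example:

# PID=$(pgrep -f /opt/vechain/bin/thor)
# grep -E 'read_bytes|write_bytes' /proc/$PID/io; sleep 10; grep -E 'read_bytes|write_bytes' /proc/$PID/io

The difference between the two samples divided by 10 gives the storage throughput attributable to thor.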

libotony avatar Apr 24 '20 07:04 libotony

There are some scenarios that will cause high disk load (read operations, in this case):

  1. re-synchronization
  2. on-going pruning
  3. intensive block/tx/state queries via the RESTful API

Thank you for the response.

thor is at the latest block:

root@vet-02:~# curl http://localhost:8669/blocks/best
{"number":5722736,"id":"0x0057527088bc55a52622f8e9492fcea93567c89596cf5d1eb183181ad661c465",

There haven't been any blocks/txs/state requests.

Can I set a thread limit? I think it's too much for 4-8 vCPUs.

# ps -e -T | grep 19062| wc -l
217

atomlab avatar Apr 24 '20 07:04 atomlab

thor debug logs. Last 1000 lines. https://pastebin.com/raw/D78yRuym

atomlab avatar Apr 24 '20 07:04 atomlab

Usually the node does not generate that much disk load. Can you check whether another program is making requests to the RESTful API, and whether the high I/O load happens all the time?
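
For example, counting established connections to the API port gives a rough idea of whether any client is hitting it (requests proxied through a local nginx will show up as connections from 127.0.0.1):

# ss -tn state established '( sport = :8669 )' | tail -n +2 | wc -l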

libotony avatar Apr 24 '20 08:04 libotony

Can I set a thread limit? I think it's too much for 4-8 vCPUs.

# ps -e -T | grep 19062| wc -l
217

It's not recommended to limit the thread count, although you can call runtime.GOMAXPROCS to achieve this, unless RAM pressure is a concern.
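
Since thor is a Go program, the Go runtime also honors the GOMAXPROCS environment variable, so the cap can be applied without code changes, e.g. with your flags:

# GOMAXPROCS=4 /opt/vechain/bin/thor --network main --data-dir /opt/vechain/data --api-addr 0.0.0.0:8669 --verbosity 9 --max-peers 50 --cache 4096

Keep in mind this only limits threads running Go code; threads blocked in syscalls are created on top of it, so ps will still report a higher total.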

BTW, it will be helpful to tail lines from main.db/LOG when the high disk load is noticed.
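
For example (the exact path depends on the instance directory thor creates under --data-dir, so locate the file first):

# find /opt/vechain/data -name LOG -path '*main.db*'
# tail -f "$(find /opt/vechain/data -name LOG -path '*main.db*' | head -n 1)"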

qianbin avatar Apr 24 '20 09:04 qianbin

When I stop the thor process, the disk load is okay. I have resynced the blockchain and now everything works fine. I will close the issue. If the problem happens again, I will reopen it. Thanks for your help.

atomlab avatar Apr 27 '20 11:04 atomlab

Hi. I have this problem again. Very high load on the disk, 685 MB/s. iotop shows that thor is writing to disk very intensively.

Can I set limits for thor?

root@vet-01:~# curl http://localhost:8669/blocks/best

{"number":5800992,"id":"0x005884200841de3cbf66f1e41b2e7884d9ab42ce1cbac4ce9534dd4c1ab1488d","size":243,"parentID":"0x0058841f2ce30529bf0fdae20456a6e0604db5066ab14bc6c4de45e4bf5c7238","timestamp":1588496820,"gasLimit":31355345,"beneficiary":"0x7b683b0f5078a850b685b0c2f14262c86ea99b8f","gasUsed":0,"totalScore":558429382,"txsRoot":"0x45b0cfc220ceec5b7c1c62c4d4193d38e4eba48e8815729ce75f9c0ab0e4c1c0","txsFeatures":1,"stateRoot":"0x4748f7e3b68d4e000f94935e5a15143b3ce1fdb975a9082543f302ca103a9c62","receiptsRoot":"0x45b0cfc220ceec5b7c1c62c4d4193d38e4eba48e8815729ce75f9c0ab0e4c1c0","signer":"0x871ea386b71cbc16b7e21357de08778cf4e341a2","isTrunk":true,"transactions":[]}

main.db/LOG

Stdout Logs

root@vet-01:~# journalctl -fu thor
-- Logs begin at Wed 2020-04-22 18:28:05 UTC. --
May 03 09:07:26 vet-01 thor[22157]: t=2020-05-03T09:07:26+0000 lvl=info msg="imported blocks (1)" pkg=node txs=0 mgas=0.000 et=2.703ms|45.710ms mgas/s=0.000 id=[#5800995…f13016d4]
May 03 09:07:26 vet-01 thor[22157]: t=2020-05-03T09:07:26+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer cab862fef37c048d 128.1.48.167:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:26 vet-01 thor[22157]: t=2020-05-03T09:07:26+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer d1c780cdfbd83db4 152.32.188.92:11235" dir=outbound msg=MsgNewBlock
May 03 09:07:26 vet-01 thor[22157]: t=2020-05-03T09:07:26+0000 lvl=dbug msg="wash done" pkg=txpool len=1 removed=0 elapsed=1.765ms
May 03 09:07:28 vet-01 thor[22157]: t=2020-05-03T09:07:28+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer 88cf00e83688e933 152.32.140.65:11235" dir=outbound msg=MsgGetBlockIDByNumber
May 03 09:07:28 vet-01 thor[22157]: t=2020-05-03T09:07:28+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer 88cf00e83688e933 152.32.140.65:11235" dir=outbound msg=MsgGetBlocksFromNumber
May 03 09:07:31 vet-01 thor[22157]: t=2020-05-03T09:07:31+0000 lvl=dbug msg="synchronization start" pkg=comm
May 03 09:07:31 vet-01 thor[22157]: t=2020-05-03T09:07:31+0000 lvl=dbug msg="start to process block stream" pkg=node
May 03 09:07:31 vet-01 thor[22157]: t=2020-05-03T09:07:31+0000 lvl=dbug msg="process block stream done" pkg=node err=nil
May 03 09:07:31 vet-01 thor[22157]: t=2020-05-03T09:07:31+0000 lvl=dbug msg="synchronization done" pkg=comm peer="Peer dfbb840bc8324809 51.91.82.15:2600" dir=outbound
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer a434245367373fde 128.1.42.238:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer 380738d424b19f39 104.218.165.203:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer dfbb840bc8324809 51.91.82.15:2600" dir=outbound msg=MsgNewBlockID
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer 03b606278b875bd2 128.14.231.4:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer d1c780cdfbd83db4 152.32.188.92:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer cab862fef37c048d 128.1.48.167:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=info msg="imported blocks (1)" pkg=node txs=1 mgas=0.021 et=9.318ms|64.541ms mgas/s=0.284 id=[#5800996…f932e309]
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer 88cf00e83688e933 152.32.140.65:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="tx washed out" pkg=txpool id=0x5bc9956d572e44af005800af8f64e350bbe233bfd0cad2b79ff938f592d4e07d err="known tx"
May 03 09:07:36 vet-01 thor[22157]: t=2020-05-03T09:07:36+0000 lvl=dbug msg="wash done" pkg=txpool len=1 removed=1 elapsed=3.895ms
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer 03b606278b875bd2 128.14.231.4:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer a434245367373fde 128.1.42.238:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer dfbb840bc8324809 51.91.82.15:2600" dir=outbound msg=MsgNewBlockID
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer d1c780cdfbd83db4 152.32.188.92:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=dbug msg="skip new block ID announcement" pkg=comm peer="Peer d1c780cdfbd83db4 152.32.188.92:11235" dir=outbound
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=info msg="imported blocks (1)" pkg=node txs=0 mgas=0.000 et=3.165ms|2.161ms mgas/s=0.000 id=[#5800997…d0d3fceb]
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer 380738d424b19f39 104.218.165.203:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer cab862fef37c048d 128.1.48.167:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:45 vet-01 thor[22157]: t=2020-05-03T09:07:45+0000 lvl=dbug msg="received RPC call" pkg=comm peer="Peer 88cf00e83688e933 152.32.140.65:11235" dir=outbound msg=MsgNewBlockID
May 03 09:07:46 vet-01 thor[22157]: t=2020-05-03T09:07:46+0000 lvl=dbug msg="wash done" pkg=txpool len=0 removed=0 elapsed=1.479ms

[Screenshot: 2020-05-03 12:02:31]

iostat

[Screenshot: iostat, 2020-05-03 13:14:46]

Nginx requests per second are not high:

 # tail -f /var/log/nginx/vechain.access.log| pv -r -l >/dev/null
[14.5 /s]

atomlab avatar May 03 '20 10:05 atomlab

I can set CPU/memory/network limits, but I can't set I/O limits for the thor process on my server.
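
(For what it's worth: if the host runs systemd with the cgroup v2 io controller, a per-service bandwidth cap could in principle be set with a drop-in along these lines, targeting the md2 array from iostat; the limit values are only examples and I have not verified how well this behaves on md RAID.)

# systemctl edit thor        # assuming the unit is named thor.service; add:
[Service]
IOReadBandwidthMax=/dev/md2 200M
IOWriteBandwidthMax=/dev/md2 100M
# systemctl restart thor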

atomlab avatar May 03 '20 10:05 atomlab

How long did it last? If it was only for several minutes, it was most likely the state pruner running.

You can filter the logs with the keyword 'pruner'.

qianbin avatar May 03 '20 10:05 qianbin

How long did it last? If it was only for several minutes, it was most likely the state pruner running.

It started on 2020-05-01 at ~16:00 UTC.

[Screenshot: 2020-05-03 13:31:23]

atomlab avatar May 03 '20 10:05 atomlab

@qianbin

you can filter out logs with keyword 'pruner'.

Last 20 records with keyword 'pruner'.

root@vet-01:~# grep pruner /var/log/syslog| tail -n 20
Apr 23 04:46:49 vet-01 thor[3920]: t=2020-04-23T04:46:49+0000 lvl=info msg="archiving account trie..." pkg=pruner
Apr 23 04:48:36 vet-01 thor[3920]: t=2020-04-23T04:48:36+0000 lvl=info msg="archived account trie" pkg=pruner nodes=18274 entries=6777 storageNodes=773300 storageEntries=234992
Apr 23 04:52:12 vet-01 thor[3920]: t=2020-04-23T04:52:12+0000 lvl=info msg="sweeping stale nodes..." pkg=pruner
Apr 23 04:52:20 vet-01 thor[3920]: t=2020-04-23T04:52:20+0000 lvl=info msg="swept stale nodes" pkg=pruner count=3364134
Apr 23 04:52:21 vet-01 thor[3920]: t=2020-04-23T04:52:21+0000 lvl=info msg=initiated pkg=pruner range="[5623959, 5691750]"
Apr 23 04:52:21 vet-01 thor[3920]: t=2020-04-23T04:52:21+0000 lvl=info msg="archiving index trie..." pkg=pruner
Apr 23 04:52:49 vet-01 thor[3920]: t=2020-04-23T04:52:49+0000 lvl=info msg="archived index trie" pkg=pruner nodes=252631 entries=120139
Apr 23 04:52:49 vet-01 thor[3920]: t=2020-04-23T04:52:49+0000 lvl=info msg="archiving account trie..." pkg=pruner
Apr 23 04:54:27 vet-01 thor[3920]: t=2020-04-23T04:54:27+0000 lvl=info msg="archived account trie" pkg=pruner nodes=17082 entries=6274 storageNodes=803692 storageEntries=248991
Apr 24 08:04:12 vet-01 thor[3920]: t=2020-04-24T08:04:12+0000 lvl=info msg="stopping pruner..."
Apr 24 08:04:16 vet-01 thor[22157]: t=2020-04-24T08:04:16+0000 lvl=info msg="pruner started" pkg=pruner range="[5623959, 5691750]" step=dropStale
Apr 28 07:58:36 vet-01 thor[22157]: t=2020-04-28T07:58:36+0000 lvl=info msg="sweeping stale nodes..." pkg=pruner
Apr 28 07:58:45 vet-01 thor[22157]: t=2020-04-28T07:58:45+0000 lvl=info msg="swept stale nodes" pkg=pruner count=3447391
Apr 28 08:00:36 vet-01 thor[22157]: t=2020-04-28T08:00:36+0000 lvl=info msg=initiated pkg=pruner range="[5691750, 5757425]"
Apr 28 08:00:36 vet-01 thor[22157]: t=2020-04-28T08:00:36+0000 lvl=info msg="archiving index trie..." pkg=pruner
Apr 28 08:01:01 vet-01 thor[22157]: t=2020-04-28T08:01:01+0000 lvl=info msg="archived index trie" pkg=pruner nodes=265525 entries=122717
Apr 28 08:01:01 vet-01 thor[22157]: t=2020-04-28T08:01:01+0000 lvl=info msg="archiving account trie..." pkg=pruner
Apr 28 08:02:47 vet-01 thor[22157]: t=2020-04-28T08:02:47+0000 lvl=info msg="archived account trie" pkg=pruner nodes=20711 entries=7786 storageNodes=697653 storageEntries=218045
May  3 10:25:10 vet-01 thor[22157]: t=2020-05-03T10:25:10+0000 lvl=info msg="stopping pruner..."
May  3 10:25:13 vet-01 thor[32273]: t=2020-05-03T10:25:13+0000 lvl=info msg="pruner started" pkg=pruner range="[5691750, 5757425]" step=dropStale
root@vet-01:~#

I restarted thor at 10:25 today.

After restarting thor:

[Screenshot: 2020-05-03 13:48:15]

atomlab avatar May 03 '20 10:05 atomlab

So it's nothing related to the pruner. I'm wondering what the disk load looks like if all HTTP requests are stopped.

qianbin avatar May 03 '20 10:05 qianbin

So it's nothing related to the pruner. I'm wondering what the disk load looks like if all HTTP requests are stopped.

What about main.db/LOG? I can't find the reason there.

atomlab avatar May 03 '20 10:05 atomlab

So it's nothing related to the pruner. I'm wondering what the disk load looks like if all HTTP requests are stopped.

What about main.db/LOG? I can't find the reason there.

that looks all right.

qianbin avatar May 03 '20 10:05 qianbin

I have changed --cache 4096 to --cache 1024 and restarted. I will watch whether it helps.

/opt/vechain/bin/thor --network main --data-dir /opt/vechain/data --api-addr 0.0.0.0:8669 --verbosity 9 --max-peers 50 --cache 1024

atomlab avatar May 03 '20 10:05 atomlab

I have changed --cache 4096 to --cache 1024 and restarted. I will watch whether it helps.

/opt/vechain/bin/thor --network main --data-dir /opt/vechain/data --api-addr 0.0.0.0:8669 --verbosity 9 --max-peers 50 --cache 1024

Oh, I forgot to ask about the RAM limit. What's the limit now? 16 GB is recommended. If a low RAM limit is set, pages may go into swap and the disk load will rise.
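
A quick way to check whether the process has actually been pushed into swap is its VmSwap counter:

# grep -E 'VmRSS|VmSwap' /proc/$(pgrep -f /opt/vechain/bin/thor)/status

If VmSwap is well above zero while the disk is busy, the RAM limit is too tight and much of the disk load is likely swap traffic.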

qianbin avatar May 03 '20 11:05 qianbin

I have changed --cache 4096 to --cache 1024 and restarted. I will watch whether it helps.

/opt/vechain/bin/thor --network main --data-dir /opt/vechain/data --api-addr 0.0.0.0:8669 --verbosity 9 --max-peers 50 --cache 1024

Oh, I forgot to ask about the RAM limit. What's the limit now? 16 GB is recommended. If a low RAM limit is set, pages may go into swap and the disk load will rise.

The limit is 8 GB now; I will set it to 16 GB.

atomlab avatar May 03 '20 13:05 atomlab