
Performance degradation on AMD EPYC compared to M1 Max

Open · academe-01 opened this issue Oct 12 '22

Hello! Any ideas why I'm getting performance degradation on AMD? Stats are based on blocks > 10000000:

  1. Macbook M1 Max (64GB, 1xNVME, no RAID) - 98 blk/s, and 13000 tx/s
  2. AMD EPYC 7543 (1024GB, 4xNVME RAID10) - 42 blk/s, and 7000 tx/s

On both machines the node gets its data from a local geth full node. I'm wondering whether erigon should be recompiled with special AMD optimization flags, or whether the M1 Max is just too good to be true?
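
By "recompiled with optimization flags" I mean something along these lines (GOAMD64 is a standard Go toolchain setting and Zen 3 should support level v3; whether erigon actually benefits from it is exactly my question):

# hedged sketch: rebuild with an amd64 microarchitecture target;
# treat this as an experiment, not a confirmed erigon recommendation
GOAMD64=v3 make erigon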

academe-01 commented Oct 12 '22 18:10

  1. Show logs.
  2. Likely it depends on disk read latency more than CPU.
  3. You may also be seeing OS page-cache misses; check with something like "vmstat -S m 1" (the bi column) or "iostat -sh 1" (example below).
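
For example, something like this while the node is running:

# bi column = blocks read in from disk per second; sustained high values while
# executing blocks suggest the working set is not in the page cache
vmstat -S m 1
# per-device read latency (r_await) and average request size (rareq-sz)
iostat -x 1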

AskAlexSharov commented Oct 13 '22 06:10

@AskAlexSharov

Let's instead discuss pure MDBX operations, using integration mdbx_to_mdbx (which is broken, by the way: https://github.com/ledgerwatch/erigon/issues/5708#issuecomment-1284562455, as it simply overwrites the dst db completely because of os.RemoveAll(to)).

Considering it's just sequential reads/writes, I'm getting average performance of ~1000-3000 kv/second, which I can't call fast at all. There is 1 TB of RAM available. The drives are directly attached to the server and are DC grade. Both src/dst drives are identical and their specifications are impressive: sequential read 6200 MB/s, sequential write 2900 MB/s, random read 980K IOPS, random write 180K IOPS.

https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1725b/mzpll3t2hajq-00005/

SRC drive (md7): RAID0. DST drive (nvme2c1n1): no RAID. Both ext4.

Based on iostat, rKB/s on the SRC DB drive is 20000-50523 and r/s is ~2000-12340; these values are extremely low. Meanwhile the DST DB drive almost never has any write load.
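
To rule out the drive itself, I could run a single-threaded 4 KiB random-read test, which roughly matches the ~4-5 KB average request size iostat shows; something like this (directory and size are placeholders):

# low-queue-depth random reads approximate MDBX's one-page-at-a-time access;
# compare the reported latency and IOPS against the drive's datasheet numbers
fio --name=mdbx-like --directory=/data/src --size=8G \
    --rw=randread --bs=4k --iodepth=1 --ioengine=psync \
    --direct=1 --runtime=60 --time_based --group_reporting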

Both drives are used only by the integration tool.

What catches my attention is the CPU load: integration spawns multiple threads, two of which are always at 100% while the rest are idle.

I believe that once I understand why plain DB reads/writes are this slow, I'll better understand why a MacBook outperforms this Dell R7525 monster.

root@b:~# cat /proc/meminfo 
MemTotal:       1056420564 kB
MemFree:        602260524 kB
MemAvailable:   1034738876 kB
Buffers:           79156 kB
Cached:         425737572 kB
SwapCached:            0 kB
Active:          4629636 kB
Inactive:       427804776 kB
Active(anon):       1420 kB
Inactive(anon):  6621160 kB
Active(file):    4628216 kB
Inactive(file): 421183616 kB
Unevictable:        3072 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:            234152 kB
Writeback:             0 kB
AnonPages:       6622776 kB
Mapped:         412580696 kB
Shmem:              8216 kB
KReclaimable:   11804956 kB
Slab:           13598232 kB
SReclaimable:   11804956 kB
SUnreclaim:      1793276 kB
KernelStack:       47264 kB
PageTables:      4863256 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    528210280 kB
Committed_AS:   51385252 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1722996 kB
VmallocChunk:          0 kB
Percpu:           186880 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     3657252 kB
DirectMap2M:    344064000 kB
DirectMap1G:    725614592 kB
root@b:~# iostat -x 30

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.11    0.00    0.32    0.86    0.00   97.71

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
dm-0             7.50    174.00     0.00   0.00    0.00    23.20    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.20
dm-1             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop1            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop10           0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop11           0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop2            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop3            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop4            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop5            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop6            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop7            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop8            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
loop9            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
md7           12476.00  58542.00     0.00   0.00    0.07     4.69    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.85 100.20
nvme0c0n1     6312.00  29928.00     1.00   0.02    0.07     4.74    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.45  98.00
nvme1c1n1     6172.50  28878.00     0.50   0.01    0.07     4.68    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.43  98.40
nvme2c1n1        0.00      0.00     0.00   0.00    0.00     0.00    3.00     40.00     7.00  70.00    0.00    13.33    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.40

(attached: intergration.log)

Any ideas?

Turning mdbx.UtterlyNoSync / mdbx.NoReadahead on and off on the dbs didn't show any visible improvement.

academe-01 commented Oct 20 '22 11:10

  1. fixed by https://github.com/ledgerwatch/erigon/pull/5810
  2. try enabling Readahead on the src db (it's disabled by default because usually DB >> RAM)
  3. if you have tons of RAM, you can likely run cat mdbx.dat > /dev/null in the shell; it will quickly load the db into RAM sequentially (see the example below)
  4. it should not be using CPU, I don't know why you see that. You can add the --pprof flag; it will print in the logs a command for CPU profiling, something like go tool pprof -png http://127.0.0.1:6060/debug/pprof/profile\?seconds\=20 > cpu.png
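
For example (the path is a placeholder; Cached in /proc/meminfo should grow by roughly the size of the file):

# sequentially pull the source DB file into the OS page cache before copying
cat /path/to/src/chaindata/mdbx.dat > /dev/null
# confirm the page cache actually grew before re-running the copy
grep -E '^Cached|MemAvailable' /proc/meminfo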

AskAlexSharov commented Oct 20 '22 12:10

@AskAlexSharov

  1. Thanks.
  2. Already tried; it didn't help much.
  3. It's not an option: I have that much free RAM only in the test env. In production I have only ~256 GB free, which should still give millions of KV reads/sec, not just ~1000-3000/sec.
  4. Will do; I'll update you soon.

academe-01 commented Oct 20 '22 12:10

  1. as an alternative to mdbx_to_mdbx, see also these tools (piped example below):
make db-tools
./build/bin/mdbx_dump
./build/bin/mdbx_load
./build/bin/mdbx_stat -a 
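
For example, roughly like this (paths are placeholders for the src/dst environments; flags can differ between libmdbx builds, so check each tool's --help):

# dump all sub-databases from the source env and stream them into the destination
./build/bin/mdbx_dump -a /path/to/src/chaindata | ./build/bin/mdbx_load /path/to/dst/chaindata
# then compare per-table entry counts
./build/bin/mdbx_stat -a /path/to/dst/chaindata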

AskAlexSharov commented Oct 20 '22 12:10

@AskAlexSharov

Alex, out of curiosity, can you run mdbx_to_mdbx on your dataset and share at least some of the output? I'd like to see real-life performance; maybe I'm fighting a ghost. That way I can compare the time for any specific bucket and draw conclusions.

academe-01 commented Oct 20 '22 12:10

> 4. it should not be using CPU, I don't know why you see that. You can add the --pprof flag; it will print in the logs a command for CPU profiling, something like go tool pprof -png http://127.0.0.1:6060/debug/pprof/profile\?seconds\=20 > cpu.png

@AskAlexSharov

I did it for 60 seconds. Does mdbx.Cursor look slow to you?

(attached: CPU profile image)

academe-01 commented Oct 20 '22 12:10