
IO values for i7ie.2xlarge are identical to i7ie.xlarge

mykaul opened this issue 1 year ago • 12 comments

i7ie.xlarge:
  read_iops: 117257
  read_bandwidth: 1148572714
  write_iops: 94180
  write_bandwidth: 505684885
i7ie.2xlarge:
  read_iops: 117257
  read_bandwidth: 1148572714
  write_iops: 94180
  write_bandwidth: 505684885

Whereas https://docs.aws.amazon.com/ec2/latest/instancetypes/so.html#so_instance-store says:

i7ie.large	1 x 1250 GB	NVMe SSD	54,166 / 43,333		✓
i7ie.xlarge	1 x 2500 GB	NVMe SSD	108,333 / 86,666		✓
i7ie.2xlarge	2 x 2500 GB	NVMe SSD	216,666 / 173,332		✓

So the performance should roughly double between them.

Originally posted by @mykaul in #559

mykaul avatar Jan 23 '25 12:01 mykaul

@syuu1228 is it just a copy-paste error, or are these the values we get for i7ie.2xlarge?

roydahan avatar Jan 23 '25 13:01 roydahan

Here are the raw values of the i7ie iotune results; iotune was executed 3 times on each instance, and the averages are shown at the bottom. It seems i7ie.large uses the slowest SSD, i7ie.xlarge and i7ie.2xlarge use a medium-speed SSD, and the larger instances (3xlarge, 6xlarge, 12xlarge, ...) use the fastest SSD. So I decided to copy the i7ie.xlarge parameters to i7ie.2xlarge.

instance_type   read_iops       read_bandwidth  write_iops      write_bandwidth
i7ie.large.0    58450   574886272       47148   253077216
i7ie.large.1    58447   574871872       47144   253152960
i7ie.large.2    58452   574805824       47145   253168576
i7ie.xlarge.0   117261  1148629760      94184   505675680
i7ie.xlarge.1   117253  1148723072      94177   505678784
i7ie.xlarge.2   117257  1148365312      94181   505700192
i7ie.2xlarge.0  117266  1148608256      94166   505745536
i7ie.2xlarge.1  117270  1148134016      94161   505677120
i7ie.2xlarge.2  117266  1148369024      94177   505698688
i7ie.3xlarge.0  352843  3422118400      119127  1526412672
i7ie.3xlarge.1  352844  3421702144      118973  1526431104
i7ie.3xlarge.2  352817  3424049152      119881  1526483456
i7ie.6xlarge.0  352813  3432191232      118808  1526523520
i7ie.6xlarge.1  352837  3424704512      119741  1526485376
i7ie.6xlarge.2  352822  3428499456      119246  1526491136
i7ie.12xlarge.0 352835  3422241280      119566  1526699136
i7ie.12xlarge.1 352829  3425162240      119214  1526719872
i7ie.12xlarge.2 352832  3424160000      118033  1526715776
i7ie.18xlarge.0 352825  3425764608      119544  1526821376
i7ie.18xlarge.1 352816  3425155072      119555  1526833664
i7ie.18xlarge.2 352815  3428421120      119518  1526809088
i7ie.24xlarge.0 352824  3424752128      119147  1526504320
i7ie.24xlarge.1 352832  3422750976      119154  1526500992
i7ie.24xlarge.2 352826  3424811008      119535  1526532480
i7ie.48xlarge.0 352834  3423049728      119516  1526578176
i7ie.48xlarge.1 352831  3424246016      119574  1526628352
i7ie.48xlarge.2 352815  3428381440      119226  1526603392

i7ie.large      58449   574854656       47145   253132917
i7ie.xlarge     117257  1148572714      94180   505684885
i7ie.2xlarge    117267  1148370432      94168   505707114
i7ie.3xlarge    352834  3422623232      119327  1526442410
i7ie.6xlarge    352824  3428465066      119265  1526500010
i7ie.12xlarge   352832  3423854506      118937  1526711594
i7ie.18xlarge   352818  3426446933      119539  1526821376
i7ie.24xlarge   352827  3424104704      119278  1526512597
i7ie.48xlarge   352826  3425225728      119438  1526603306

I just ran iotune on i7ie.2xlarge again to make sure the current parameters are correct; the result was:

disks:
  - mountpoint: /var/lib/scylla
    read_iops: 117258
    read_bandwidth: 1150152192
    write_iops: 94138
    write_bandwidth: 508599840

The result seems unchanged; it's almost the same performance as i7ie.xlarge.

These results don't seem to match Amazon's documentation, but aren't the figures in Amazon's documentation the sum across all drives? If that's correct, they probably do match our measurements.

syuu1228 avatar Feb 05 '25 16:02 syuu1228

Sounds OK to me. We will know the real effect only when running performance tests on these instance types, but I don't think that's necessary right now. We may want @avikivity or @xemul to approve that.

roydahan avatar Feb 06 '25 13:02 roydahan

Can we run fio to re-verify? It makes little sense. And we can open a support issue with AWS to clarify that.

mykaul avatar Feb 24 '25 16:02 mykaul

Here are fio benchmark results on the i7ie instances. I used this script to run the fio benchmark, which is borrowed from an Alibaba Cloud document, and I added settings to run 4k random read, 4k random write, 1024k sequential read, and 1024k sequential write.
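The exact job parameters are in the linked script; as a rough, hypothetical sketch of the four workloads described (the device path, queue depth, job count, and runtime below are assumptions, not values taken from the script):

    import subprocess

    # Hypothetical reconstruction of the four fio workloads described above.
    DEVICE = "/dev/nvme1n1"  # assumed raw instance-store device

    workloads = [
        ("randread",  "4k"),     # 4k random read
        ("randwrite", "4k"),     # 4k random write
        ("read",      "1024k"),  # 1024k sequential read
        ("write",     "1024k"),  # 1024k sequential write
    ]

    for rw, bs in workloads:
        subprocess.run([
            "fio",
            f"--name={rw}-{bs}",
            f"--filename={DEVICE}",
            f"--rw={rw}",
            f"--bs={bs}",
            "--direct=1",          # bypass the page cache
            "--ioengine=libaio",
            "--iodepth=128",       # assumed queue depth
            "--numjobs=4",         # assumed parallel jobs
            "--runtime=60",
            "--time_based",
            "--group_reporting",
        ], check=True)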

Below are the results. Write IOPS on the larger nodes appear faster than with iotune, but the rest are almost the same.

instance_type   read_iops       read_bandwidth  write_iops      write_bandwidth
i7ie.large.0    57358.897444    529908613       46014.93184     259432856
i7ie.large.1    57425.290804    530048639       46048.393548    259363116
i7ie.large.2    57363.097024    529908987       46013.29867     259441124
i7ie.xlarge.0   114826.156512   1059567362      91784.735877    519619300
i7ie.xlarge.1   114901.779763   1059531011      91780.403266    519585084
i7ie.xlarge.2   114817.590988   1059636413      91773.752791    519619610
i7ie.2xlarge.0  114499.933351   1060448511      91714.704608    519484064
i7ie.2xlarge.1  114628.811357   1060098813      91753.59035     519242601
i7ie.2xlarge.2  114634.264196   1059995135      91749.3336      519415015
i7ie.3xlarge.0  341885.356262   3164666192      271560.887822   1591678703
i7ie.3xlarge.1  341857.15714    3164807604      271270.670177   1592670126
i7ie.3xlarge.2  341912.547909   3164877365      271464.740385   1592007490
i7ie.6xlarge.0  341806.064645   3163721089      271102.225776   1592563132
i7ie.6xlarge.1  341803.378649   3163337088      271409.52254    1591173459
i7ie.6xlarge.2  341627.54898    3164103311      271284.876553   1591975327
i7ie.12xlarge.0 341811.719244   3162599725      271509.075166   1589655775
i7ie.12xlarge.1 341687.506245   3162152063      271441.783787   1590894341
i7ie.12xlarge.2 341788.307795   3161216516      271374.100719   1590740381
i7ie.18xlarge.0 341754.229386   3161028367      271493.025269   1590387473
i7ie.18xlarge.1 341806.79964    3161857281      271453.425661   1591915683
i7ie.18xlarge.2 341805.860806   3161339210      271366.006326   1589874483
i7ie.24xlarge.0 341971.82911    3159988937      271577.088258   1593055793
i7ie.24xlarge.1 341996.970201   3161060380      271490.462399   1589832572
i7ie.24xlarge.2 341951.683261   3160438313      271478.476546   1590657273
i7ie.48xlarge.0 342012.549933   3157185206      271378.191259   1590598469
i7ie.48xlarge.1 342010.387189   3160187071      271492.643632   1590283562
i7ie.48xlarge.2 341819.746347   3161257346      271371.080487   1590534718

i7ie.large      57382   529955413       46025   259412365
i7ie.xlarge     114848  1059578262      91779   519607998
i7ie.2xlarge    114587  1060180819      91739   519380560
i7ie.3xlarge    341885  3164783720      271432  1592118773
i7ie.6xlarge    341745  3163720496      271265  1591903972
i7ie.12xlarge   341762  3161989434      271441  1590430165
i7ie.18xlarge   341788  3161408286      271437  1590725879
i7ie.24xlarge   341973  3160495876      271515  1591181879
i7ie.48xlarge   341947  3159543207      271413  1590472249

(Raw result files are here.)

syuu1228 avatar Mar 26 '25 20:03 syuu1228

And we can open a support issue with AWS to clarify that.

I tried to open an issue with AWS support but I got "You don't have the necessary IAM permissions to view that support case". Could anyone with sufficient permissions open the issue?

syuu1228 avatar Mar 27 '25 09:03 syuu1228

Please file a ticket with [email protected] to give you permissions to engage with AWS support.

avikivity avatar Mar 27 '25 10:03 avikivity

@syuu1228 I have direct contacts on the storage team working on i7ie and i7i. I'll forward you their information; please CC me on the email thread to them.

roydahan avatar Mar 30 '25 21:03 roydahan

@roydahan - any news? Could it be that we are not setting RAID0 on i7ie.2xlarge?

mykaul avatar May 14 '25 09:05 mykaul

It also seems that:

i7ie.3xlarge    341885  3164783720      271432  1592118773
i7ie.6xlarge    341745  3163720496      271265  1591903972

have the same issue?

mykaul avatar May 14 '25 09:05 mykaul

@roydahan - any news? Could it be that we are not setting RAID0 on i7ie.2xlarge?

I don't remember seeing an email. @yaronkaikov, let's move it forward together.

roydahan avatar May 14 '25 15:05 roydahan

@syuu1228 I'm re-reading your original comment and I'm a bit confused. You wrote:

I just ran iotune on i7ie.2xlarge again to make sure the current parameters are correct; the result was:

disks:
  - mountpoint: /var/lib/scylla
    read_iops: 117258
    read_bandwidth: 1150152192
    write_iops: 94138
    write_bandwidth: 508599840

The result seems unchanged; it's almost the same performance as i7ie.xlarge.

These results don't seem to match Amazon's documentation, but aren't the figures in Amazon's documentation the sum across all drives? If that's correct, they probably do match our measurements.

The answer is that Amazon's documentation does indicate the sum across all drives. But I don't understand the second part, where you write that it matches our measurements. How? The measurement you show above for i7ie.2xlarge is half of what they publish: read_iops 117K vs 216K, write_iops 94K vs 173K.

Are you running iotune using our AMI, on a RAID0 of all disks? (IIUC from what you wrote in the PR for i7i, it seems that you don't...)

roydahan avatar May 14 '25 16:05 roydahan

@syuu1228 I'm re-reading your original comment and I'm a bit confused. [...] Are you running iotune using our AMI, on a RAID0 of all disks? (IIUC from what you wrote in the PR for i7i, it seems that you don't...)

@roydahan My description was not good. I mean we measure single-disk performance, and then our setup code multiplies these parameters by the number of disks:

            for p in ["read_iops", "read_bandwidth", "write_iops", "write_bandwidth"]:
                self.disk_properties[p] = io_params[t][p] * nr_disks
            self.save()

https://github.com/scylladb/scylla-machine-image/blob/next/common/scylla_cloud_io_setup#L58

On i7ie.2xlarge, read_iops = 117258 * 2 = 234516 and write_iops = 94138 * 2 = 188276. So it comes out as read_iops 235K vs 216K and write_iops 188K vs 173K.

The measurement environment is Ubuntu 24.04, which we use as the base image; since our AMI automatically constructs the RAID0, it's easier to measure single-drive performance using the plain base image.
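For illustration, a minimal, self-contained sketch of that scaling (plain Python using the numbers from this thread, not the actual scylla-machine-image code):

    # Single-drive iotune result for i7ie.2xlarge, from the re-run above.
    single_disk = {
        "read_iops": 117258,
        "read_bandwidth": 1150152192,
        "write_iops": 94138,
        "write_bandwidth": 508599840,
    }
    nr_disks = 2  # i7ie.2xlarge has 2 x 2500 GB NVMe drives

    # The setup code multiplies each parameter by the number of disks.
    published = {p: v * nr_disks for p, v in single_disk.items()}
    print(published["read_iops"])   # 234516 -> ~235K vs AWS's 216,666
    print(published["write_iops"])  # 188276 -> ~188K vs AWS's 173,332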

syuu1228 avatar May 29 '25 09:05 syuu1228

I mean we measure single-disk performance, and then our setup code multiplies these parameters by the number of disks. [...] On i7ie.2xlarge, read_iops = 117258 * 2 = 234516 and write_iops = 94138 * 2 = 188276.

Here are the measurement results for a single drive, the single-drive results multiplied by the number of disks, and a comparison with the AWS specifications. Write IOPS appears to be slower than the AWS specification for 3xlarge and larger sizes; this is because the actual single-drive iotune measurements were slower than the spec.

single drive performance

instance_type    read_iops  write_iops  nr_disks
i7ie.large           58449       47145         1
i7ie.xlarge         117257       94180         1
i7ie.2xlarge        117267       94168         2
i7ie.3xlarge        352834      119327         1
i7ie.6xlarge        352824      119265         2
i7ie.12xlarge       352832      118937         4
i7ie.18xlarge       352818      119539         6
i7ie.24xlarge       352827      119278         8
i7ie.48xlarge       352826      119438        16

single drive performance * nr_disks  

instance_type    read_iops  write_iops  AWS_read_iops[1]  AWS_write_iops[1]
i7ie.large           58449       47145             54166              43333
i7ie.xlarge         117257       94180            108333              86666
i7ie.2xlarge        234534      188336            216666             173332
i7ie.3xlarge        352834      119327            325000             260000
i7ie.6xlarge        705648      238530            650000             520000
i7ie.12xlarge      1411328      475748           1300000            1040000
i7ie.18xlarge      2116908      717234           1950000            1560000
i7ie.24xlarge      2822616      954224           2600000            2080000
i7ie.48xlarge      5645216     1911008           5200000            4160000

[1] From the AWS specs.

syuu1228 avatar Jun 02 '25 07:06 syuu1228

I edited the tables above a bit, just to see the numbers side by side. It looks like our write performance can't keep up with what the node can provide, from i7ie.3xlarge and above! @xemul, @avikivity - thoughts?

mykaul avatar Jun 03 '25 14:06 mykaul

@syuu1228 where are the results of the "raid0" runs (not single disk)? Let's put them side by side with this table.

In addition, I suggest the following edit to the table to make it more readable:

single drive performance * nr_disks  

instance_type read_iops write_iops AWS_read_iops[1] AWS_write_iops[1]
i7ie.large 58K 47K  54K 43K  
i7ie.xlarge 117K 94K  108K 86K  
i7ie.2xlarge 235K 188K  217K 173K  
i7ie.3xlarge 353K 119K  325K 260K  
i7ie.6xlarge 706K 239K  650K 520K  
i7ie.12xlarge 1411K 476K  1300K 1040K  
i7ie.18xlarge 2117K 717K  1950K 1560K  
i7ie.24xlarge 2823K 954K  2600K 2080K  
i7ie.48xlarge 5645K 1911K  5200K 4160K  

Or better:

single drive performance * nr_disks  

instance_type read_iops AWS_read_iops[1] read_delta write_iops AWS_write_iops[1] write_delta
i7ie.large 58K 54K +7% 47K 43K +9%
i7ie.xlarge 117K 108K +8% 94K 86K +9%
i7ie.2xlarge 235K 217K +8% 188K 173K +9%
i7ie.3xlarge 353K 325K +9% 119K 260K -54%
i7ie.6xlarge 706K 650K +9% 239K 520K -54%
i7ie.12xlarge 1411K 1300K +9% 476K 1040K -54%
i7ie.18xlarge 2117K 1950K +9% 717K 1560K -54%
i7ie.24xlarge 2823K 2600K +9% 954K 2080K -54%
i7ie.48xlarge 5645K 5200K +9% 1911K 4160K -54%
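For reference, a minimal sketch showing how the delta columns can be derived (values taken from the tables above):

    # delta = (our published value - AWS spec) / AWS spec
    rows = [
        # (instance_type, read_iops, aws_read_iops, write_iops, aws_write_iops)
        ("i7ie.2xlarge", 234534, 216666, 188336, 173332),
        ("i7ie.3xlarge", 352834, 325000, 119327, 260000),
    ]
    for name, r, aws_r, w, aws_w in rows:
        read_delta = (r - aws_r) / aws_r * 100
        write_delta = (w - aws_w) / aws_w * 100
        print(f"{name}: read {read_delta:+.0f}%, write {write_delta:+.0f}%")
    # i7ie.2xlarge: read +8%, write +9%
    # i7ie.3xlarge: read +9%, write -54%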

roydahan avatar Jun 03 '25 22:06 roydahan

@syuu1228 where are the results of the "raid0" runs (not single disk)? Let's put them side by side with this table.

Here are additional iotune results, measured on a RAID0 volume instead of a single drive (tested with RAID0, running the Scylla 2025.3.0-dev AMI):

Due to https://github.com/scylladb/scylla-machine-image/issues/723, the benchmark could only be run 2 times (not 3) on i7ie.48xlarge.

instance_type read_iops read_bandwidth write_iops write_bandwidth
i7ie.large.0 58426 574715008 46887 251620544
i7ie.large.1 62314 574897088 47047 246629632
i7ie.large.2 58437 574944960 47069 247261568
i7ie.xlarge.0 117320 1148520576 94264 500478400
i7ie.xlarge.1 117225 1149501952 94215 503726432
i7ie.xlarge.2 117216 1149855872 94253 502346432
i7ie.2xlarge.0 234822 2288215808 188731 1012827200
i7ie.2xlarge.1 234855 2288272896 188179 1011206592
i7ie.2xlarge.2 234845 2288216576 188758 1007555712
i7ie.3xlarge.0 352780 3440667648 195347 1531361024
i7ie.3xlarge.1 352777 3440119808 195994 1531312384
i7ie.3xlarge.2 352782 3438665984 195243 1531802240
i7ie.6xlarge.0 525023 6747701760 352139 3050450688
i7ie.6xlarge.1 522884 6748208640 353432 3050736640
i7ie.6xlarge.2 525638 6747821568 358519 3050702592
i7ie.12xlarge.0 952244 8574459904 650959 6103974400
i7ie.12xlarge.1 951841 8574929920 651901 6104414208
i7ie.12xlarge.2 949660 8572749312 651515 6104256000
i7ie.18xlarge.0 1353136 8571264512 862961 8553358336
i7ie.18xlarge.1 1362974 8577720832 883500 8578407424
i7ie.18xlarge.2 1353274 8567787008 859629 8565970432
i7ie.24xlarge.0 1767879 8578226176 1039556 8578760192
i7ie.24xlarge.1 1770136 8579100672 1045217 8578906112
i7ie.24xlarge.2 1768215 8578795008 1041029 8579819520
i7ie.48xlarge.0 3383705 8477860352 2044940 8475563520
i7ie.48xlarge.1 3338124 8437319168 2032106 8404414464
i7ie.metal-24xl.0 1790041 8577441280 1048330 8559525888
i7ie.metal-24xl.1 1793329 8575400960 1042127 8570143232
i7ie.metal-24xl.2 1798110 8570591744 1039189 8576837632
i7ie.metal-48xl.0 3510983 8540174336 2017272 8541017600
i7ie.metal-48xl.1 3517379 8545589760 2016191 8545735680
i7ie.metal-48xl.2 3517068 8551666176 2036124 8549539840
i7ie.large avg 59725 574852352 47001 248503914
i7ie.xlarge avg 117253 1149292800 94244 502183754
i7ie.2xlarge avg 234840 2288235093 188556 1010529834
i7ie.3xlarge avg 352779 3439817813 195528 1531491882
i7ie.6xlarge avg 524515 6747910656 354696 3050629973
i7ie.12xlarge avg 951248 8574046378 651458 6104214869
i7ie.18xlarge avg 1356461 8572257450 868696 8565912064
i7ie.24xlarge avg 1768743 8578707285 1041934 8579161941
i7ie.48xlarge avg 3360914 8457589760 2038523 8439988992
i7ie.metal-24xl avg 1793826 8574477994 1043215 8568835584
i7ie.metal-48xl avg 3515143 8545810090 2023195 8545431040

syuu1228 avatar Jun 09 '25 17:06 syuu1228

The results above (https://github.com/scylladb/scylla-machine-image/issues/608#issuecomment-2956390772 ) show exactly what I had hoped to see. What prevents us from fixing this issue?

mykaul avatar Jul 10 '25 09:07 mykaul

@roydahan @yaronkaikov does it make sense to fix this bug together with this PR?

lsfreitas avatar Aug 11 '25 20:08 lsfreitas

Yes, this one is going to be fixed as part of your PR; just mention it in your commit and it will be closed.

roydahan avatar Aug 11 '25 22:08 roydahan

Closing this issue as it was moved to Jira. Please continue the thread in https://scylladb.atlassian.net/browse/SMI-178

dani-tweig avatar Aug 25 '25 17:08 dani-tweig