Did the s3randobj option change between v3.0-17 and v3.0-25?

russki opened this issue 11 months ago • 5 comments

We recently upgraded from v3.0-17 to v3.0-25 and noticed that tests with --s3randobj no longer send requests that are random enough to bypass the cache.

All runs with the --s3randobj option are now being served from cache.

Are we supposed to specify some extra parameters to get the old behavior back, or was a bug introduced somewhere between v3.0-17 and v3.0-25?

v3.0-25 example cmdline:

/usr/bin/elbencho --s3endpoints ${endpoint} --hosts elbencho-[0-4].elbencho.elbencho.svc.cluster.local --resfile /tmp/results.txt --csvfile /tmp/results.csv --live1n --livecsv /tmp/live_results.csv --liveint 30000 -r -t 40 -s 100m -b 10m -n 0 -N 20 --s3ignoreerrors --s3fastget --infloop --timelimit 900 --s3objprefix 100mb-obj --s3randobj elbencho

v3.0-25 throughput result (cached)

OPERATION   RESULT TYPE         FIRST DONE   LAST DONE
=========== ================    ==========   =========
READ: 0 files/s; 2439 MiB/s; 0 files; 73174 MiB; 200 threads; 4% CPU; 30s
READ: 0 files/s; 2384 MiB/s; 0 files; 144701 MiB; 200 threads; 4% CPU; 1m0s
READ: 0 files/s; 2365 MiB/s; 0 files; 215651 MiB; 200 threads; 4% CPU; 1m30s

v3.0-17 example cmdline:

/usr/bin/elbencho --s3endpoints ${endpoint} --s3key ${s3key} --s3secret ${s3secret} --hosts elbencho-[0-4].elbencho.elbencho.svc.cluster.local --resfile /tmp/results.txt --csvfile /tmp/results.csv --live1n --livecsv /tmp/live_results.csv --liveint 30000 -r -t 40 -s 100m -b 10m -n 0 -N 20 --s3ignoreerrors --s3fastget --infloop --timelimit 900 --s3objprefix 100mb-obj --s3randobj elbencho

v3.0-17 throughput result (uncached)

OPERATION   RESULT TYPE         FIRST DONE   LAST DONE
=========== ================    ==========   =========
READ: 0 files/s; 78 MiB/s; 0 files; 2343 MiB; 200 threads; 1% CPU; 30s
READ: 0 files/s; 78 MiB/s; 0 files; 4697 MiB; 200 threads; 2% CPU; 1m0s
READ: 0 files/s; 78 MiB/s; 0 files; 7059 MiB; 200 threads; 1% CPU; 1m30s

russki avatar Mar 08 '25 01:03 russki

Hi @russki, this is very surprising, because I'm not aware of any change in behavior for the --s3randobj option between those two versions. How about trying the --opslog option to check that the accesses are random as intended?

breuner avatar Mar 17 '25 20:03 breuner

@russki: Does it help if you add --randalgo balanced to the command? I just became aware that the new linear congruential generator for random numbers introduced in v3.0-19 (which was supposed to help with full range coverage) is less random for certain parameters.
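
To illustrate how a generator can cover its full range and still be "less random for certain parameters", here is a minimal Python sketch. It is not elbencho's actual generator: the modulus, multiplier, and increment below are hypothetical values chosen only because they satisfy the standard full-period conditions for a linear congruential generator. The sequence visits every value exactly once, yet its lowest bit simply alternates.

```python
def lcg_sequence(modulus, multiplier, increment, seed, count):
    """Return `count` values of x -> (multiplier * x + increment) % modulus."""
    values = []
    x = seed
    for _ in range(count):
        x = (multiplier * x + increment) % modulus
        values.append(x)
    return values


# Hypothetical parameters (assumption, not taken from elbencho):
# power-of-two modulus, multiplier % 4 == 1, odd increment => full period.
MODULUS = 1 << 16
MULTIPLIER = 5
INCREMENT = 1

values = lcg_sequence(MODULUS, MULTIPLIER, INCREMENT, seed=0, count=MODULUS)

# Full range coverage: every value in [0, MODULUS) appears exactly once.
covers_full_range = sorted(values) == list(range(MODULUS))

# Weak low-order bits: the lowest bit flips on every single step.
low_bits = [v & 1 for v in values[:16]]
lowest_bit_alternates = all(
    low_bits[i] != low_bits[i + 1] for i in range(len(low_bits) - 1)
)

print("covers full range:", covers_full_range)
print("lowest bit alternates:", lowest_bit_alternates)
```

If such low-order bits were used to pick an object or offset, the resulting access pattern would be far more regular than intended, even though no value is ever skipped.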

breuner avatar Apr 07 '25 09:04 breuner

Hi @russki , were you able to find out anything new about this?

breuner avatar May 22 '25 21:05 breuner

@breuner So sorry for the delay. I tried it again today, and it still doesn't work as expected on v3.0-25.

For the same test:

elbencho:v3.0-17

COMMAND LINE:

"/usr/bin/elbencho" "--s3endpoints" "$endpoint" "--s3key" "xxxxx" "--s3secret" "xxxxx" "--hosts" "elbencho-[0-4].elbencho.elbencho.svc.cluster.local" "--resfile" "/tmp/results.txt" "--csvfile" "/tmp/results.csv" "--live1n" "--livecsv" "/tmp/live_results.csv" "--liveint" "30000" "-r" "-t" "40" "-s" "100m" "-b" "10m" "-n" "0" "-N" "20" "--label" "$endpoint" "--s3ignoreerrors" "--s3fastget" "--infloop" "--timelimit" "900" "--s3objprefix" "100mb" "--s3randobj" "elbencho"

Results show uncached performance

OPERATION   RESULT TYPE         FIRST DONE   LAST DONE
=========== ================    ==========   =========
READ        Elapsed time     :   15m0.273s  15m13.156s
            IOPS             :          14          14
            Throughput MiB/s :          80          78
            Total MiB        :       72052       72098

elbencho:v3.0-25

COMMAND LINE with --s3randobj --randalgo balanced

"/usr/bin/elbencho" "--s3endpoints" "$endpoint" "--hosts" "elbencho-[0-4].elbencho.elbencho.svc.cluster.local" "--resfile" "/tmp/results.txt" "--csvfile" "/tmp/results.csv" "--live1n" "--livecsv" "/tmp/live_results.csv" "--liveint" "30000" "-r" "-t" "40" "-s" "100m" "-b" "10m" "-n" "0" "-N" "20" "--label" "$endpoint" "--s3ignoreerrors" "--s3fastget" "--infloop" "--timelimit" "900" "--s3objprefix" "100mb" "--s3randobj" "--randalgo" "balanced" "elbencho"

Results show cached performance; --s3randobj is no longer working as expected

OPERATION   RESULT TYPE         FIRST DONE   LAST DONE
=========== ================    ==========   =========
READ        Elapsed time     :   15m0.250s  15m13.004s
            IOPS             :         241         237
            Throughput MiB/s :        2385        2353
            Total MiB        :     2147650     2148733

russki avatar May 30 '25 21:05 russki

@russki: This is still giving me a headache because I cannot find any difference between v3.0-17 and later releases like v3.0-25. With -r --s3randobj, both releases correctly select random objects and random offsets within those objects in all the tests that I ran.

Could you please upload the --opslog file and the --livecsv file here so that I can confirm the random selection? You would add --opslog /path/to/logfile.txt as a parameter to the command line that you sent; the file is then stored locally under this path by each of the 5 service instances in your command. One of the files will probably be enough, but if they are not too big then sending all 5 won't hurt either. (Opslog and livecsv files do not contain any sensitive information.)

I guess it is safe to assume that you are using the exact same value for --s3endpoints in both cases. The dataset that you create is 5 clients x 40 threads x 20 objects x 100 MB = 400GB. Is the memory of the server (or servers) in your test case large enough to cache this amount of data? Even if it is, that still doesn't seem to be the main explanation here, because the test with v3.0-17 was only able to read about 72GB, so it would take multiple 15-minute iterations to read the entire dataset into the cache (assuming nothing was in the cache initially and assuming the server does not prefetch an entire object when a random part of the object is read).
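
The sizing argument above can be checked with a few lines of Python. This is a rough sketch under stated assumptions: it treats -s 100m as 100 MiB per object, takes the 72052 MiB total from the v3.0-17 run as the amount read per 15-minute iteration, and optimistically assumes that no block is read twice, so the number of iterations is a lower bound.

```python
import math

# Values taken from the command line in this thread.
clients = 5              # --hosts elbencho-[0-4]
threads_per_client = 40  # -t 40
objects_per_thread = 20  # -N 20
object_size_mib = 100    # -s 100m (assumed here to mean MiB)

total_objects = clients * threads_per_client * objects_per_thread
dataset_mib = total_objects * object_size_mib

# Total MiB reported by the uncached v3.0-17 run after one 15-minute iteration.
mib_read_per_run = 72052

# Lower bound: assumes every read hits data that has not been read before.
runs_needed = math.ceil(dataset_mib / mib_read_per_run)

print(f"objects: {total_objects}")
print(f"dataset: {dataset_mib} MiB (~{dataset_mib / 1024:.0f} GiB)")
print(f"15-minute runs needed to touch everything: at least {runs_needed}")
```

With random offsets, some blocks are read repeatedly while others are missed, so the real number of iterations needed to warm a cache this way would be higher.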

breuner avatar Jun 15 '25 15:06 breuner

Hi @russki, I'm closing this issue after some time. If there is anything new to add, please feel free to re-open this issue or open a new one.

breuner avatar Feb 09 '26 17:02 breuner