Did the --s3randobj option change between v3.0-17 and v3.0-25?
We've recently upgraded from v3.0-17 to v3.0-25 and noticed that the test with --s3randobj is no longer sending requests random enough to be uncached.
All the runs with the --s3randobj option are being successfully cached.
Are we supposed to specify some extra parameters to get the old functionality back, or was a bug introduced somewhere between v3.0-17 and v3.0-25?
v3.0-25 example cmdline:
/usr/bin/elbencho --s3endpoints ${endpoint} --hosts elbencho-[0-4].elbencho.elbencho.svc.cluster.local --resfile /tmp/results.txt --csvfile /tmp/results.csv --live1n --livecsv /tmp/live_results.csv --liveint 30000 -r -t 40 -s 100m -b 10m -n 0 -N 20 --s3ignoreerrors --s3fastget --infloop --timelimit 900 --s3objprefix 100mb-obj --s3randobj elbencho
v3.0-25 throughput result (cached)
OPERATION RESULT TYPE FIRST DONE LAST DONE
=========== ================ ========== =========
READ: 0 files/s; 2439 MiB/s; 0 files; 73174 MiB; 200 threads; 4% CPU; 30s
READ: 0 files/s; 2384 MiB/s; 0 files; 144701 MiB; 200 threads; 4% CPU; 1m0s
READ: 0 files/s; 2365 MiB/s; 0 files; 215651 MiB; 200 threads; 4% CPU; 1m30s
v3.0-17 example cmdline:
/usr/bin/elbencho --s3endpoints ${endpoint} --s3key ${s3key} --s3secret ${s3secret} --hosts elbencho-[0-4].elbencho.elbencho.svc.cluster.local --resfile /tmp/results.txt --csvfile /tmp/results.csv --live1n --livecsv /tmp/live_results.csv --liveint 30000 -r -t 40 -s 100m -b 10m -n 0 -N 20 --s3ignoreerrors --s3fastget --infloop --timelimit 900 --s3objprefix 100mb-obj --s3randobj elbencho
v3.0-17 throughput result (uncached)
OPERATION RESULT TYPE FIRST DONE LAST DONE
=========== ================ ========== =========
READ: 0 files/s; 78 MiB/s; 0 files; 2343 MiB; 200 threads; 1% CPU; 30s
READ: 0 files/s; 78 MiB/s; 0 files; 4697 MiB; 200 threads; 2% CPU; 1m0s
READ: 0 files/s; 78 MiB/s; 0 files; 7059 MiB; 200 threads; 1% CPU; 1m30s
Hi @russki , this is very surprising, because I'm not aware of any change in behavior for the "--s3randobj" option between those two versions. How about trying the "--opslog" option to check that the accesses are random as intended?
@russki : Does it help if you add --randalgo balanced to the command? I just became aware that the new linear congruential generator for random numbers from v3.0-19 (which was supposed to help with full range coverage) is less random for certain parameters.
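For context on why an LCG can be "less random for certain parameters": with a power-of-two modulus, the low-order bits of a mixed LCG have very short periods. A minimal Python sketch of this effect (the glibc-style constants below are only illustrative; they are not necessarily the parameters elbencho v3.0-19 uses):

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Mixed linear congruential generator with a power-of-two modulus."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = lcg(1)
low_bits = [next(gen) & 1 for _ in range(8)]
print(low_bits)  # [0, 1, 0, 1, 0, 1, 0, 1] -- the lowest bit strictly alternates
```

Because bit k of such a generator has period at most 2^(k+1), any object-selection scheme that leans on the low-order bits of the raw value can degenerate into a near-deterministic access pattern even though the full 31-bit sequence looks fine.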
Hi @russki , were you able to find out anything new about this?
@breuner So sorry for the delay. I tried it again today with the same test, and it still doesn't work as expected on v3.0-25.
elbencho:v3.0-17
COMMAND LINE:
"/usr/bin/elbencho" "--s3endpoints" "$endpoint" "--s3key" "xxxxx" "--s3secret" "xxxxx" "--hosts" "elbencho-[0-4].elbencho.elbencho.svc.cluster.local" "--resfile" "/tmp/results.txt" "--csvfile" "/tmp/results.csv" "--live1n" "--livecsv" "/tmp/live_results.csv" "--liveint" "30000" "-r" "-t" "40" "-s" "100m" "-b" "10m" "-n" "0" "-N" "20" "--label" "$endpoint" "--s3ignoreerrors" "--s3fastget" "--infloop" "--timelimit" "900" "--s3objprefix" "100mb" "--s3randobj" "elbencho"
Results show uncached performance
OPERATION RESULT TYPE FIRST DONE LAST DONE
=========== ================ ========== =========
READ Elapsed time : 15m0.273s 15m13.156s
IOPS : 14 14
Throughput MiB/s : 80 78
Total MiB : 72052 72098
elbencho:v3.0-25
COMMAND LINE with --s3randobj --randalgo balanced:
"/usr/bin/elbencho" "--s3endpoints" "$endpoint" "--hosts" "elbencho-[0-4].elbencho.elbencho.svc.cluster.local" "--resfile" "/tmp/results.txt" "--csvfile" "/tmp/results.csv" "--live1n" "--livecsv" "/tmp/live_results.csv" "--liveint" "30000" "-r" "-t" "40" "-s" "100m" "-b" "10m" "-n" "0" "-N" "20" "--label" "$endpoint" "--s3ignoreerrors" "--s3fastget" "--infloop" "--timelimit" "900" "--s3objprefix" "100mb" "--s3randobj" "--randalgo" "balanced" "elbencho"
Results show cached performance; --s3randobj is no longer working as expected
OPERATION RESULT TYPE FIRST DONE LAST DONE
=========== ================ ========== =========
READ Elapsed time : 15m0.250s 15m13.004s
IOPS : 241 237
Throughput MiB/s : 2385 2353
Total MiB : 2147650 2148733
@russki : This is still giving me a headache, because I cannot find any difference between v3.0-17 and later releases like v3.0-25. When using -r --s3randobj, both releases correctly select random objects and random offsets within those objects in all the tests that I ran.
Could you please upload the --opslog file and the --livecsv file here so that I can confirm the random selection? The --opslog /path/to/logfile.txt parameter gets added to the command line that you sent, and the file will then be stored locally under this path by the 5 service instances in your command. Probably one of the files will be enough, but if they are not too big then sending all 5 also won't hurt. (Opslog and livecsv files do not contain any sensitive information.)
I guess it is safe to assume that you are using the exact same value for --s3endpoints in both cases. The dataset that you create is 5 clients x 40 threads x 20 objects x 100 MB = 400GB. Is the memory of the server (or servers) in your test case large enough to cache this amount of data? But even if it is, that still doesn't seem like the main explanation here, because the v3.0-17 test was only able to read about 72GB, so it would take multiple 15-minute iterations to read the entire dataset into the cache (assuming nothing was in the cache initially and assuming the server does not prefetch an entire object when a random part of it is read).
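A quick back-of-the-envelope check of those numbers, using only the figures quoted above (total dataset size from the command-line parameters, and the ~72,052 MiB total read in the 15-minute v3.0-17 run):

```python
# Dataset size implied by the command line: 5 hosts, -t 40, -N 20, -s 100m
clients, threads_per_client, objs_per_thread = 5, 40, 20
obj_size_mib = 100

total_mib = clients * threads_per_client * objs_per_thread * obj_size_mib
print(total_mib)  # 400000 MiB, i.e. roughly 400 GB

# At the uncached v3.0-17 rate, one 15-minute run read ~72,052 MiB,
# so fully warming a cold cache would take several such runs:
mib_per_run = 72052
runs_to_warm = total_mib / mib_per_run
print(round(runs_to_warm, 1))  # 5.6
```

So, under these assumptions, it would take roughly five to six full 15-minute iterations before the entire dataset could sit in a cache, which is why a single run showing fully cached throughput is hard to explain by server-side caching alone.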
Hi @russki , I'm closing this issue after some time. If there is anything new to add then of course please feel free to re-open this issue or a new one.