Move mutation_fragment_v2::kind into mutation_fragment_v2::data, mutation_fragment::kind into mutation_fragment::data
Move mutation_fragment_v2::kind field into mutation_fragment_v2::data. Move mutation_fragment::kind field into mutation_fragment::data.
In both cases the move reduces size of the object by half (to 8 bytes).
This is partial fix to https://github.com/scylladb/scylla-enterprise/issues/5288 issue.
Refs is missing, if relevant. Backport? I assume it's improvement, so maybe no?
@mykaul refs as link to original issue?
Yea, improvement, i guess backport woudn't hurt. What do i do to (hopefully automatically) backport this patch to other versions?
@mykaul refs as link to original issue?
Yea, improvement, i guess backport woudn't hurt. What do i do to (hopefully automatically) backport this patch to other versions?
Add relevant backport/version label and a convincing argument. Wouldn't hurt isn't a good one :-)
We don't backport minor improvements.
@mykaul refs as link to original issue?
Yea, improvement, i guess backport woudn't hurt. What do i do to (hopefully automatically) backport this patch to other versions?
Backports always hurt. That's why we have a backport reason field. Or we would backport every single commit.
I don't think we need to touch mutation_fragment (not _v2), since it's practically unused (it's used in mixed clusters).
I think only the second patch actually does anything.
I don't expect any regressions, but please run perf-simple-query to verify. Sample command line: scylla perf-simple-query --smp 1 -m 1G. Typically QPS is too noisy so instructions-per-op is the interesting metric.
:red_circle: CI State: FAILURE
:white_check_mark: - Framework test :white_check_mark: - Build :white_check_mark: - dtest with gossip topology changes :white_check_mark: - dtest with consistent topology changes :white_check_mark: - dtest with tablets :x: - Unit Tests
Failed Tests (1/43158):
Build Details:
- Duration: 6 hr 48 min
- Builder: i-00ff68defd9533bd5 (m5d.16xlarge)
My sugestions:
- Improve justification for the changes in the lead message. One cannot know what are these
mutation_fragments used in without inspecting the source first or without their own prior experience; - Reference the issue in the lead message, indicating it is only a partial fix.
There are already 3 code reviewers assigned, so that's it from me.
:red_circle: CI State: FAILURE
:white_check_mark: - Framework test :x: - Build
Build Details:
- Duration: 44 min
- Builder: i-0f4a4a369a71222cc (m5d.16xlarge)
@avikivity I run the command you suggested against master and my patch:
master:
192660.09 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48340 insns/op, 22552 cycles/op, 0 errors)
my patch:
195584.84 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48355 insns/op, 22547 cycles/op, 0 errors)
the values for insns/op don't change much (+-0.2% - around +-100), i assume it's fine.
I don't think we need to touch mutation_fragment (not _v2), since it's practically unused (it's used in mixed clusters).
I think only the second patch actually does anything.
Unless there're good reasons i'd keep both changes for completness. Otherwise next noob like me will be going what-is-going-here and will propose the missing change anyway.
:red_circle: CI State: FAILURE
:white_check_mark: - Framework test :x: - Build
Build Details:
- Duration: 1 hr 52 min
- Builder: spider4.cloudius-systems.com
@avikivity I run the command you suggested against master and my patch:
master:
192660.09 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48340 insns/op, 22552 cycles/op, 0 errors)my patch:
195584.84 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48355 insns/op, 22547 cycles/op, 0 errors)the values for
insns/opdon't change much (+-0.2% - around +-100), i assume it's fine.
100 insns for such a small change is crazy. But it actually shows 15 insns/op (which is within noise).
I don't think we need to touch mutation_fragment (not _v2), since it's practically unused (it's used in mixed clusters). I think only the second patch actually does anything.
Unless there're good reasons i'd keep both changes for completness. Otherwise next noob like me will be going
what-is-going-hereand will propose the missing change anyway.
Ok. There's some risk here, since the non-v1 code is hardly used so the change won't get much testing.
100 insns for such a small change is crazy. But it actually shows 15 insns/op (which is within noise).
I meant for the single run that line is printed like 5 times and those 5 values tend to differ in 100 range (so let's say in 48290 - 48390 range). For example:
189729.71 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48311 insns/op, 22973 cycles/op, 0 errors)
INFO 2025-03-27 14:55:05,021 [shard 0:main] group0_tombstone_gc_handler - Setting reconcile time to 1743083702 (min id=14d8063e-0b13-11f0-f84e-33a407421cd0)
191495.45 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48369 insns/op, 22849 cycles/op, 0 errors)
194562.57 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48342 insns/op, 22698 cycles/op, 0 errors)
193388.16 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48353 insns/op, 22666 cycles/op, 0 errors)
192660.09 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48340 insns/op, 22552 cycles/op, 0 errors)
insns/op seems to be using PERF_COUNT_HW_INSTRUCTIONS underneath. I'm curious, why are we checking it? I'd expect it to be slightly bigger, because i'd expect code output to get two instructions instead of one to fetch kind. I'd also expect it to be completely meaningless, because in normal operation (ie when scylla actually does something interesting on more than one core) i'd expect scylla to be memory starved (like anything else pretty much) and thus additional instruction is essentially free. My performance worry would be additional memory hop i've introduced (you need to read memory to pointer to read memory to get kind, but the source code usually reads read of the pointed data anyway, so this should be free as well. Am i missing something?
:red_circle: CI State: FAILURE
:white_check_mark: - Framework test :x: - Build :x: - Code Coverage
Build Details:
- Duration: 1 hr 5 min
- Builder: spider5.cloudius-systems.com
:red_circle: CI State: FAILURE
:white_check_mark: - Framework test :x: - Build :x: - Code Coverage
Build Details:
- Duration: 1 hr 22 min
- Builder: spider2.cloudius-systems.com
:green_circle: CI State: SUCCESS
:white_check_mark: - Framework test :white_check_mark: - Build :white_check_mark: - dtest with consistent topology changes :white_check_mark: - dtest with tablets :white_check_mark: - dtest with gossip topology changes :white_check_mark: - Unit Tests
Build Details:
- Duration: 6 hr 20 min
- Builder: spider8.cloudius-systems.com
:red_circle: CI State: FAILURE
:white_check_mark: - Framework test :white_check_mark: - Build :x: - dtest with tablets :x: - dtest with gossip topology changes :x: - dtest with consistent topology changes :white_check_mark: - Unit Tests :x: - Code Coverage
Build Details:
- Duration: 6 hr 48 min
- Builder: spider7.cloudius-systems.com
100 insns for such a small change is crazy. But it actually shows 15 insns/op (which is within noise).
I meant for the single run that line is printed like 5 times and those 5 values tend to differ in 100 range (so let's say in 48290 - 48390 range). For example:
Those are 5 different 1-second samples.
189729.71 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48311 insns/op, 22973 cycles/op, 0 errors) INFO 2025-03-27 14:55:05,021 [shard 0:main] group0_tombstone_gc_handler - Setting reconcile time to 1743083702 (min id=14d8063e-0b13-11f0-f84e-33a407421cd0) 191495.45 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48369 insns/op, 22849 cycles/op, 0 errors) 194562.57 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48342 insns/op, 22698 cycles/op, 0 errors) 193388.16 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48353 insns/op, 22666 cycles/op, 0 errors) 192660.09 tps ( 71.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 48340 insns/op, 22552 cycles/op, 0 errors)
Deviation is a lot less than 100.
insns/opseems to be using PERF_COUNT_HW_INSTRUCTIONS underneath. I'm curious, why are we checking it? I'd expect it to be slightly bigger, because i'd expect code output to get two instructions instead of one to fetch kind. I'd also expect it to be completely meaningless, because in normal operation (ie when scylla actually does something interesting on more than one core) i'd expect scylla to be memory starved (like anything else pretty much) and thus additional instruction is essentially free.
Our worst enemy is instruction cache pressure and so extra instruction footprint is expensive. You can read more here.
My performance worry would be additional memory hop i've introduced (you need to read memory to pointer to read memory to get kind, but the source code usually reads read of the pointed data anyway, so this should be free as well. Am i missing something?
No, I there's no data cache problems here.
(well I can bring up a problem: the branch on ->kind is not well predicted since it's user data dependent, so it takes longer to resolve a mispredict and if the processor can't find other work it will have bubbles)
:green_circle: CI State: SUCCESS
:white_check_mark: - Framework test :white_check_mark: - Build :white_check_mark: - dtest with tablets :white_check_mark: - dtest with consistent topology changes :white_check_mark: - dtest with gossip topology changes :white_check_mark: - Unit Tests
Build Details:
- Duration: 7 hr 42 min
- Builder: spider2.cloudius-systems.com
What's the code coverage failure? @yaronkaikov every day some new random thing.
Code coverage build is broken (was it ever working?)
clang++ -MD -MT build/release/main.o -MF build/release/main.o.d -std=gnu++23 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/release/seastar/gen/include -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -I/usr/include/p11-kit-1 -g -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -DSEASTAR_NO_EXCEPTION_HACK -march=westmere -Wno-backend-plugin -flto=thin -ffat-lto-objects -fprofile-use=/jenkins/workspace/scylla-master/scylla-ci/scylla/build/profile.profdata -mllvm -pgso=false -mllvm -enable-value-profiling=false -ffunction-sections -fdata-sections -fprofile-instr-generate -fcoverage-mapping -fprofile-list=./coverage_sources.list -O3 -mllvm -inline-threshold=2500 -fno-slp-vectorize -DSCYLLA_BUILD_MODE=release -g -gz -Xclang -fexperimental-assignment-tracking=disabled -iquote. -iquote build/release/gen -std=gnu++23 -g -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -DSEASTAR_NO_EXCEPTION_HACK -march=westmere -DBOOST_ALL_DYN_LINK -fvisibility=hidden -Ikmipc/kmipc-2.1.0t-rhel84_64/include -DHAVE_KMIP -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -c -o build/release/main.o main.cc
clang++: error: invalid argument '-fprofile-instr-generate' not allowed with '-fprofile-use='
What's the code coverage failure? @yaronkaikov every day some new random thing.
i am not sure why @radoslawcybulski checked this option manually in the Scylla-CI job :-), but it;s not yet working (this is why it set to FALSE by default)
Code coverage build is broken (was it ever working?)
clang++ -MD -MT build/release/main.o -MF build/release/main.o.d -std=gnu++23 -I/jenkins/workspace/scylla-master/scylla-ci/scylla/seastar/include -I/jenkins/workspace/scylla-master/scylla-ci/scylla/build/release/seastar/gen/include -Werror=unused-result -DSEASTAR_API_LEVEL=7 -DSEASTAR_SSTRING -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_SCHEDULING_GROUPS_COUNT=19 -DSEASTAR_LOGGER_TYPE_STDOUT -DBOOST_PROGRAM_OPTIONS_NO_LIB -DBOOST_PROGRAM_OPTIONS_DYN_LINK -DBOOST_THREAD_NO_LIB -DBOOST_THREAD_DYN_LINK -DFMT_SHARED -I/usr/include/p11-kit-1 -g -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -DSEASTAR_NO_EXCEPTION_HACK -march=westmere -Wno-backend-plugin -flto=thin -ffat-lto-objects -fprofile-use=/jenkins/workspace/scylla-master/scylla-ci/scylla/build/profile.profdata -mllvm -pgso=false -mllvm -enable-value-profiling=false -ffunction-sections -fdata-sections -fprofile-instr-generate -fcoverage-mapping -fprofile-list=./coverage_sources.list -O3 -mllvm -inline-threshold=2500 -fno-slp-vectorize -DSCYLLA_BUILD_MODE=release -g -gz -Xclang -fexperimental-assignment-tracking=disabled -iquote. -iquote build/release/gen -std=gnu++23 -g -ffile-prefix-map=/jenkins/workspace/scylla-master/scylla-ci/scylla=. -DSEASTAR_NO_EXCEPTION_HACK -march=westmere -DBOOST_ALL_DYN_LINK -fvisibility=hidden -Ikmipc/kmipc-2.1.0t-rhel84_64/include -DHAVE_KMIP -isystem abseil -Wall -Werror -Wextra -Wimplicit-fallthrough -Wno-mismatched-tags -Wno-c++11-narrowing -Wno-overloaded-virtual -Wno-unused-parameter -Wno-unsupported-friend -Wno-missing-field-initializers -Wno-deprecated-copy -Wno-enum-constexpr-conversion -Wno-error=deprecated-declarations -DXXH_PRIVATE_API -DSEASTAR_TESTING_MAIN -c -o build/release/main.o main.cc clang++: error: invalid argument '-fprofile-instr-generate' not allowed with '-fprofile-use='
it is not broken as it never worked. we have a task to deal with it (Q4 i believe). Please ignore it,
@radoslawcybulski was experimenting with code coverage. Unfortunately he didn't manage to complete it as he got sick and will be out of office this week.
I started with the coverage button on the jenkins page, it doesn't work as well, because of assert in some python library, which does lcov files manipulation. Here is an error from old run:
15:10:39 FileNotFoundError: [Errno 2] No such file or directory: '/jenkins/workspace/scylla-master/scylla-ci/scylla/testlog/x86_64/test_coverage.info'
(the log is here https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/16457/consoleFull)
I've managed to configure code coverage in "normal" way, but this will give you report for all files across scylladb, which i don't think is what we need. There's no easy, already-done way (i can find) to limit coderage output only to changes in given branch. I assume this is what the python library tries to attempt (note, that lcov internal format is up to change at any moment, so the whole thing might be fragile).
:green_circle: CI State: SUCCESS
:white_check_mark: - Framework test :white_check_mark: - Build :white_check_mark: - dtest with consistent topology changes :white_check_mark: - dtest with tablets :white_check_mark: - dtest with gossip topology changes :white_check_mark: - Unit Tests
Build Details:
- Duration: 6 hr 54 min
- Builder: i-04538998a930d6c38 (m5d.16xlarge)
@denesb please review and in particular whether it interferes or interacts, or doesn't, with your renaming PR https://github.com/scylladb/scylladb/pull/23767
I started with the coverage button on the jenkins page, it doesn't work as well, because of assert in some python library, which does lcov files manipulation. Here is an error from old run:
15:10:39 FileNotFoundError: [Errno 2] No such file or directory: '/jenkins/workspace/scylla-master/scylla-ci/scylla/testlog/x86_64/test_coverage.info'
(the log is here https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/16457/consoleFull)
Presumably, if someone wants to, this code, which was written by an employee that has left (Eliran), can be fixed without much difficulty. He already did a lot of work on this, and presumably fixing some small change in the format will be easier than redoing everything from scratch.
I've managed to configure code coverage in "normal" way, but this will give you report for all files across scylladb, which i don't think is what we need.
This is also useful, and in fact, I always personally found that - the ability to do code coverage on the entire Scylla - be the more interesting feature, because it allows to find code in the existing code base which needs more testing. I think it is important for managers or maintainers dealing with a particular subsystem (e.g., Alternator, Repair, or whatever) to know which part of their code isn't being tested and needs more tests. This is especially important before thinking of doing refactoring on any code, because if you refactor untested code, you will have no idea if you broke it.
There's no easy, already-done way (i can find) to limit coderage output only to changes in given branch.
Yes, and this is exactly was one of the new things that Eliran did in this work - being able to generate a nice view of whether new lines in a given patch got tested or not. I had some disagreement with him on whether this was the most important use for code coverage, but I do agree that even if it's not the most important, it is a useful feature to have. But no useful feature can live if nobody cares about it an maintain it.
I assume this is what the python library tries to attempt (note, that lcov internal format is up to change at any moment, so the whole thing might be fragile).
I guess so, but something being "fragile" doesn't automatically mean it's bad. Fragile and good is better than stable crap - but it does require somebody to care about it and continue maintaining it. Did you see docs/dev/code-coverage.md where Eliran explained what he was doing?