fix(scan): SCAN MATCH may incorrectly skip certain keys in some case
It closes #3023.
Since there are no comments or related tests, I don't know what this old code does, but it seems to be for acceleration. However, it appears that when the key is not found and the HASH_SLOTS_MAX_ITERATIONS limit is triggered, this might cause some keys not to be scanned, leading to the 3023 bug.
if (slot_id > slot_start + HASH_SLOTS_MAX_ITERATIONS) {
if (keys->empty()) {
if (iter->Valid()) {
std::tie(std::ignore, user_key) = ExtractNamespaceKey<std::string>(iter->key(), storage_->IsSlotIdEncoded());
auto res = std::mismatch(prefix.begin(), prefix.end(), user_key.begin());
if (res.first == prefix.end() && util::StringMatch(suffix_glob, user_key.substr(prefix.size()))) {
keys->emplace_back(user_key);
}
end_cursor->append(user_key);
}
} else {
end_cursor->append(user_key);
}
The main modifications are the addition of tests and changes to redis_db.cc, while others are caused by ./x.py format or test packages change.
For the testing, I copied most of the previous scan tests, created a file to retest under the condition of a (2*master-slave) cluster, and added the use of the go-redis clusterclient and its related code.
Additional Questions
1.In Go tests, it may be possible to use the UniversalClient to unify the clients for both cluster and single-node modes.
2.In the use cases other than replication, it seems that testing under the cluster scheme has not been attempted, which may pose risks. The implementation of some commands like scan may be related to the cluster to a certain extent.
@git-hulk Please review the code, thank you.
@PragmaTwice please close this and i will create a new pr, thanks. If there are other precautions, you can let me know. I'm using Linux + Neovim.
You can push new commits to this branch instead of closing it. (we'll squash them while merging.)
Also you can use ./x.py format with clang-format-14 to format your code.
@PragmaTwice These code-style changes seem to be caused by executing ./x.py format.
exec ./x.py format in 8a50e51c
@PragmaTwice Then can I not execute the format?
Could you confirm that version of your clang-format is 14?
We use this version in our CI, and it will make the CI fail if the format behavior is different.
@PragmaTwice Sorry, my clang-format version is 20.1.4, Let me see how to use clang-format-14 to format the code.
You can download it here: https://github.com/llvm/llvm-project/releases/tag/llvmorg-14.0.6
@PragmaTwice Should the changes to the go.sum file be reverted?
@PragmaTwice I found the scan match xxxx much slower than before version in some case. I need to find the reason about it, Please do not merge
@git-hulk @PragmaTwice Hi, I have a question about the string set in single mode that why some characters miss after AppendNamespacePrefix
rocksdb::Status String::Set(engine::Context &ctx, const std::string &user_key, const std::string &value,
StringSetArgs args, std::optional<std::string> &ret) {
uint64_t expire = 0;
info("Set key: {}, value: {}, expire: {}, type: {}, get: {}, keep_ttl: {}", user_key, value, args.expire,
static_cast<int>(args.type), args.get, args.keep_ttl);
std::string ns_key = AppendNamespacePrefix(user_key);
info("Set key with namespace: {}, value: {}, expire: {}, type: {}, get: {}, keep_ttl: {}", ns_key, value, args.expire,
static_cast<int>(args.type), args.get, args.keep_ttl);
bool need_old_value = args.type != StringSetType::NONE || args.get || args.keep_ttl;
if (need_old_value) {
std::string old_value;
uint64_t old_expire = 0;
auto s = getValueAndExpire(ctx, ns_key, &old_value, &old_expire);
if (!s.ok() && !s.IsNotFound() && !s.IsInvalidArgument()) return s;
// GET option
if (args.get) {
if (s.IsInvalidArgument()) {
return s;
}
if (s.IsNotFound()) {
// if GET option given, the key didn't exist before: return nil
ret = std::nullopt;
} else {
// if GET option given: return The previous value of the key.
ret = std::make_optional(old_value);
}
}
// NX/XX option
if (args.type == StringSetType::NX && s.ok()) {
// if NX option given, the key already exist: return nil
if (!args.get) ret = std::nullopt;
return rocksdb::Status::OK();
} else if (args.type == StringSetType::XX && s.IsNotFound()) {
// if XX option given, the key didn't exist before: return nil
if (!args.get) ret = std::nullopt;
return rocksdb::Status::OK();
} else {
// if GET option not given, make ret not nil
if (!args.get) ret = "";
}
if (s.ok() && args.keep_ttl) {
// if KEEPTTL option given, use the old ttl
expire = old_expire;
}
} else {
// if no option given, make ret not nil
if (!args.get) ret = "";
}
// Handle expire time
if (!args.keep_ttl) {
expire = args.expire;
}
// Create new value
std::string new_raw_value;
Metadata metadata(kRedisString, false);
metadata.expire = expire;
metadata.Encode(&new_raw_value);
new_raw_value.append(value);
return updateRawValue(ctx, ns_key, new_raw_value);
}
@git-hulk @PragmaTwice Hi, I have a question about the string set in single mode that why some characters miss after
AppendNamespacePrefixrocksdb::Status String::Set(engine::Context &ctx, const std::string &user_key, const std::string &value, StringSetArgs args, std::optional<std::string> &ret) { uint64_t expire = 0; info("Set key: {}, value: {}, expire: {}, type: {}, get: {}, keep_ttl: {}", user_key, value, args.expire, static_cast<int>(args.type), args.get, args.keep_ttl); std::string ns_key = AppendNamespacePrefix(user_key); info("Set key with namespace: {}, value: {}, expire: {}, type: {}, get: {}, keep_ttl: {}", ns_key, value, args.expire, static_cast<int>(args.type), args.get, args.keep_ttl); bool need_old_value = args.type != StringSetType::NONE || args.get || args.keep_ttl; if (need_old_value) { std::string old_value; uint64_t old_expire = 0; auto s = getValueAndExpire(ctx, ns_key, &old_value, &old_expire); if (!s.ok() && !s.IsNotFound() && !s.IsInvalidArgument()) return s; // GET option if (args.get) { if (s.IsInvalidArgument()) { return s; } if (s.IsNotFound()) { // if GET option given, the key didn't exist before: return nil ret = std::nullopt; } else { // if GET option given: return The previous value of the key. ret = std::make_optional(old_value); } } // NX/XX option if (args.type == StringSetType::NX && s.ok()) { // if NX option given, the key already exist: return nil if (!args.get) ret = std::nullopt; return rocksdb::Status::OK(); } else if (args.type == StringSetType::XX && s.IsNotFound()) { // if XX option given, the key didn't exist before: return nil if (!args.get) ret = std::nullopt; return rocksdb::Status::OK(); } else { // if GET option not given, make ret not nil if (!args.get) ret = ""; } if (s.ok() && args.keep_ttl) { // if KEEPTTL option given, use the old ttl expire = old_expire; } } else { // if no option given, make ret not nil if (!args.get) ret = ""; } // Handle expire time if (!args.keep_ttl) { expire = args.expire; } // Create new value std::string new_raw_value; Metadata metadata(kRedisString, false); metadata.expire = expire; metadata.Encode(&new_raw_value); new_raw_value.append(value); return updateRawValue(ctx, ns_key, new_raw_value); }
@Genuineh, I'm sorry for not quite understanding what the characters are missing.
@git-hulk Sorry, it's my problem. I have blurry vision.
@git-hulk The real cause of this bug, which resulted from a misunderstanding in the previous modification, should be the logical omission in the filtering of keys containing slot information under the cluster mode, leading to matching issues of keys with different prefixes in the same slot. This submission mainly adds the adaptation of matching strategies in related scenarios and relevant tests under the cluster environment. Preliminary simple performance tests have shown no obvious performance degradation. Due to the lack of a suitable performance benchmark testing project, larger-scale performance tests have not been carried out. In addition, as I am not familiar with C++, I am not sure whether the string allocation and matching strategies are optimal. Please focus on this, as string operations may be a performance-consuming part in loops. Thank you for your patience.
@git-hulk I found a new problem of scan when input the empty prefix in unstable branch. The go test case like this
t.Run("SCAN MATCH for empty prefix", func(t *testing.T) {
require.NoError(t, rdb.FlushDB(ctx).Err())
util.Populate(t, rdb, "key:", 10, 10)
keys := scanAll(t, rdb, "match", "")
require.Len(t, keys, 10)
})
error
=== RUN TestScanWithNumberCursor/SCAN_MATCH_for_empty_prefix
/home/jerryg/github/forked/kvrocks/tests/gocase/unit/scan/scan_test.go:101:
Error Trace: /home/jerryg/github/forked/kvrocks/tests/gocase/unit/scan/scan_test.go:101
Error: "[]" should have 10 item(s), but has 0
Test: TestScanWithNumberCursor/SCAN_MATCH_for_empty_prefix
Need to fix it? Under normal circumstances, it should be rare for the input to be empty. But in Redis, it seems that this is supported. From the perspective of Redis compatibility, it is best to fix this issue, which occurs in both standalone and cluster scenarios. If necessary, I think a new merge request should be opened to address this issue after this submission is merged.
@git-hulk The test case that reports the error by CI is copied from the old test case SCAN MATCH non-trivial pattern in scan_test.go; Why does the check pass on the old test?
error: `ba` should be `by`, `be`
--> ./tests/gocase/unit/scan/scan_cluster_test.go:157:30
@git-hulk The test case that reports the error by CI is copied from the old test case
SCAN MATCH non-trivial patterninscan_test.go; Why does the check pass on the old test?error: `ba` should be `by`, `be` --> ./tests/gocase/unit/scan/scan_cluster_test.go:157:30
@git-hulk The test case that reports the error by CI is copied from the old test case
SCAN MATCH non-trivial patterninscan_test.go; Why does the check pass on the old test?error: `ba` should be `by`, `be` --> ./tests/gocase/unit/scan/scan_cluster_test.go:157:30
Becuase the old file was excluded, refer https://github.com/apache/kvrocks/blob/unstable/.github/config/typos.toml#L27
@git-hulk Thank you for your patient reply. For this type of situation, can we adopt a unified strategy to mark such changes, such as creating special folders or using special naming conventions? At the same time, it can be specified in the document which processes can be skipped for those with this feature, and possible solutions can be prompted when CI checks fail. This can reduce unnecessary mental burden and avoid the need to modify the CI code because of this. My current personal suggestion is to use special naming conventions, which can maintain the original project structure while ensuring good flexibility and scalability.
@git-hulk Thank you for your patient reply. For this type of situation, can we adopt a unified strategy to mark such changes, such as creating special folders or using special naming conventions? At the same time, it can be specified in the document which processes can be skipped for those with this feature, and possible solutions can be prompted when CI checks fail. This can reduce unnecessary mental burden and avoid the need to modify the CI code because of this. My current personal suggestion is to use special naming conventions, which can maintain the original project structure while ensuring good flexibility and scalability.
Sorry for the late reply. I believe it's just a temporary workaround to pass the CI when introducing the typo check action. For the new code, I prefer changing values over excluding the new file.
@git-hulk Thank you for your patient reply. For this type of situation, can we adopt a unified strategy to mark such changes, such as creating special folders or using special naming conventions? At the same time, it can be specified in the document which processes can be skipped for those with this feature, and possible solutions can be prompted when CI checks fail. This can reduce unnecessary mental burden and avoid the need to modify the CI code because of this. My current personal suggestion is to use special naming conventions, which can maintain the original project structure while ensuring good flexibility and scalability.
Sorry for the late reply. I believe it's just a temporary workaround to pass the CI when introducing the typo check action. For the new code, I prefer changing values over excluding the new file.
I also strongly support such a proposal on the premise that these special cases can be corrected.
@git-hulk Excuse me, is this PR currently waiting for code review or does it need to pass CI first?
@git-hulk Excuse me, is this PR currently waiting for code review or does it need to pass CI first?
Yes, it requires CI to be passed before merging.
I have previously performed a rolling upgrade of the system. And build error after upgraded.
build error
ld.lld: error: undefined symbol: __atomic_compare_exchange
>>> referenced by slot_migrate.cc
>>> CMakeFiles/kvrocks_objs.dir/src/cluster/kvrocks.lto.slot_migrate.cc.o:(SlotMigrator::PerformSlotRangeMigration(std::__cxx11::basic_string<char, std::char_
traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>&, int, SlotRange const&, SyncMigrateContext*))
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
clang version
clang --version
clang version 20.1.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
gcc version
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/15.1.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --enable-languages=ada,c,c++,d,fortran,go,lto,m2,objc,obj-c++,rust,cobol --enable-bootstrap --prefix=/usr --libdir=/usr/lib --
libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://gitlab.archlinux.org/archlinux/packaging/packages/gcc/-/issues --with-build-conf
ig=bootstrap-lto --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-cet=auto --enable-checking=release --enable-clocale=gnu --enable-default-pie
--enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-libstdcxx-backtrace --enable-link-serialization=1 --enable-linker-build-id --enable
-lto --enable-multilib --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-werror
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.1.1 20250425 (GCC)
build cmd
./x.py build --compiler=clang -DENABLE_OPENSSL=ON -DPORTABLE=1 -DCMAKE_BUILD_TYPE=Release -j $(nproc)
My Reslove
Add list(APPEND EXTERNAL_LIBS atomic) line in CMakeLists.txt fixed.
@git-hulk If this issue needs to be addressed, you may need to first look into how to handle it. I will then consolidate everything before submitting it.
Closed by #3146.
