kvrocks icon indicating copy to clipboard operation
kvrocks copied to clipboard

fix(scan): SCAN MATCH may incorrectly skip certain keys in some case

Open Genuineh opened this issue 6 months ago • 15 comments

It closes #3023.

Since there are no comments or related tests, I don't know what this old code does, but it seems to be for acceleration. However, it appears that when the key is not found and the HASH_SLOTS_MAX_ITERATIONS limit is triggered, this might cause some keys not to be scanned, leading to the 3023 bug.

    if (slot_id > slot_start + HASH_SLOTS_MAX_ITERATIONS) {
      if (keys->empty()) {
        if (iter->Valid()) {
          std::tie(std::ignore, user_key) = ExtractNamespaceKey<std::string>(iter->key(), storage_->IsSlotIdEncoded());
          auto res = std::mismatch(prefix.begin(), prefix.end(), user_key.begin());
          if (res.first == prefix.end() && util::StringMatch(suffix_glob, user_key.substr(prefix.size()))) {
            keys->emplace_back(user_key);
          }

          end_cursor->append(user_key);
        }
      } else {
        end_cursor->append(user_key);
      }

The main modifications are the addition of tests and changes to redis_db.cc, while others are caused by ./x.py format or test packages change.

For the testing, I copied most of the previous scan tests, created a file to retest under the condition of a (2*master-slave) cluster, and added the use of the go-redis clusterclient and its related code.

Additional Questions

1.In Go tests, it may be possible to use the UniversalClient to unify the clients for both cluster and single-node modes. 2.In the use cases other than replication, it seems that testing under the cluster scheme has not been attempted, which may pose risks. The implementation of some commands like scan may be related to the cluster to a certain extent.

@git-hulk Please review the code, thank you.

Genuineh avatar Jun 11 '25 10:06 Genuineh

@PragmaTwice please close this and i will create a new pr, thanks. If there are other precautions, you can let me know. I'm using Linux + Neovim.

Genuineh avatar Jun 11 '25 10:06 Genuineh

You can push new commits to this branch instead of closing it. (we'll squash them while merging.)

Also you can use ./x.py format with clang-format-14 to format your code.

PragmaTwice avatar Jun 11 '25 10:06 PragmaTwice

@PragmaTwice These code-style changes seem to be caused by executing ./x.py format.

Genuineh avatar Jun 11 '25 10:06 Genuineh

exec ./x.py format in 8a50e51c image

Genuineh avatar Jun 11 '25 10:06 Genuineh

@PragmaTwice Then can I not execute the format?

Genuineh avatar Jun 11 '25 10:06 Genuineh

Could you confirm that version of your clang-format is 14?

We use this version in our CI, and it will make the CI fail if the format behavior is different.

PragmaTwice avatar Jun 11 '25 10:06 PragmaTwice

@PragmaTwice Sorry, my clang-format version is 20.1.4, Let me see how to use clang-format-14 to format the code.

Genuineh avatar Jun 11 '25 10:06 Genuineh

You can download it here: https://github.com/llvm/llvm-project/releases/tag/llvmorg-14.0.6

PragmaTwice avatar Jun 11 '25 11:06 PragmaTwice

@PragmaTwice Should the changes to the go.sum file be reverted?

Genuineh avatar Jun 11 '25 11:06 Genuineh

@PragmaTwice I found the scan match xxxx much slower than before version in some case. I need to find the reason about it, Please do not merge

Genuineh avatar Jun 12 '25 11:06 Genuineh

@git-hulk @PragmaTwice Hi, I have a question about the string set in single mode that why some characters miss after AppendNamespacePrefix

rocksdb::Status String::Set(engine::Context &ctx, const std::string &user_key, const std::string &value,
                            StringSetArgs args, std::optional<std::string> &ret) {
  uint64_t expire = 0;
  info("Set key: {}, value: {}, expire: {}, type: {}, get: {}, keep_ttl: {}", user_key, value, args.expire,
       static_cast<int>(args.type), args.get, args.keep_ttl);
  std::string ns_key = AppendNamespacePrefix(user_key);
  info("Set key with namespace: {}, value: {}, expire: {}, type: {}, get: {}, keep_ttl: {}", ns_key, value, args.expire,
       static_cast<int>(args.type), args.get, args.keep_ttl);

  bool need_old_value = args.type != StringSetType::NONE || args.get || args.keep_ttl;
  if (need_old_value) {
    std::string old_value;
    uint64_t old_expire = 0;
    auto s = getValueAndExpire(ctx, ns_key, &old_value, &old_expire);
    if (!s.ok() && !s.IsNotFound() && !s.IsInvalidArgument()) return s;
    // GET option
    if (args.get) {
      if (s.IsInvalidArgument()) {
        return s;
      }
      if (s.IsNotFound()) {
        // if GET option given, the key didn't exist before: return nil
        ret = std::nullopt;
      } else {
        // if GET option given: return The previous value of the key.
        ret = std::make_optional(old_value);
      }
    }
    // NX/XX option
    if (args.type == StringSetType::NX && s.ok()) {
      // if NX option given, the key already exist: return nil
      if (!args.get) ret = std::nullopt;
      return rocksdb::Status::OK();
    } else if (args.type == StringSetType::XX && s.IsNotFound()) {
      // if XX option given, the key didn't exist before: return nil
      if (!args.get) ret = std::nullopt;
      return rocksdb::Status::OK();
    } else {
      // if GET option not given, make ret not nil
      if (!args.get) ret = "";
    }
    if (s.ok() && args.keep_ttl) {
      // if KEEPTTL option given, use the old ttl
      expire = old_expire;
    }
  } else {
    // if no option given, make ret not nil
    if (!args.get) ret = "";
  }

  // Handle expire time
  if (!args.keep_ttl) {
    expire = args.expire;
  }

  // Create new value
  std::string new_raw_value;
  Metadata metadata(kRedisString, false);
  metadata.expire = expire;
  metadata.Encode(&new_raw_value);
  new_raw_value.append(value);
  return updateRawValue(ctx, ns_key, new_raw_value);
}

image

Genuineh avatar Jun 16 '25 03:06 Genuineh

@git-hulk @PragmaTwice Hi, I have a question about the string set in single mode that why some characters miss after AppendNamespacePrefix

rocksdb::Status String::Set(engine::Context &ctx, const std::string &user_key, const std::string &value,
                            StringSetArgs args, std::optional<std::string> &ret) {
  uint64_t expire = 0;
  info("Set key: {}, value: {}, expire: {}, type: {}, get: {}, keep_ttl: {}", user_key, value, args.expire,
       static_cast<int>(args.type), args.get, args.keep_ttl);
  std::string ns_key = AppendNamespacePrefix(user_key);
  info("Set key with namespace: {}, value: {}, expire: {}, type: {}, get: {}, keep_ttl: {}", ns_key, value, args.expire,
       static_cast<int>(args.type), args.get, args.keep_ttl);

  bool need_old_value = args.type != StringSetType::NONE || args.get || args.keep_ttl;
  if (need_old_value) {
    std::string old_value;
    uint64_t old_expire = 0;
    auto s = getValueAndExpire(ctx, ns_key, &old_value, &old_expire);
    if (!s.ok() && !s.IsNotFound() && !s.IsInvalidArgument()) return s;
    // GET option
    if (args.get) {
      if (s.IsInvalidArgument()) {
        return s;
      }
      if (s.IsNotFound()) {
        // if GET option given, the key didn't exist before: return nil
        ret = std::nullopt;
      } else {
        // if GET option given: return The previous value of the key.
        ret = std::make_optional(old_value);
      }
    }
    // NX/XX option
    if (args.type == StringSetType::NX && s.ok()) {
      // if NX option given, the key already exist: return nil
      if (!args.get) ret = std::nullopt;
      return rocksdb::Status::OK();
    } else if (args.type == StringSetType::XX && s.IsNotFound()) {
      // if XX option given, the key didn't exist before: return nil
      if (!args.get) ret = std::nullopt;
      return rocksdb::Status::OK();
    } else {
      // if GET option not given, make ret not nil
      if (!args.get) ret = "";
    }
    if (s.ok() && args.keep_ttl) {
      // if KEEPTTL option given, use the old ttl
      expire = old_expire;
    }
  } else {
    // if no option given, make ret not nil
    if (!args.get) ret = "";
  }

  // Handle expire time
  if (!args.keep_ttl) {
    expire = args.expire;
  }

  // Create new value
  std::string new_raw_value;
  Metadata metadata(kRedisString, false);
  metadata.expire = expire;
  metadata.Encode(&new_raw_value);
  new_raw_value.append(value);
  return updateRawValue(ctx, ns_key, new_raw_value);
}

image

@Genuineh, I'm sorry for not quite understanding what the characters are missing.

git-hulk avatar Jun 16 '25 04:06 git-hulk

@git-hulk Sorry, it's my problem. I have blurry vision.

Genuineh avatar Jun 16 '25 05:06 Genuineh

@git-hulk The real cause of this bug, which resulted from a misunderstanding in the previous modification, should be the logical omission in the filtering of keys containing slot information under the cluster mode, leading to matching issues of keys with different prefixes in the same slot. This submission mainly adds the adaptation of matching strategies in related scenarios and relevant tests under the cluster environment. Preliminary simple performance tests have shown no obvious performance degradation. Due to the lack of a suitable performance benchmark testing project, larger-scale performance tests have not been carried out. In addition, as I am not familiar with C++, I am not sure whether the string allocation and matching strategies are optimal. Please focus on this, as string operations may be a performance-consuming part in loops. Thank you for your patience.

Genuineh avatar Jun 16 '25 05:06 Genuineh

@git-hulk I found a new problem of scan when input the empty prefix in unstable branch. The go test case like this

	t.Run("SCAN MATCH for empty prefix", func(t *testing.T) {
		require.NoError(t, rdb.FlushDB(ctx).Err())
		util.Populate(t, rdb, "key:", 10, 10)
		keys := scanAll(t, rdb, "match", "")
		require.Len(t, keys, 10)
	})

error

=== RUN   TestScanWithNumberCursor/SCAN_MATCH_for_empty_prefix
    /home/jerryg/github/forked/kvrocks/tests/gocase/unit/scan/scan_test.go:101:
                Error Trace:    /home/jerryg/github/forked/kvrocks/tests/gocase/unit/scan/scan_test.go:101
                Error:          "[]" should have 10 item(s), but has 0
                Test:           TestScanWithNumberCursor/SCAN_MATCH_for_empty_prefix

Need to fix it? Under normal circumstances, it should be rare for the input to be empty. But in Redis, it seems that this is supported. From the perspective of Redis compatibility, it is best to fix this issue, which occurs in both standalone and cluster scenarios. If necessary, I think a new merge request should be opened to address this issue after this submission is merged.

Genuineh avatar Jun 16 '25 06:06 Genuineh

@git-hulk The test case that reports the error by CI is copied from the old test case SCAN MATCH non-trivial pattern in scan_test.go; Why does the check pass on the old test?

error: `ba` should be `by`, `be`
  --> ./tests/gocase/unit/scan/scan_cluster_test.go:157:30

Genuineh avatar Jul 02 '25 06:07 Genuineh

@git-hulk The test case that reports the error by CI is copied from the old test case SCAN MATCH non-trivial pattern in scan_test.go; Why does the check pass on the old test?

error: `ba` should be `by`, `be`
  --> ./tests/gocase/unit/scan/scan_cluster_test.go:157:30

@git-hulk The test case that reports the error by CI is copied from the old test case SCAN MATCH non-trivial pattern in scan_test.go; Why does the check pass on the old test?

error: `ba` should be `by`, `be`
  --> ./tests/gocase/unit/scan/scan_cluster_test.go:157:30

Becuase the old file was excluded, refer https://github.com/apache/kvrocks/blob/unstable/.github/config/typos.toml#L27

git-hulk avatar Jul 02 '25 15:07 git-hulk

@git-hulk Thank you for your patient reply. For this type of situation, can we adopt a unified strategy to mark such changes, such as creating special folders or using special naming conventions? At the same time, it can be specified in the document which processes can be skipped for those with this feature, and possible solutions can be prompted when CI checks fail. This can reduce unnecessary mental burden and avoid the need to modify the CI code because of this. My current personal suggestion is to use special naming conventions, which can maintain the original project structure while ensuring good flexibility and scalability.

Genuineh avatar Jul 03 '25 01:07 Genuineh

@git-hulk Thank you for your patient reply. For this type of situation, can we adopt a unified strategy to mark such changes, such as creating special folders or using special naming conventions? At the same time, it can be specified in the document which processes can be skipped for those with this feature, and possible solutions can be prompted when CI checks fail. This can reduce unnecessary mental burden and avoid the need to modify the CI code because of this. My current personal suggestion is to use special naming conventions, which can maintain the original project structure while ensuring good flexibility and scalability.

Sorry for the late reply. I believe it's just a temporary workaround to pass the CI when introducing the typo check action. For the new code, I prefer changing values over excluding the new file.

git-hulk avatar Jul 07 '25 07:07 git-hulk

@git-hulk Thank you for your patient reply. For this type of situation, can we adopt a unified strategy to mark such changes, such as creating special folders or using special naming conventions? At the same time, it can be specified in the document which processes can be skipped for those with this feature, and possible solutions can be prompted when CI checks fail. This can reduce unnecessary mental burden and avoid the need to modify the CI code because of this. My current personal suggestion is to use special naming conventions, which can maintain the original project structure while ensuring good flexibility and scalability.

Sorry for the late reply. I believe it's just a temporary workaround to pass the CI when introducing the typo check action. For the new code, I prefer changing values over excluding the new file.

I also strongly support such a proposal on the premise that these special cases can be corrected.

Genuineh avatar Jul 07 '25 11:07 Genuineh

@git-hulk Excuse me, is this PR currently waiting for code review or does it need to pass CI first?

Genuineh avatar Jul 30 '25 02:07 Genuineh

@git-hulk Excuse me, is this PR currently waiting for code review or does it need to pass CI first?

Yes, it requires CI to be passed before merging.

git-hulk avatar Aug 02 '25 02:08 git-hulk

I have previously performed a rolling upgrade of the system. And build error after upgraded.

build error

ld.lld: error: undefined symbol: __atomic_compare_exchange
>>> referenced by slot_migrate.cc
>>>               CMakeFiles/kvrocks_objs.dir/src/cluster/kvrocks.lto.slot_migrate.cc.o:(SlotMigrator::PerformSlotRangeMigration(std::__cxx11::basic_string<char, std::char_
traits<char>, std::allocator<char>> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>&, int, SlotRange const&, SyncMigrateContext*))
clang++: error: linker command failed with exit code 1 (use -v to see invocation) 

clang version

clang --version 
clang version 20.1.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

gcc version

COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/15.1.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --enable-languages=ada,c,c++,d,fortran,go,lto,m2,objc,obj-c++,rust,cobol --enable-bootstrap --prefix=/usr --libdir=/usr/lib --
libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://gitlab.archlinux.org/archlinux/packaging/packages/gcc/-/issues --with-build-conf
ig=bootstrap-lto --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-cet=auto --enable-checking=release --enable-clocale=gnu --enable-default-pie
 --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-libstdcxx-backtrace --enable-link-serialization=1 --enable-linker-build-id --enable
-lto --enable-multilib --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-werror
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 15.1.1 20250425 (GCC)

build cmd

./x.py build --compiler=clang -DENABLE_OPENSSL=ON -DPORTABLE=1 -DCMAKE_BUILD_TYPE=Release -j $(nproc) 

My Reslove

Add list(APPEND EXTERNAL_LIBS atomic) line in CMakeLists.txt fixed.

@git-hulk If this issue needs to be addressed, you may need to first look into how to handle it. I will then consolidate everything before submitting it.

Genuineh avatar Aug 04 '25 01:08 Genuineh

Closed by #3146.

PragmaTwice avatar Aug 27 '25 13:08 PragmaTwice