
Implement RecomputeKMeans scan at the data shard side (#19154)

Open vitalif opened this issue 6 months ago • 6 comments

Changelog category

  • Not for changelog (changelog entry is not required)

Description for reviewers

vitalif avatar Jun 19 '25 11:06 vitalif

:green_circle: 2025-06-19 11:43:06 UTC The validation of the Pull Request description is successful.

github-actions[bot] avatar Jun 19 '25 11:06 github-actions[bot]

:white_circle: 2025-06-19 11:44:14 UTC Pre-commit check linux-x86_64-relwithdebinfo for b0e90c82fa22b9ecff7469d0e4b240c0842822db has started. :white_circle: 2025-06-19 11:46:12 UTC Artifacts will be uploaded here :white_circle: 2025-06-19 11:50:26 UTC ya make is running... :yellow_circle: 2025-06-19 13:20:14 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
38688 35982 0 5 2664 37

:white_circle: 2025-06-19 13:23:37 UTC ya make is running... (failed tests rerun, try 2) :yellow_circle: 2025-06-19 13:37:58 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
521 (only retried tests) 483 0 1 3 34

:white_circle: 2025-06-19 13:38:08 UTC ya make is running... (failed tests rerun, try 3) :red_circle: 2025-06-19 13:50:47 UTC Some tests failed, follow the links below.

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
321 (only retried tests) 288 0 1 1 31

:green_circle: 2025-06-19 13:50:55 UTC Build successful. :yellow_circle: 2025-06-19 13:51:19 UTC ydbd size 2.2 GiB changed* by +309.8 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 3559a1f6ca846e99682e83d699ef89748d7b8b74 merge: b0e90c82fa22b9ecff7469d0e4b240c0842822db diff diff %
ydbd size 2 375 957 328 Bytes 2 376 274 552 Bytes +309.8 KiB +0.013%
ydbd stripped size 497 918 472 Bytes 497 968 968 Bytes +49.3 KiB +0.010%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions[bot] avatar Jun 19 '25 11:06 github-actions[bot]

:white_circle: 2025-06-19 11:44:17 UTC Pre-commit check linux-x86_64-release-asan for b0e90c82fa22b9ecff7469d0e4b240c0842822db has started. :white_circle: 2025-06-19 11:47:18 UTC Artifacts will be uploaded here :white_circle: 2025-06-19 11:52:06 UTC ya make is running... :yellow_circle: 2025-06-19 13:59:09 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
16292 15947 0 127 190 28

:white_circle: 2025-06-19 14:00:33 UTC ya make is running... (failed tests rerun, try 2) :yellow_circle: 2025-06-19 14:39:22 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1848 (only retried tests) 1571 0 68 183 26

:white_circle: 2025-06-19 14:39:42 UTC ya make is running... (failed tests rerun, try 3) :yellow_circle: 2025-06-19 15:13:38 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1602 (only retried tests) 1336 0 63 174 29

:green_circle: 2025-06-19 15:13:55 UTC Build successful. :yellow_circle: 2025-06-19 15:14:32 UTC ydbd size 3.9 GiB changed* by +514.0 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 3559a1f6ca846e99682e83d699ef89748d7b8b74 merge: b0e90c82fa22b9ecff7469d0e4b240c0842822db diff diff %
ydbd size 4 179 408 072 Bytes 4 179 934 400 Bytes +514.0 KiB +0.013%
ydbd stripped size 1 448 934 136 Bytes 1 449 086 584 Bytes +148.9 KiB +0.011%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions[bot] avatar Jun 19 '25 11:06 github-actions[bot]

:white_circle: 2025-06-19 16:25:15 UTC Pre-commit check linux-x86_64-release-asan for 82213f6553a36a5b25cc8115f5854607728a3e29 has started. :white_circle: 2025-06-19 16:25:27 UTC Artifacts will be uploaded here :white_circle: 2025-06-19 16:29:23 UTC ya make is running...

github-actions[bot] avatar Jun 19 '25 16:06 github-actions[bot]

:white_circle: 2025-06-19 16:25:15 UTC Pre-commit check linux-x86_64-relwithdebinfo for 82213f6553a36a5b25cc8115f5854607728a3e29 has started. :white_circle: 2025-06-19 16:25:27 UTC Artifacts will be uploaded here :white_circle: 2025-06-19 16:29:22 UTC ya make is running...

github-actions[bot] avatar Jun 19 '25 16:06 github-actions[bot]

:white_circle: 2025-06-19 16:32:31 UTC Pre-commit check linux-x86_64-relwithdebinfo for 1daf3eba01956dfa089b717ad82df0aeb9d920f2 has started. :white_circle: 2025-06-19 16:32:43 UTC Artifacts will be uploaded here :white_circle: 2025-06-19 16:36:49 UTC ya make is running... :yellow_circle: 2025-06-19 18:26:54 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
38697 35984 0 10 2664 39

:white_circle: 2025-06-19 18:30:25 UTC ya make is running... (failed tests rerun, try 2) :yellow_circle: 2025-06-19 18:43:51 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1508 (only retried tests) 1451 0 1 22 34

:white_circle: 2025-06-19 18:44:08 UTC ya make is running... (failed tests rerun, try 3) :red_circle: 2025-06-19 18:55:38 UTC Some tests failed, follow the links below.

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
467 (only retried tests) 436 0 1 0 30

:green_circle: 2025-06-19 18:55:46 UTC Build successful. :yellow_circle: 2025-06-19 18:56:10 UTC ydbd size 2.2 GiB changed* by +308.7 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 911050b753e7fb532f13324f65d2a167e8045e37 merge: 1daf3eba01956dfa089b717ad82df0aeb9d920f2 diff diff %
ydbd size 2 376 264 336 Bytes 2 376 580 472 Bytes +308.7 KiB +0.013%
ydbd stripped size 497 966 248 Bytes 498 016 648 Bytes +49.2 KiB +0.010%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions[bot] avatar Jun 19 '25 16:06 github-actions[bot]

:white_circle: 2025-06-20 11:55:39 UTC Pre-commit check linux-x86_64-relwithdebinfo for ac13b260003a4439db1067cc94255809c4c28bb7 has started. :white_circle: 2025-06-20 11:55:51 UTC Artifacts will be uploaded here :white_circle: 2025-06-20 11:59:49 UTC ya make is running... :yellow_circle: 2025-06-20 14:00:22 UTC Some tests failed, follow the links below. Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
38700 35994 0 2 2666 38

:white_circle: 2025-06-20 14:03:57 UTC ya make is running... (failed tests rerun, try 2) :green_circle: 2025-06-20 14:15:54 UTC Tests successful.

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
404 (only retried tests) 372 0 0 4 28

:green_circle: 2025-06-20 14:16:05 UTC Build successful. :yellow_circle: 2025-06-20 14:16:27 UTC ydbd size 2.2 GiB changed* by +308.8 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: ce274a43eeccc20b30ffb022268320258c813209 merge: ac13b260003a4439db1067cc94255809c4c28bb7 diff diff %
ydbd size 2 376 655 024 Bytes 2 376 971 224 Bytes +308.8 KiB +0.013%
ydbd stripped size 498 045 256 Bytes 498 095 720 Bytes +49.3 KiB +0.010%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions[bot] avatar Jun 20 '25 11:06 github-actions[bot]

:white_circle: 2025-06-20 11:55:41 UTC Pre-commit check linux-x86_64-release-asan for ac13b260003a4439db1067cc94255809c4c28bb7 has started. :white_circle: 2025-06-20 11:55:52 UTC Artifacts will be uploaded here :white_circle: 2025-06-20 11:59:50 UTC ya make is running... :yellow_circle: 2025-06-20 14:34:13 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
16302 15936 0 138 198 30

:white_circle: 2025-06-20 14:35:44 UTC ya make is running... (failed tests rerun, try 2) :yellow_circle: 2025-06-20 15:16:57 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
2217 (only retried tests) 1940 0 72 180 25

:white_circle: 2025-06-20 15:17:20 UTC ya make is running... (failed tests rerun, try 3) :yellow_circle: 2025-06-20 15:50:43 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS PASSED ERRORS FAILED SKIPPED MUTED?
1610 (only retried tests) 1347 0 68 172 23

:green_circle: 2025-06-20 15:51:00 UTC Build successful. :yellow_circle: 2025-06-20 15:51:34 UTC ydbd size 3.9 GiB changed* by +1000.3 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash main: 614aa01a05784522b0fd706b73301924765690f2 merge: ac13b260003a4439db1067cc94255809c4c28bb7 diff diff %
ydbd size 4 180 197 304 Bytes 4 181 221 600 Bytes +1000.3 KiB +0.025%
ydbd stripped size 1 449 180 920 Bytes 1 449 471 832 Bytes +284.1 KiB +0.020%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions[bot] avatar Jun 20 '25 11:06 github-actions[bot]

The vector index tests in kqp_scheme_ut failed because the vector lengths in the indexed column did not match the index parameters. The tests built a vector index on the Name field, with names like Anna, Joshua, etc., while using these parameters: type float, vector length 1024.

At first I started fixing TSampleKMeansScan by making it skip vectors of incorrect size, but I did not want to mix that change into this PR. So I just fixed the test data and the index parameters: all "names" are now 4 bytes long, and the vector parameters are type uint8 with vector length 3. That is exactly 4 bytes; the 1 extra byte is the type byte.
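The size arithmetic above can be sketched as follows. This is a minimal illustration, assuming a serialized vector is `dimension` elements plus one type byte; `ElemType`, `ElemSize`, `ExpectedSerializedSize` and `HasExpectedSize` are hypothetical names, not YDB identifiers, and the position of the type byte inside the string is deliberately not modeled.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical element types for illustration; not YDB's actual enum.
enum class ElemType { Uint8, Float };

// Size of one vector element in bytes.
constexpr std::size_t ElemSize(ElemType type) {
    return type == ElemType::Float ? sizeof(float) : sizeof(unsigned char);
}

// Assumption taken from the comment above: payload bytes plus 1 type byte.
constexpr std::size_t ExpectedSerializedSize(ElemType type, std::size_t dimension) {
    return dimension * ElemSize(type) + 1;
}

// True if the stored string has exactly the length the index parameters imply.
constexpr bool HasExpectedSize(std::string_view value, ElemType type, std::size_t dimension) {
    return value.size() == ExpectedSerializedSize(type, dimension);
}
```

Under these assumptions a 4-byte name such as Anna fits uint8 with length 3, while no short name can fit float with length 1024, which would need 1024 * 4 + 1 bytes.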

By the way, situations like this could previously lead to an overflow. ReshuffleKMeansScan used to accept clusters with strings of any length, and FindCluster then searched for the cluster using cluster.data() without bounds checking. Most likely, a read past the end of the array was possible.
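The kind of guard that avoids such an out-of-bounds read might look like the sketch below. It is not the actual FindCluster code: `FindClosestCluster` is a hypothetical function, it assumes uint8 elements with one extra type byte, and it uses squared Euclidean distance purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical sketch: returns the index of the nearest cluster, or nullopt
// if the row (or every cluster) has an unexpected serialized size.
inline std::optional<std::size_t> FindClosestCluster(
    const std::vector<std::string>& clusters,
    const std::string& row,
    std::size_t dimension)
{
    // Assumption: `dimension` uint8 elements plus 1 type byte.
    const std::size_t expected = dimension + 1;
    if (row.size() != expected) {
        return std::nullopt;  // reject malformed input instead of reading it
    }
    std::optional<std::size_t> best;
    std::uint64_t bestDist = 0;
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        const std::string& cluster = clusters[i];
        if (cluster.size() != expected) {
            continue;  // size check happens before any element access
        }
        std::uint64_t dist = 0;
        for (std::size_t j = 0; j < dimension; ++j) {
            const int d = static_cast<int>(static_cast<unsigned char>(cluster[j])) -
                          static_cast<int>(static_cast<unsigned char>(row[j]));
            dist += static_cast<std::uint64_t>(d * d);
        }
        if (!best || dist < bestDist) {
            best = i;
            bestDist = dist;
        }
    }
    return best;
}
```

The point of the sketch is only the order of operations: every length is compared with the expected size before any byte of the buffer is read.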

Besides, I actually think it is wrong to allow building a vector index on invalid data at all. Previously this was freely allowed, and the resulting behavior was undefined: in later searches such data could either not be found at all, or all fall into a single cluster and be found after all.

But this needs to be discussed and fixed separately. I already created a ticket for it, since I ran into this earlier during testing: https://github.com/ydb-platform/ydb/issues/18667

vitalif avatar Jun 20 '25 12:06 vitalif

Besides, I actually think it is wrong to allow building a vector index on invalid data at all.

And what if the index was built on an empty table, and then the wrong data gets inserted? :)

kungasc avatar Jun 20 '25 12:06 kungasc

Besides, I actually think it is wrong to allow building a vector index on invalid data at all.

And what if the index was built on an empty table, and then the wrong data gets inserted? :)

Well, ideally the column should be typed in the first place and not allow inserting "the wrong thing" :)

vitalif avatar Jun 20 '25 12:06 vitalif