
HC add alert on cyan storage groups

Open StekPerepolnen opened this issue 1 year ago • 2 comments

Pool-level check: if all groups in the pool are cyan, raise the check (ORANGE).

Pool-level check: if there is at least one yellow group, raise the check (YELLOW).

StekPerepolnen avatar Oct 02 '24 12:10 StekPerepolnen

va-kuznecov and stunder agreed on something like this (the exact numbers may vary):

  • if groups_with_cyan_flag_count > total_groups_in_pool * 0.5: YELLOW
  • if groups_with_cyan_flag_count > total_groups_in_pool * 0.9: ORANGE
  • if groups_with_cyan_flag_count == total_groups_in_pool: RED
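
To make the precedence explicit, here is a minimal Python sketch of these thresholds. The function name and the GREEN fallback are assumptions for illustration, not part of the agreed rule; the most severe matching condition wins, and integer arithmetic is used to avoid floating-point edge cases at the 50% and 90% boundaries.

```python
def pool_status_by_cyan_share(groups_with_cyan_flag_count: int,
                              total_groups_in_pool: int) -> str:
    """Map the share of cyan groups in a pool to a status (sketch only)."""
    if total_groups_in_pool <= 0:
        # Assumption: an empty pool reports no issue.
        return "GREEN"
    if groups_with_cyan_flag_count == total_groups_in_pool:
        return "RED"
    # count > total * 0.9, written with integers
    if groups_with_cyan_flag_count * 10 > total_groups_in_pool * 9:
        return "ORANGE"
    # count > total * 0.5, written with integers
    if groups_with_cyan_flag_count * 2 > total_groups_in_pool:
        return "YELLOW"
    # Assumption: no alert below the lowest threshold.
    return "GREEN"
```

For example, with 20 groups in a pool, 11 cyan groups would give YELLOW, 19 would give ORANGE, and 20 would give RED.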

StekPerepolnen avatar Oct 21 '24 08:10 StekPerepolnen

Basically, there is a parameter in the config that determines CYAN: https://github.com/ydb-platform/ydb/blob/main/ydb/core/protos/blobstorage_pdisk_config.proto#L93

It can be used in the health check to decide whether there is an issue with disk space on the pdisk or with the groups and pools associated with it.

  • It wouldn’t be very precise because CYAN is defined a bit differently.
  • However, high precision isn't crucial since the system's performance doesn’t depend on a parameter like 50% (90%) cyan groups.
  • In the health check, it would be reasonable to report that there is less than 13% (by default) space left on the pdisk, and the user would understand this.
  • For the problematic pool, as intended, we can report that storage expansion is required.

StekPerepolnen avatar Oct 21 '24 08:10 StekPerepolnen

The proposal is as follows:

On the pdisk, continue sending the message "Available size is less than X%" as before, but now X will depend on the ChunkBaseLimit. Example of the message: "YELLOW Available size is less than 12.4%." The user won't know where this number comes from, but from the context, it will be clear that the space is running low.
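
A rough Python sketch of how such a message could be built. It assumes the limit is available as a per-mille value (so a value of 124 would give 12.4%); that unit, the function name, and the parameter name are assumptions for illustration and are not confirmed by this thread.

```python
def pdisk_space_message(limit_per_mille: int) -> str:
    """Build the pdisk message from a limit given in per mille (assumed unit)."""
    percent = limit_per_mille / 10
    # ":g" drops a trailing ".0", so 130 is shown as "13" and 124 as "12.4".
    return f"Available size is less than {percent:g}%"


print(pdisk_space_message(124))
```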

At the vdisk and group levels, the messages will remain the same, but I'll add a counter for groups with space issues – this will be useful for new messages at the pool level.

At the pool level, in addition to the old messages, add new ones: "Storage requires an increase in capacity."

  • YELLOW – number of groups with space issues > 50%
  • ORANGE – number of groups with space issues > 90%
  • RED – number of groups with space issues = 100%

No reason will be provided for the new messages.

StekPerepolnen avatar Oct 22 '24 14:10 StekPerepolnen

We have already run into problems where messages about space are hard to interpret correctly, because the message does not make clear which space is meant or what exactly triggered it. We constantly confuse "the groups ran out of space" with "the quota in the schemeshard ran out". So the wording of messages about running out of space should be chosen very carefully, so that it is completely unambiguous what exactly caused the alert and where.

the-ancient-1 avatar Oct 24 '24 17:10 the-ancient-1

At the level of p-disks and v-disks, there is nothing specific to report — this information does not provide value to the client.

At the level of groups and storage, report new messages depending on the color of groups in sys_view.

At the GROUP level, add new messages:

  • YELLOW - group cyan: Soon, the group will be full, on the verge of allowing writes.
  • YELLOW - group lightyellow: The group is full, on the verge of allowing writes.
  • ORANGE - group yellow: The group is full; writing is not possible.

Group-level messages should not be displayed if the pool is stable.
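
A small Python sketch of the group-level rule, including the suppression when the pool is stable. How "stable" is decided is not defined in this comment, so it is passed in as a plain boolean; the names `GROUP_MESSAGES` and `group_issue` are hypothetical.

```python
from typing import Optional, Tuple

# Group color -> (status, message), as listed for the GROUP level.
GROUP_MESSAGES = {
    "cyan": ("YELLOW",
             "Soon, the group will be full, on the verge of allowing writes."),
    "lightyellow": ("YELLOW",
                    "The group is full, on the verge of allowing writes."),
    "yellow": ("ORANGE",
               "The group is full; writing is not possible."),
}


def group_issue(color: str, pool_is_stable: bool) -> Optional[Tuple[str, str]]:
    """Return (status, message) for a group, or None if nothing is reported."""
    if pool_is_stable:
        # Group-level messages are hidden while the pool is stable.
        return None
    # Colors without a message (for example "green") report nothing.
    return GROUP_MESSAGES.get(color)
```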

At the POOL level, add new messages to existing ones: Storage requires an increase in capacity.

  • YELLOW - number of groups cyan or worse > 90%
  • ORANGE - number of groups lightyellow or worse > 90%
  • RED - number of groups yellow or worse is 100%
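
The "or worse" counting could look roughly like the Python sketch below. Only the colors named in this thread are modelled (plus "green" as the healthy baseline), and the severity order, the function name, and the GREEN fallback are assumptions for illustration.

```python
from typing import Sequence

# Assumed severity order, least to most severe.
SEVERITY = {"green": 0, "cyan": 1, "lightyellow": 2, "yellow": 3}


def pool_status_by_colors(group_colors: Sequence[str]) -> str:
    """Derive the pool status from the colors of its groups (sketch only)."""
    total = len(group_colors)
    if total == 0:
        # Assumption: an empty pool reports no issue.
        return "GREEN"

    def count_at_least(color: str) -> int:
        # Number of groups whose color is `color` or worse.
        threshold = SEVERITY[color]
        return sum(1 for c in group_colors if SEVERITY[c] >= threshold)

    if count_at_least("yellow") == total:
        return "RED"
    # "> 90%" written with integers to avoid floating-point edge cases.
    if count_at_least("lightyellow") * 10 > total * 9:
        return "ORANGE"
    if count_at_least("cyan") * 10 > total * 9:
        return "YELLOW"
    return "GREEN"
```

With ten groups, nine yellow and one cyan would give YELLOW under this reading: not all groups are yellow, only 90% (not more than 90%) are lightyellow or worse, but 100% are cyan or worse.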

StekPerepolnen avatar Oct 30 '24 08:10 StekPerepolnen

https://github.com/ydb-platform/ydb/issues/11531 - bsc issue

StekPerepolnen avatar Nov 12 '24 14:11 StekPerepolnen

rfc https://github.com/ydb-platform/ydb-rfc/blob/main/hc_storage_space.md

StekPerepolnen avatar Nov 18 '24 08:11 StekPerepolnen