valkey
valkey copied to clipboard
[NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats
The problem/use-case that the feature addresses
When using LUA to reply with custom error messages, the prefix of the messages be included in the INFO ERRORSTATS section as a unique entry. This can spam the INFO ALL command or any INFO command variant that includes the ERRORSTATS section.
https://github.com/redis/redis/pull/13141 address this spamming of the INFO command's ERRORSTATS section and the memory usage of the errors RAX. However, it has the following drawbacks:
- To re-enable error stats tracking, we need to use
CONFIG RESETSTATwhich will clear command stat counters and error stat counters. So, we will lose visibility in both of these. - When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.
127.0.0.1:6379> EVAL "return server.error_reply('3692896699 a');" 0
(error) 3692896699 a
127.0.0.1:6379> EVAL "return server.error_reply('4164681858 b');" 0
(error) 4164681858 b
127.0.0.1:6379> EVAL "return server.error_reply('5558209445 c');" 0
(error) 5558209445 c
127.0.0.1:6379> EVAL "return server.error_reply('8877681120 d');" 0
(error) 8877681120 d
127.0.0.1:6379> info errorstats
# Errorstats
errorstat_3692896699:count=1
errorstat_4164681858:count=1
errorstat_5558209445:count=1
errorstat_8877681120:count=1
Description of the feature
The proposal in this issue is to prevent tracking new error messages from LUA if the number of LUA error messages is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA and if a non-LUA error (e.g. MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.
This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore functionality - as normal error messages continue to be tracked.
Based on feedback and thoughts from this discussion, I can propose a PR to support this feature.
Alternatives you've considered
NA
does this https://github.com/redis/redis/pull/13141 can somehow solve your problem?
does this https://github.com/redis/redis/pull/13141 can somehow solve your problem?
I really don't understand why Oran marked that as not a breaking change, it's possible someone had more than 128 errors. I feel like he was just trying to merge stuff before the license change.
I think this should solve the issue @KarthikSubbarao brought up though.
I really don't understand why Oran marked that as not a breaking change, it's possible someone had more than 128 errors.
yes, that is indeed a potential breaking change. I have thought about add a new configuration item and defaulting to infinity to avoid the breaking change, but Oran does't seem to mention it and probable doesn't want to add new a configuration item for this.
I feel like he was just trying to merge stuff before the license change.
Looking at it now, it may indeed be.
does this https://github.com/redis/redis/pull/13141 can somehow solve your problem?
@enjoy-binbin - (Just read the top comment) Yes, this will solve the issue of spamming INFO ERRORSTATS and reduce additional data stored on error stats.
For completeness - with https://github.com/redis/redis/pull/13141, ERRORSTATS will be disabled when misuse is detected. This means non LUA errors (which do not spam) will also be disabled until we use CONFIG RESETSTAT.
Whereas, the idea proposed in this issue was to have a config to disable custom LUA error messages while continuing to allow regular Error stats to function normally.
But the main issue we wanted to address is misuse (spamming), so this should work. Thank you
Whereas, the idea proposed in this issue was to have a config to disable custom LUA error messages while continuing to allow regular Error stats to function normally.
If we think this is useful, we can consider making this improvement - by adding onto the existing logic
Resetting all the error count will also lose visibility into other errors like MOVED etc that can help with debugging issues in the cluster.
Is it possible to record all custom errors from lua in a separate RAX (based on scriptIsRunning() and if the error originated from luaRedisErrorReplyCommand) and show the custom errors only when using INFO ERRORSTATS explicitly?
Resetting all the error count will also lose visibility into other errors like MOVED etc that can help with debugging issues in the cluster.
Yes, looks like with the current implementation (from https://github.com/redis/redis/pull/13141):
- To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
- When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.
These issues aside, the PR addresses the concerns we have perfectly.
Is it possible to record all custom errors from lua in a separate RAX (based on scriptIsRunning() and if the error originated from luaRedisErrorReplyCommand) and show the custom errors only when using INFO ERRORSTATS explicitly?
The proposal in this issue is to prevent tracking new error messages from LUA if the number of LUA error messages is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA and if a non-LUA error (MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.
This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore functionality - as normal error messages continue to be tracked.
I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning, since the script could generate normal error messages. I can't think of anything off the top of my head to tell if it was a user generated error message or not.
@KarthikSubbarao Thank you for driving this! I am reading through the PR and will provide technical comments there (if I have any) but wanted to ask here: Why is that so different than the modules case? I think this will provide solution for LUA, but modules can cause the same kind of missuse right? I feel this error stats mechanism lacks some better structure than just be based on string parsing. Maybe we should consider improving that in the future?
@ranshid Thank you. Modules are similar in that they can result in custom error messages - but I would place Modules closer to the Valkey server in that it is the responsibility of those running the server to ensure that the modules they load do not lead to spamming the ERRORSTATS section. EVAL / EVALSHA commands, however, are completely customer driven. This is why the Issue here started off focusing only on LUA.
But (if required) we can indeed add support for preventing misuse by Modules by passing a flag in the VM_ReplyWithError and VM_ReplyWithErrorFormat APIs. I can update the PR if so
I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of
scriptIsRunning, since the script could generate normal error messages. I can't think of anything off the top of my head to tell if it was a user generated error message or not.
I guess that depends on the definition of "lua errors". If they are interpreted as "custom errors" then agreed but do we see value in interpreting them as the origin of the errors? Meaning any errors, both custom and normal, generated by any scripts go to its own RAX. If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*?). Since this is already a breaking change, I think it would be worthwhile expanding the discussion a bit.
If we go down this path, we would have to create a dedicated errorstat section just for scripts
( errorstat_script_*
I don't like this approach. I don't think we should have a special section just for scripts, since we try to treat the scripts as more or less normal clients. It also makes this definitely a break changing for users that are using this feature, as opposed to just a breaking change for users that are misusing the feature.
It also makes this definitely a break changing for users that are using this feature, as opposed to just a breaking change for users that are misusing the feature.
Good point.
I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning
Looks like addReplyErrorFormatEx is what is used to record the custom error in the COB?
https://github.com/valkey-io/valkey/blob/unstable/src/script_lua.c#L620.
We could pass a flag to it to differentiate custom lua errors?
We could pass a flag to it to differentiate custom lua errors?
Yes. They could be emulating a normal error though.
We could pass a flag to it to differentiate custom lua errors?
Yes. They could be emulating a normal error though.
I wouldn't consider these "emulated" errors "normal" anymore ... I think it is fine to lump them all into "custom" errors, as far as storage is concerned so we don't lose other more important errors.
Yes, looks like with the current implementation (from https://github.com/redis/redis/pull/13141):
- To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
- When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.
Yes, I didn't realize that config resetstat is an admin command and may not be easy to call. And, i thought about continuing to track existing errors when the limit was reached, but in the implementation at the time, we couldn't add new error codes once the number was reached, so the code does not make a big different.
I think the difficulty here is how many error codes we can allow the script to customize. The scenes I have seen so far are all misuse, maybe there is a case that user do need to use a lot error code.
One way is that we set ERROR_STATS_NUMBER as a configuration item?
btw, what PR #500 does is that when it encounters a miuse, it ensures that other normal errors can be counted instead of being disabled. I guess i am ok with that, if there is no miuse, everyone will be happy, if there is a miuse, at least you will want to check the errorstats. People still need to call CONFIG RESETSTATS to clear the miuse errorstats.
I think the difficulty here is how many error codes we can allow the script to customize. The scenes I have seen so far are all misuse, maybe there is a case that user do need to use a lot error code.
I've only seen on the order of like 10-15 custom error codes, I think you're right that this is likely only a misuse. I suppose my concern is that I would like the "normal" errors to continue working for debug purposes. I'm kind of okay with the tradeoff that #500 proposes, we continue to report normal errors but we stop reporting LUA errors.
@enjoy-binbin Just to understand your PoV, are you okay with #500, I'll try to followup and try to get it ready to merge this week if that is the case.
I'm kind of okay with the tradeoff that https://github.com/valkey-io/valkey/pull/500 proposes, we continue to report normal errors but we stop reporting LUA errors.
yean, i am ok with that.