valkey Limit custom LUA error stats while allowing non LUA error stats to function normally

Implementing the change proposed here: https://github.com/valkey-io/valkey/issues/487

In this PR, we prevent tracking new error messages from LUA if the number of error messages (in the errors RAX) is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA_ERRORSTATS_DISABLED and if a non-LUA error (e.g. MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.

This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore error stats functionality because normal error messages continue to be tracked.

Example:

# Errorstats
.
.
.
errorstat_127:count=2
errorstat_128:count=2
errorstat_ERR:count=1
errorstat_LUA_ERRORSTATS_DISABLED:count=2

May 15 '24 08:05 KarthikSubbarao

(Force pushed because this was recommended by the bot here since I did not include the commit sign off originally. Also because it is not reviewed yet)

May 15 '24 09:05 KarthikSubbarao

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 70.25%. Comparing base (752b6ee) to head (08e97ba). Report is 19 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #500      +/-   ##
============================================
+ Coverage     70.20%   70.25%   +0.04%     
============================================
  Files           111      112       +1     
  Lines         60242    60587     +345     
============================================
+ Hits          42295    42567     +272     
- Misses        17947    18020      +73

Files	Coverage Δ
src/networking.c	`88.75% <100.00%> (+3.38%)`	:arrow_up:
src/script_lua.c	`90.15% <100.00%> (+0.01%)`	:arrow_up:
src/server.c	`88.47% <ø> (-0.10%)`	:arrow_down:
src/server.h	`100.00% <ø> (ø)`

... and 33 files with indirect coverage changes

May 18 '24 13:05 codecov[bot]

@KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce degradation. For example lets include function and redis.call cases ARE getting into errorStats.

May 19 '24 10:05 ranshid

@KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce degradation. For example lets include function and redis.call cases ARE getting into errorStats.

@ranshid - When functions are used and contain LUA code with server.error_reply with custom error messages, they will still be caught by this new logic when we are past the 128 limit. If the error is not from LUA (e.g. syntax error in server.call), it will continue to be tracked

Did you mean add test cases where functions (with lua) are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW when the errors RAX is over the limit?

Example:

for ((i=1; i<=128; i++))                                                            
do
    ./valkey-cli EVAL "return server.error_reply('$i a');" 0
done

% cat customerr.lua 
#!lua name=mylib
redis.register_function(
  'custom_error',
  function() return server.error_reply('customerror 0') end
)

cat customerr.lua | ./valkey-cli -x FUNCTION LOAD REPLACE

% ./valkey-cli                                                                        
127.0.0.1:6379> fcall custom_error 0
(error) customerror 0
127.0.0.1:6379> info errorstats
errorstat_1:count=1
.
.
.
errorstat_128:count=1
errorstat_LUA_ERRORSTATS_OVERFLOW:count=1

May 20 '24 00:05 KarthikSubbarao

@KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce degradation. For example lets include function and redis.call cases ARE getting into errorStats.

@ranshid - When functions are used and contain LUA code with server.error_reply with custom error messages, they will still be caught by this new logic when we are past the 128 limit. If the error is not from LUA (e.g. syntax error in server.call), it will continue to be tracked

Did you mean add test cases where functions (with lua) are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW when the errors RAX is over the limit?

Example:
for ((i=1; i<=128; i++))                                                            
do
    ./valkey-cli EVAL "return server.error_reply('$i a');" 0
done
% cat customerr.lua 
#!lua name=mylib
redis.register_function(
  'custom_error',
  function() return server.error_reply('customerror 0') end
)
cat customerr.lua | ./valkey-cli -x FUNCTION LOAD REPLACE 
% ./valkey-cli                                                                        
127.0.0.1:6379> fcall custom_error 0
(error) customerror 0
127.0.0.1:6379> info errorstats
errorstat_1:count=1
.
.
.
errorstat_128:count=1
errorstat_LUA_ERRORSTATS_OVERFLOW:count=1

@KarthikSubbarao I think this is somewhat problematic. Functions are more like modules IMO and I think we should allow function errors to overflow. I think maybe we can flag the client (or check if we are in the context of eval/evalsha) in order to enforce the overflow?

May 20 '24 05:05 ranshid

@KarthikSubbarao I think this is somewhat problematic. Functions are more like modules IMO and I think we should allow function errors to overflow. I think maybe we can flag the client (or check if we are in the context of eval/evalsha) in order to enforce the overflow?

Functions are still using LUA and are using the same APIs such as server.error_reply to reply with custom errors. Because of this, we can still get into the spamming error section output with functions - as we do with EVAL/EVALSHA.

To handle this, when a server exceeds the limit, when functions are used with additional custom errors, they are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW.

IMO, this behavior makes sense - but I am also curious to hear from others

May 20 '24 15:05 KarthikSubbarao

Sorry for the force push. I did not include the sign off on the previous commit and the DCO check required this

May 20 '24 21:05 KarthikSubbarao

@valkey-io/core-team Since @PingXie mentioned in the other thread about this, I thought it would be good to try to get consensus on this change. The tl;dr is that in Redis it was possible to spam the error_stat log with custom errors generated from lua scripts, because you can return arbitrary errors from lua scripts. @enjoy-binbin implemented a work around that would clear the errostat radix tree if it's size got above a certain threshold. In order for a user to resume seeing errors, they would need to to call config resetstat. However, this is an admin command, and restricted on many systems, so a user might not be able to get out of this state easily. So instead this change stops reporting errors from lua scripts above a certain threshold.

Ping suggested:

I guess that depends on the definition of "lua errors". If they are interpreted as "custom errors" then agreed but do we see value in interpreting them as the origin of the errors? Meaning any errors, both custom and normal, generated by any scripts go to its own RAX. If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*?). Since this is already a breaking change, I think it would be worthwhile expanding the discussion a bit.

I think this is important to get right for 8, so let's settle this here if possible.

Jun 25 '24 02:06 madolson

@valkey-io/core-team Since @PingXie mentioned in the other thread about this, I thought it would be good to try to get consensus on this change. The tl;dr is that in Redis it was possible to spam the error_stat log with custom errors generated from lua scripts, because you can return arbitrary errors from lua scripts. @enjoy-binbin implemented a work around that would clear the errostat radix tree if it's size got above a certain threshold. In order for a user to resume seeing errors, they would need to to call config resetstat. However, this is an admin command, and restricted on many systems, so a user might not be able to get out of this state easily. So instead this change stops reporting errors from lua scripts above a certain threshold.

Ping suggested:
I guess that depends on the definition of "lua errors". If they are interpreted as "custom errors" then agreed but do we see value in interpreting them as the origin of the errors? Meaning any errors, both custom and normal, generated by any scripts go to its own RAX. If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*?). Since this is already a breaking change, I think it would be worthwhile expanding the discussion a bit.
I think this is important to get right for 8, so let's settle this here if possible.

I got convinced by you on errorstat_script_* being a more breaking change :-).

Jun 25 '24 05:06 PingXie

From our meeting this morning, we are directionally approved to update the behavior here but we'll decide locally in the PR for whatever makes sense.

Jul 01 '24 21:07 madolson

Sorry for the late consensus binding, binbin mentioned in the other thread that he was good with this approach and I just had a small comment ontop of Viktors. Once they are addressed this should be good to merge.

Jul 09 '24 01:07 madolson

This run might have failed on a flaky test: https://github.com/valkey-io/valkey/actions/runs/9868728468/job/27251202927?pr=500

I re-ran it locally multiple times and the tests all passed

Jul 10 '24 06:07 KarthikSubbarao