valkey
valkey copied to clipboard
Introduce `commandlog heavytraffic` to record big response packet.
Why fat log?
We already have slowlog, which keeps a record of recent queries that take relatively long time to execute. However sometimes it is not enough for debugging.
For example, we often receive questions from users like: "Why is my QPS not high, but the network flow is substantial?" or "Why did the network flow suddenly increase?" The situation is network flow suddenly increases without significant higher QPS.
The most common reason is that users are accessing big keys, like getting a long string, or using commands like keys, scan, hget, zrange to request too many elements at once. This is common among our users.
Here a new log is introduced to keep a record when a query triggers a large response (and we may call this log "fatlog"), in which the command, response size, time stamp, client name and client network address are logged. This can help to monitor traffic flow spikes and big keys. And it is a quick way to pinpoint the problem.
Alternative solution?
- Slowlog: the command with big response packet may not be slow enough to trigger slowlog.
- Big key: we can detect big keys, but we are not sure whether they are accessed at a certain time, and accessing a big key does not necessarily generate large responses.
- Client traffic statistics: we may need to locate the exact commands.
Detail introduction
Just like slowlog, fatlog is not actually logged in a file, and is accessible through a series of commands:
FATLOG GET [count]
FATLOG LEN
FATLOG RESET
127.0.0.1:6379> config set fatlog-log-bigger-than 10
OK
127.0.0.1:6379> set a 1234567890
OK
127.0.0.1:6379> get a
"1234567890"
127.0.0.1:6379> fatlog get
1) 1) (integer) 3
2) (integer) 1713516667
3) (integer) 17
4) 1) "get"
2) "a"
5) "127.0.0.1:62969"
6) ""
also a help command is provided:
FATLOG HELP
And just like slowlog, two config items are provided to set how many log items are preserved and the response size threshold of being logged.
fatlog-max-len (default value: 128)
fatlog-log-bigger-than (default value: 1MB)
About implementation
You will find that a lot changes have been appllied to slowlog.c. That is because the implementation of fatlog is basically reusing the code of slowlog, since they have similar format and are controled by similar commands. So what used to be slowlog.c now contains code of both slowlog and fatlog.
I'm not sure about this specific feature. At least from what I've seen, it's less common for a single command to produce an unexpectedly large result without it also showing up in the slowlog.
At least from what I've seen, it's less common for a single command to produce an unexpectedly large result without it also showing up in the slowlog.
This does happen. Some of our users are very sensitive to network flow spikes, and they couldn't find any clue in slowlog, so thay asked us about it.
Slowlog only records time of command executions ss. But a hash lookup and a memory copy (simple get command, even with a big response packet of 1MB) could be really quick, (compared to the default slowlog threshold, 10ms).
@CharlesChen888 Before I review your PR, could you please provide some background information about why you want to update the slowlogCommand to heavyLoadLogCommand? Because in the top comment, you mention you want to add some new commands for fatlog, but in your pr, you make some changes for slowlog files. I am confused it.
@hwware This is because fatlog shares a lot code of slowlog. So the function used to be called slowlogCommand now contains code of both slowlog command and fatlog command. I tried to think of a name meaning "both slow and fat", "heavy load" seems to cover both situation.
This does happen. Some of our users are very sensitive to network flow spikes, and they couldn't find any clue in slowlog, so they asked us about it.
Can you be more precise than this? People ask us at AWS too about issues, but we've always been able to debug this type of problem without what you proposed. What do you mean they are sensitive to "network flow spikes". We've seen a lot more issues with a sudden burst of smaller commands than this type of "single large" packet issue.
@madolson
We've seen a lot more issues with a sudden burst of smaller commands than this type of "single large" packet issue.
Yes, this is a common case and we've also seen a lot.
But there is another situation in which network flow suddenly increases without significant higher QPS, the most common reason is that users are accessing big keys, like getting a long string, or using commands like keys, scan, hget, zrange to request too many elements at once. This is common among our users. And fatlog is mainly helpful in this situation, it is a quick way to pinpoint the cause.
Codecov Report
Attention: Patch coverage is 97.91667% with 1 line in your changes missing coverage. Please review.
Project coverage is 70.16%. Comparing base (
8faf278) to head (55f1a52).
Additional details and impacted files
@@ Coverage Diff @@
## unstable #336 +/- ##
============================================
+ Coverage 70.12% 70.16% +0.04%
============================================
Files 111 111
Lines 60305 60339 +34
============================================
+ Hits 42289 42338 +49
+ Misses 18016 18001 -15
| Files | Coverage Δ | |
|---|---|---|
| src/commands.def | 100.00% <ø> (ø) |
|
| src/config.c | 78.39% <ø> (ø) |
|
| src/networking.c | 85.51% <100.00%> (+0.05%) |
:arrow_up: |
| src/server.c | 88.61% <100.00%> (+0.03%) |
:arrow_up: |
| src/server.h | 100.00% <ø> (ø) |
|
| src/slowlog.c | 97.89% <96.55%> (+0.42%) |
:arrow_up: |
I think the scenario in the top comment is very clear: "slow" and "fat" are not same. I'd love to explain it further in my understanding.
We have different types of resources like CPU, memory, network, etc. The slowlog can only record commands eat lots of CPU during the processing phase (doesn't include read/write network time), but can not record commands eat too many memory and network. For example:
- run "SET key value(10 megabytes)" command would not be recored in slowlog, since when processing it the SET command only insert the value's pointer into db dict. But that command eats huge memory in query buffer and bandwidth from network. In this case, just 1000 tps can cause 10GB/s network flow.
- run "GET key" command and the key's value length is 10 megabytes. The get command can eat huge memory in output buffer and bandwidth to network.
So, the fatlog can help users to locate these special cases.
BTW, the PR only record the fat output buffer, we should also record the fat input buffer (by argv_len_sum).
BTW, the PR only record the fat output buffer, we should also record the fat input buffer (by argv_len_sum).
+1, although I would imagine unexpected large outputs are a lot more common than unexpected large inputs. There are many surprising ways to see large output, like running KEYS *.
I guess I'm convinced about the value of this feature, however I don't believe that this should be solving the case where there is uniform data with large keys, such as only being able to hit 1k tps with 1mb items. That is trivial to debug by just doing a scan + memory usage + command stat review. That's an average throughput concern. I don't think we need a command for identifying obvious issues that could be root caused with a bit of effort or a little bit of tooling. If there are a few, commands generating most of the traffic.
Another thought, should we also accumulate this information on cmd_stats? That would maybe more useful for seeing changes in command output trends compared to a fat log?
Discussed this in a separate meeting. Everyone is generally aligned about the idea, but we need to refine the name and details.
but we need to refine the name and details.
I will try to make some time for it in this week.
I propose one command name: BigValueLog.
From above comments, this command usually is used to record the big value for one key. It is different from the slowlog, which goal is to record the entry exceeded a specified execution time. Sometimes due to cpu or disk higher usage, one entry could be recorded in slowlog, but not in fatlog (BigValueLog).
Thus, I do not think adding one more argument in slowlog as SLOWLOG GET [count] [TYPE FAT|SLOW] is a good idea. I still prefer to add a new command in Valkey.
Or if we want to extend current slowlog function, I prefer the command format as: SLOWLOG GET [count] [FAT|SLOW|ALL] Then it will cover the case where a key response with slow and fat condition.
From above comments, this command usually is used to record the big value for one key. It is different from the slowlog, which goal is to record the entry exceeded a specified execution time. Sometimes due to cpu or disk higher usage, one entry could be recorded in slowlog, but not in fatlog (BigValueLog).
Yes, this is why I prefer to create a new command, rather than adding new field to SLOWLOG.
I propose one command name: BigValueLog.
Or BigPacketLog? I am fine with all three names.
I go through the PR, most part it looks good to me, But the biggest concern for me is hard to maintenance in the future. Slowlog and Fatlog are 2 different concepts, but in this PR, they share some variables and functions, but name is still only point to slowlog (for example: slowlogCreateEntry, slowlogCommand, slowlogEntry). I worry about it takes longer time to understand the codes several year later.
How do you think @valkey-io/core-team
This new metric makes sense to me on a high level but I am very much not a fan of this new command name. I also wonder if it makes sense to fold this into the slowlog command:
SLOWLOG GETgives us the default behaviorSLOWLOG GET BY TIME==SLOWLOG GETSLOWLOG GET BY SIZE== this PRSLOWLOG GET BY XXXfor future proofing
Thoughts?
Adding type to SLOWLOG is a bit wired to me. One is a time dimension and the other is a spatial dimension, they focus on different areas, so I don't think it's a good approach.
I would not recommend using BigValueLog. Here we are concerned with network traffic, such as an MSET composed of multiple small key-value pairs creating a large input, or an MGET that has many small values making up a large output.
I think FATLOG is the most suitable name. It conveys the meaning accurately and is also concise enough. Moreover, as we can see from the examples provided by @zuiderkwast , it is also applied in other fields and carries a similar meaning to what is intended in this PR.
However, if everyone feels that this name is inappropriate, BigPacketLog would be an acceptable alternative to me.
Slowlog and Fatlog are 2 different concepts, but in this PR, they share some variables and functions, but name is still only point to slowlog
We can just rename the common code to make it look like a common lib, then make slowlog and fatlog to become two interfaces of this lib. (This is actually my original idea.)
I just realized, and maybe this should be noted ... when logging fat input queries, the packet length is c->argv_len_sum, which does not include the length of RESP protocol characters, but when logging fat output reponses, the length used is c->cmd_reply_length, which is the total length of response content and protocol characters.
Adding type to
SLOWLOGis a bit wired to me. One is a time dimension and the other is a spatial dimension, they focus on different areas, so I don't think it's a good approach.I would not recommend using
BigValueLog. Here we are concerned with network traffic, such as anMSETcomposed of multiple small key-value pairs creating a large input, or anMGETthat has many small values making up a large output.I think
FATLOGis the most suitable name. It conveys the meaning accurately and is also concise enough. Moreover, as we can see from the examples provided by @zuiderkwast , it is also applied in other fields and carries a similar meaning to what is intended in this PR.However, if everyone feels that this name is inappropriate,
BigPacketLogwould be an acceptable alternative to me.
Agree with you on not adding type to SLOWLOG. Since Fat abstract concepts are already applied in other fields, I think FatLog is good enough for our Valkey. I vote keeping it.
Slowlog and Fatlog are 2 different concepts, but in this PR, they share some variables and functions, but name is still only point to slowlog
We can just rename the common code to make it look like a common lib, then make slowlog and fatlog to become two interfaces of this lib. (This is actually my original idea.)
Thanks, pls update to a common interface and make the Slowlog and Fatlog implementation separated. I think it is easier to maintain later.
We discussed this in the core team meeting. The only remaining issue to agree about is the name. Some people don't like FAT because it has a negative connotation.
@valkey-io/core-team Please comment your favorite name and some alternatives you can accept. Then we'll see if there is a winner.
Remember: The log is not about the size of keys. It's about the command and response sizes, the client traffic.
Suggestions:
- FATLOG
- LARGELOG
- BIGLOG
- BIGPACKETLOG
- HUGELOG
- THICKLOG
- HEAVYLOG (heavy traffic?)
- ~~HYPERLOGLOG~~
- Other suggestions?
We discussed this in the core team meeting. The only remaining issue to agree about is the name. Some people don't like FAT because it has a negative connotation.
@valkey-io/core-team Please comment you're favorite name and some alternatives you can accept. Then we'll see if there is a winner.
Remember: The log is not about the size of keys. It's about the command and response sizes, the client traffic.
Suggestions:
- FATLOG
- LARGELOG
- BIGLOG
- BIGPACKETLOG
- HUGELOG
- THICKLOG
- HEAVYLOG (heavy traffic?)
- ~HYPERLOGLOG~
- Other suggestions?
I vote No1. LargeLog No2. BigLog No No 3.
+1 on "LARGELOG"
+1 on "LARGELOG"
Adjectives like "big," "huge," or "large" are all acceptable, but we need to clarify what exactly they are describing. Is it a big key, a huge value, or a large command? Otherwise, it can easily be misconstrued. We need to describe clearly what it is recording.
Terms like "bigPacketLog" or "heavyTrafficLog" would be more appropriate, if these terms are too long, then "heavylog" could also work. "Heavy" can be used to describe traffic, no one would say "big traffic," "large traffic," or "huge traffic," right?
I prefer that we use different words for the size of a key and the size of a command/response.
Another area of possible confusion could have been keys on the top-level vs keys in hashes, but here we have the terminology "key" for the top-level and "field" for hash fields. This terminology helps when we talk about these, so we don't have to explain every time what we're talking about.
I think we can reserve the word "big" for keys and use another word for commands and responses. I'm leaning towards "heavy" now, since it's about traffic. Let's consider the terminology "heavy command" and "heavy response". :thinking: .... I like it.
I did a little grep in docs and code and I find that "heavy" is almost exclusively for traffic and IO ("a write-heavy application", "heavy disk I/O"), while "large" is used for many different things ("large databases", "a very large COUNT option", etc.).
+1 on HEAVYLOG
+1 to Largelog.
Alternative being considered is HEAVYTRAFFICLOG.
After a core team meeting, we decided adding a new command COMMANDLOG with subcommands HEAVYTRAFFIC and SLOW, and then slowlog.c can be renamed to a common commandlog.c.