Missing kafka protocol traces
While working with @srimaln91, he mentioned that a Randoli Java service has missing kafka protocol traces. We enabled more verbose logging via --stirling_conn_trace_pid and --vmodule=stitcher=1,socket_trace_connector=3, but it didn't narrow down the issue any further. Further investigation needs to be conducted to track down the source of this problem.
App information (please complete the following information):
- Pixie version: v0.14.12
- K8s cluster version: Unknown
- Node Kernel version: Unknown
- Browser version: Unknown
I was able to track down the problem. The service in question uses v10 of kafka's produce API key. "API key" is kafka's terminology for a request/response type, and each API key is versioned independently. I verified this by capturing its kafka traffic with tcpdump as seen below:
Since kafka's wire protocol is versioned, Pixie's protocol parser specifies which versions of each API key are supported. For the produce API key, Pixie supports up to v9 while the latest version in kafka's spec is now v12.
This means that Pixie will ignore any produce requests > v9 since it's possible that Pixie may not parse them correctly. Looking at the kafka documentation linked above, the changes from v10 to v12 only add a new error code (TRANSACTION_ABORTABLE). I believe this means there aren't any concerns with supporting the latest API key version, and I've verified that updating Pixie's max version solves @srimaln91's issue (see the screenshot below).
It's been a while since our kafka api versions have been upgraded, so we should review all of the supported api keys and make sure they are all updated to their latest versions. The second number in these tuples corresponds to the "max version" and needs to be bumped to the most recent, safe version. For the producer case, this would mean the following change:
diff --git a/src/stirling/source_connectors/socket_tracer/protocols/kafka/common/types.h b/src/stirling/source_connectors/socket_tracer/protocols/kafka/common/types.h
index b7ade9ad7..f198efadd 100644
--- a/src/stirling/source_connectors/socket_tracer/protocols/kafka/common/types.h
+++ b/src/stirling/source_connectors/socket_tracer/protocols/kafka/common/types.h
@@ -229,7 +229,7 @@ struct APIVersionData {
// TODO(chengruizhe): Needs updating for new opcodes.
inline const absl::flat_hash_map<APIKey, APIVersionData> APIVersionMap = {
// Setting min supported version to 1 to help finding frame boundary.
- {APIKey::kProduce, {1, 9, 9}},
+ {APIKey::kProduce, {1, 12, 9}},
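To illustrate why the old max version drops these traces, here is a minimal, self-contained sketch of the range check. The `VersionRange`, `VersionMap` and `IsSupported` names are hypothetical simplifications for this issue, not Pixie's actual types (the real `APIVersionData` has a third field that is omitted here).

```cpp
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-ins for Pixie's APIKey and APIVersionData.
enum class APIKey : int16_t { kProduce = 0 };

struct VersionRange {
  int16_t min_version;
  int16_t max_version;
};

using VersionMap = std::map<APIKey, VersionRange>;

// Same shape as the range check in IsSupportedAPIVersion: unknown keys and
// versions outside [min_version, max_version] are rejected.
bool IsSupported(const VersionMap& versions, APIKey key, int16_t version) {
  auto it = versions.find(key);
  if (it == versions.end()) {
    return false;
  }
  return version >= it->second.min_version && version <= it->second.max_version;
}
```

With a max version of 9, a v10 produce request fails this check and is ignored; with a max version of 12, the same request passes.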
In addition to supporting the latest API key versions, we should add logging to these locations to make it easier to identify this problem. Debugging this situation took me longer than expected since it required a code change to identify. If there had been an existing VLOG statement, this could have been identified quickly, as one of the earliest investigation steps is to enable more verbose logging.
Improving our logging with something like the following should be done alongside upgrading our API key versions:
diff --git a/src/stirling/source_connectors/socket_tracer/protocols/kafka/common/types.h b/src/stirling/source_connectors/socket_tracer/protocols/kafka/common/types.h
index b7ade9ad7..244787d36 100644
--- a/src/stirling/source_connectors/socket_tracer/protocols/kafka/common/types.h
+++ b/src/stirling/source_connectors/socket_tracer/protocols/kafka/common/types.h
@@ -308,7 +308,11 @@ inline bool IsValidAPIKey(int16_t api_key) {
inline bool IsSupportedAPIVersion(APIKey api_key, int16_t api_version) {
auto it = APIVersionMap.find(api_key);
+ VLOG(1) << absl::Substitute("Checking if api_key $0 with version $1 is supported", api_key,
+ api_version);
if (it != APIVersionMap.end()) {
+ VLOG(1) << absl::Substitute("Supported version range: [$0, $1]", it->second.kMinVersion,
+ it->second.kMaxVersion);
return api_version >= it->second.kMinVersion && api_version <= it->second.kMaxVersion;
}
return false;
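For anyone cross-checking the new VLOG output against a tcpdump capture, the API key and version can be read directly from the start of a kafka request: a 4-byte length, then a 2-byte api_key, a 2-byte api_version and a 4-byte correlation_id, all big-endian. The sketch below is illustrative only. `RequestPrefix` and `DecodeRequestPrefix` are hypothetical names, not Pixie functions, and it assumes the buffer begins exactly at a request boundary, which is not guaranteed for an arbitrary TCP segment.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical struct holding the fixed-size start of a kafka request.
struct RequestPrefix {
  int32_t length;
  int16_t api_key;
  int16_t api_version;
  int32_t correlation_id;
};

// Reads an n-byte (n <= 4) big-endian unsigned integer starting at pos.
uint32_t ReadBigEndian(const std::vector<uint8_t>& buf, size_t pos, size_t n) {
  uint32_t value = 0;
  for (size_t i = 0; i < n; ++i) {
    value = (value << 8) | buf[pos + i];
  }
  return value;
}

// Returns std::nullopt if the buffer is too short to hold the 12-byte prefix.
std::optional<RequestPrefix> DecodeRequestPrefix(const std::vector<uint8_t>& buf) {
  constexpr size_t kPrefixSize = 12;
  if (buf.size() < kPrefixSize) {
    return std::nullopt;
  }
  RequestPrefix prefix;
  prefix.length = static_cast<int32_t>(ReadBigEndian(buf, 0, 4));
  prefix.api_key = static_cast<int16_t>(ReadBigEndian(buf, 4, 2));
  prefix.api_version = static_cast<int16_t>(ReadBigEndian(buf, 6, 2));
  prefix.correlation_id = static_cast<int32_t>(ReadBigEndian(buf, 8, 4));
  return prefix;
}
```

A produce request (api_key 0) at v10 would therefore show the bytes `00 00 00 0a` immediately after the 4-byte length.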