opentelemetry-network
opentelemetry-network copied to clipboard
Enable full DNS names
Description: The limit for DNS hostnames being sent to the reducer increased from 80 characters to the full 256, as according to RFC 1035.
Link to tracking Issue: https://github.com/open-telemetry/opentelemetry-network/issues/256
Testing: automatic tests
Documentation: none
- :x: - login: @kubakukis14 / name: Jakub Ráček . The commit (c1ddc5a6cc6078be3ea89d609c81e7e624c1fd44, 2776a99332a1206019719cbda90313bea7278219, acdf42cbf9830470fc9c647d35ec4644ea313332) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please submit a support request ticket.
Hi @kubakukis14 , awesome you're tackling this! I see the PR is still in draft but thought I'd see if I can help out.
You probably noticed this, but just to document here, that the bpf code copies a fixed amount of bytes from the DNS packets into perf rings, and then the userspace collector parses those bytes. So this change does not need to modify eBPF code.
I don't remember whether this change would require percolating through to and within the reducer. If running this patch locally doesn't seem to output the full length from the reducer, that might be where to look.
I also triggered a build run, if helpful to your efforts
Hey @yonch, I ran into some more problems trying to test the patch out, I was hoping you could help. Assume I use the build from the unchanged public repo for the following (other than the updated java and gradle versions for building).
First is running kernel collector tests. I have build the vagrant box for ubuntu jammy64 and the necessary test targets, and ran the header tests. Those had no problem running. However running the the kernel collector test (./run-test.sh --kernel-collector-test ubuntu jammy64
according to https://github.com/open-telemetry/opentelemetry-network/blob/main/test/kernel/README.md#kernel_collector_test), I experienced timeouts. I am attaching the log for reference.
kernel-log.txt
The other problem relates to manual testing. I was hoping to run the ebpf collector pipeline in the ubuntu vagrant devbox, and get some deployments with long names to generate tcp traffic, yet I can't seem to run the ~/k8s/deploy.sh script to set up the pipeline (according to https://github.com/open-telemetry/opentelemetry-network/blob/main/dev/devbox/README.md#deploying-the-opentelemetry-ebpf-pipeline-and-optional-workloads-in-a-kubernetes-cluster-in-a-devbox-vm). I am guessing it is because the helm chart has been updated, but it was not reflected in the setup script. Are there any steps I am missing, or is there any other chart I can use?
+ microk8s helm install ebpf-net -f ebpf-net.yaml -f ebpf-net-logging-exporter.yaml --set=splunkObservability.realm=REALM --set=splunkObservability.accessToken=TOKEN splunk-otel-collector-chart/splunk-otel-collector
Error: INSTALLATION FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
splunk-otel-collector:
- (root): Additional property networkExplorer is not allowed
I am not sure if I have the right approach for this, could you please give me some pointers? What is the usual scenario for developing and testing new patches for the kernel collector? Sorry for dumping this here and thank you in advance, setting it all up has been a little frustrating.
@yonch After enabling extraction of full dns names in previous commits for the kernel collector, they had now been cut off from the right. After walking through the reducer pipeline I've managed to find the spot where the truncation is happening. The place is https://github.com/open-telemetry/opentelemetry-network/blob/d7441104b8739785466b46f3d9f777c1d39917a2/reducer/matching/flow_span.cc#L266, during the creation of agg_root
. Since agg_root::role2_t
(short_string<80>) is defined in generated code, I'm not quite sure how to change it, and what changing it would imply, so I'm going to need some help with this. How do you think I should proceed?
Here's some debug prints I put around the suspected code:
role_a: nc, role_b: tcp-server-service-abcdefghijklmnopqrstuvwxyzabcdef0123456789.ebpf-net-ns.svc.cluster.local, az_a: (null), az_b: (null)
role1: nc, role2: tcp-server-service-abcdefghijklmnopqrstuvwxyzabcdef0123456789.ebpf-net-ns.svc.cl, az1: , az2:
I believe enlarging the role field should solve the issue.
On another note, this account will likely not get CLA authorised, so once we figure out this issue I'd like to move to a different account and PR and archive this convo.
Thank you for persevering, sorry for missing your comment.
The specification for generated code is in the render directory. This might be the line: https://github.com/open-telemetry/opentelemetry-network/blob/d7441104b8739785466b46f3d9f777c1d39917a2/render/ebpf_net.render#L1328
The render language generated both messages, and the code and data structures to handle them (serialization/deserialization). Changing render on the aggregation core would change only components internal to the aggregation core so is safe, go ahead see if it increases the length throughout (iirc it should).
Hope that’s what you were looking for.. please lmk if I didn’t get it right.
Thanks @yonch, that was exactly what I was looking for. I managed to get it working and tested and moved it under my CLA authorized account, here: https://github.com/open-telemetry/opentelemetry-network/pull/266
superseded by #266, closing