opentelemetry-network icon indicating copy to clipboard operation
opentelemetry-network copied to clipboard

Enable full DNS names

Open kubakukis14 opened this issue 10 months ago • 5 comments

Description: The limit for DNS hostnames being sent to the reducer increased from 80 characters to the full 256, as according to RFC 1035.

Link to tracking Issue: https://github.com/open-telemetry/opentelemetry-network/issues/256

Testing: automatic tests

Documentation: none

kubakukis14 avatar Apr 06 '24 23:04 kubakukis14

CLA Not Signed

Hi @kubakukis14 , awesome you're tackling this! I see the PR is still in draft but thought I'd see if I can help out.

You probably noticed this, but just to document here, that the bpf code copies a fixed amount of bytes from the DNS packets into perf rings, and then the userspace collector parses those bytes. So this change does not need to modify eBPF code.

I don't remember whether this change would require percolating through to and within the reducer. If running this patch locally doesn't seem to output the full length from the reducer, that might be where to look.

yonch avatar Apr 09 '24 16:04 yonch

I also triggered a build run, if helpful to your efforts

yonch avatar Apr 09 '24 16:04 yonch

Hey @yonch, I ran into some more problems trying to test the patch out, I was hoping you could help. Assume I use the build from the unchanged public repo for the following (other than the updated java and gradle versions for building).

First is running kernel collector tests. I have build the vagrant box for ubuntu jammy64 and the necessary test targets, and ran the header tests. Those had no problem running. However running the the kernel collector test (./run-test.sh --kernel-collector-test ubuntu jammy64 according to https://github.com/open-telemetry/opentelemetry-network/blob/main/test/kernel/README.md#kernel_collector_test), I experienced timeouts. I am attaching the log for reference. kernel-log.txt

The other problem relates to manual testing. I was hoping to run the ebpf collector pipeline in the ubuntu vagrant devbox, and get some deployments with long names to generate tcp traffic, yet I can't seem to run the ~/k8s/deploy.sh script to set up the pipeline (according to https://github.com/open-telemetry/opentelemetry-network/blob/main/dev/devbox/README.md#deploying-the-opentelemetry-ebpf-pipeline-and-optional-workloads-in-a-kubernetes-cluster-in-a-devbox-vm). I am guessing it is because the helm chart has been updated, but it was not reflected in the setup script. Are there any steps I am missing, or is there any other chart I can use?

+ microk8s helm install ebpf-net -f ebpf-net.yaml -f ebpf-net-logging-exporter.yaml --set=splunkObservability.realm=REALM --set=splunkObservability.accessToken=TOKEN splunk-otel-collector-chart/splunk-otel-collector
Error: INSTALLATION FAILED: values don't meet the specifications of the schema(s) in the following chart(s):
splunk-otel-collector:
- (root): Additional property networkExplorer is not allowed

I am not sure if I have the right approach for this, could you please give me some pointers? What is the usual scenario for developing and testing new patches for the kernel collector? Sorry for dumping this here and thank you in advance, setting it all up has been a little frustrating.

kubakukis14 avatar Apr 16 '24 15:04 kubakukis14

@yonch After enabling extraction of full dns names in previous commits for the kernel collector, they had now been cut off from the right. After walking through the reducer pipeline I've managed to find the spot where the truncation is happening. The place is https://github.com/open-telemetry/opentelemetry-network/blob/d7441104b8739785466b46f3d9f777c1d39917a2/reducer/matching/flow_span.cc#L266, during the creation of agg_root. Since agg_root::role2_t (short_string<80>) is defined in generated code, I'm not quite sure how to change it, and what changing it would imply, so I'm going to need some help with this. How do you think I should proceed?

Here's some debug prints I put around the suspected code:

role_a: nc, role_b: tcp-server-service-abcdefghijklmnopqrstuvwxyzabcdef0123456789.ebpf-net-ns.svc.cluster.local, az_a: (null), az_b: (null)
role1: nc, role2: tcp-server-service-abcdefghijklmnopqrstuvwxyzabcdef0123456789.ebpf-net-ns.svc.cl, az1: , az2:

I believe enlarging the role field should solve the issue.

On another note, this account will likely not get CLA authorised, so once we figure out this issue I'd like to move to a different account and PR and archive this convo.

kubakukis14 avatar Apr 26 '24 15:04 kubakukis14

Thank you for persevering, sorry for missing your comment.

The specification for generated code is in the render directory. This might be the line: https://github.com/open-telemetry/opentelemetry-network/blob/d7441104b8739785466b46f3d9f777c1d39917a2/render/ebpf_net.render#L1328

The render language generated both messages, and the code and data structures to handle them (serialization/deserialization). Changing render on the aggregation core would change only components internal to the aggregation core so is safe, go ahead see if it increases the length throughout (iirc it should).

Hope that’s what you were looking for.. please lmk if I didn’t get it right.

yonch avatar May 03 '24 11:05 yonch

Thanks @yonch, that was exactly what I was looking for. I managed to get it working and tested and moved it under my CLA authorized account, here: https://github.com/open-telemetry/opentelemetry-network/pull/266

kubakukis14 avatar May 16 '24 14:05 kubakukis14

superseded by #266, closing

yonch avatar May 21 '24 16:05 yonch