Add OpenTelemetry support

Open deejgregor opened this issue 2 years ago • 4 comments

WARNING: This is very much a draft.

This is being put up now so others can experiment, comment, etc. See the list of known issues at the end; these will need to be addressed before this moves out of draft state.

Changes

This adds OpenTelemetry infrastructure to zeek, in particular these pieces:

  • Infrastructure for tracing (zeek::trace provided by Trace.cc/Trace.h)
  • Initialization (zeek::trace::SetupTracing) and tear down (zeek::trace::EarlyShutdown/zeek::trace::Shutdown) including a (currently) always-enabled Jaeger exporter.
  • Creates spans in common/interesting code paths and records additional attributes where it makes sense (see below).
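
To make the "create a span around a code path and attach attributes" idea concrete, here is a minimal, stdlib-only sketch of the RAII pattern. Everything in it (`ScopedSpan`, `InMemoryExporter`, `FinishedSpan`, `WriteLogTraced`) is a hypothetical stand-in for illustration; it is not the zeek::trace code from this PR and not the opentelemetry-cpp API. The span name and the `path` attribute just mirror the DoWrite example mentioned later in this description.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only (assumption: not the PR's API).
struct FinishedSpan {
    std::string name;
    std::map<std::string, std::string> attributes;
    std::chrono::nanoseconds duration;
};

// Keeps finished spans in memory; a real exporter would ship them to a backend.
class InMemoryExporter {
public:
    void Export(FinishedSpan span) { spans_.push_back(std::move(span)); }
    const std::vector<FinishedSpan>& Spans() const { return spans_; }

private:
    std::vector<FinishedSpan> spans_;
};

// RAII span: starts timing on construction and exports itself on destruction,
// so the span covers exactly the enclosing scope.
class ScopedSpan {
public:
    ScopedSpan(InMemoryExporter& exporter, std::string name)
        : exporter_(exporter),
          name_(std::move(name)),
          start_(std::chrono::steady_clock::now()) {}

    ScopedSpan(const ScopedSpan&) = delete;
    ScopedSpan& operator=(const ScopedSpan&) = delete;

    ~ScopedSpan() {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        exporter_.Export(FinishedSpan{
            name_, attributes_,
            std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed)});
    }

    void SetAttribute(const std::string& key, const std::string& value) {
        attributes_[key] = value;
    }

private:
    InMemoryExporter& exporter_;
    std::string name_;
    std::map<std::string, std::string> attributes_;
    std::chrono::steady_clock::time_point start_;
};

// Example: wrap a (pretend) log write in a span tagged with the log path.
void WriteLogTraced(InMemoryExporter& exporter, const std::string& path) {
    ScopedSpan span(exporter, "WriterBackend::DoWrite");
    span.SetAttribute("path", path);
    // ... the actual write would happen here ...
}
```

Calling `WriteLogTraced(exporter, "dns")` leaves one finished span in the exporter once the function returns. Tying the span to scope is what keeps the instrumentation at each call site down to a line or two.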

Major areas that are traced

  • Process lifetime ("root span"), setup (zeek::detail::setup), and teardown (terminate_bro)
  • Event creation and dispatch
  • Log creation and write (log contents are shown in JSON on the DoWrite invocation and as span events)
  • Function invocation (this heavily borrows from the Prometheus exporter for Zeek, however it can avoid some of the trickiness because I'm not doing a plugin: https://github.com/esnet/zeek-exporter)
  • Session/Connection setup/tear down
  • Packet handling
  • Reporter and set_processing_status logs are stored as span events.

Asynchronous activity that is traced a bit differently than normal

By default, where meaningful asynchronous activity happens, the trace follows that activity (in particular connections, events, and logs), linking back to function calls where possible.
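
One way to picture the linking is sketched below, under the assumption that queued work is traced in its own trace and carries a span link back to whatever enqueued it. This is a stdlib-only illustration; `SpanContext`, `Span`, `Tracer`, and `EventQueue` are hypothetical names, the 64-bit IDs are a simplification, and none of this is the PR's actual code or the opentelemetry-cpp API.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only (assumption: not the PR's API).
struct SpanContext {
    std::uint64_t trace_id;
    std::uint64_t span_id;
};

struct Span {
    std::string name;
    SpanContext context;
    std::vector<SpanContext> links;  // contexts of related spans in other traces
};

class Tracer {
public:
    // Starts a span in a brand-new trace, optionally linked to other spans.
    Span StartRoot(std::string name, std::vector<SpanContext> links = {}) {
        SpanContext ctx{next_id_, next_id_ + 1};
        next_id_ += 2;
        return Span{std::move(name), ctx, std::move(links)};
    }

private:
    std::uint64_t next_id_ = 1;
};

// Remembers which span was active when each event was enqueued.
class EventQueue {
public:
    void Enqueue(std::string event_name, const Span& current) {
        pending_.push_back({std::move(event_name), current.context});
    }

    // Dispatches the oldest event in its own trace, linked back to the
    // span that enqueued it. Returns nothing if the queue is empty.
    std::optional<Span> DispatchNext(Tracer& tracer) {
        if (pending_.empty())
            return std::nullopt;
        Pending next = std::move(pending_.front());
        pending_.pop_front();
        return tracer.StartRoot("Event::Dispatch " + next.name, {next.origin});
    }

private:
    struct Pending {
        std::string name;
        SpanContext origin;
    };
    std::deque<Pending> pending_;
};
```

With this shape, the dispatch span does not extend the duration of the enqueuing span, but a trace viewer that understands span links can still jump from the dispatched event back to the function call that raised it.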

To use

  1. Install Docker. Note: on a Mac, you'll probably need the commercial Docker Desktop app to do this easily because Jaeger uses UDP and things like Colima have issues with UDP traffic.
  2. Run the Jaeger all-in-one image (see this for details): docker run -d --name jaeger -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 -p 5775:5775/udp -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 14250:14250 -p 14268:14268 -p 14269:14269 -p 9411:9411 jaegertracing/all-in-one:1.32
  3. Compile and install Apache Thrift (ideally using cmake?) -- this is used for the Jaeger exporter -- Apache Thrift cmake README
  4. Compile and install opentelemetry-cpp with Jaeger support enabled (see this for details): extract archive, cd into the extracted directory, run mkdir build && cd build && cmake -DWITH_JAEGER=ON .. && make -j && make install -- note: getting main so you get support for span links will get you better data in Jaeger (or patch in the changes from this PR).
  5. Build and install Zeek
  6. Run zeek like normal (ideally let it see some traffic). Note: you need to run zeek on the same host where you're running the Docker container from step 2 above because zeek will send Jaeger UDP export records to localhost:6831.
  7. Hit http://localhost:16686/ in your web browser
  8. Search and poke around (I suggest maybe looking at the operations root span, run loop iteration, and zeek::session::Session). Also, a search for the operation WriterBackend::DoWrite and tags of path=dns shows how much work is done for simple things like DNS.

cmake submodule update

This has a submodule update to the cmake submodule to add the FindThrift.cmake file.

Known issues

  • [ ] GPL COPYRIGHT ISSUE in opentelemetry-cpp! I would suggest not distributing source/binaries based on this that include proprietary code until this issue is resolved or worked around (it shouldn't be complex to do either, I just haven't gotten there yet).
  • [ ] Trace broker so we can see cross-process communication (this is actually what got me started with this idea).
  • [ ] Always links with OpenTelemetry today
  • [ ] Always uses Jaeger exporter today (and only the Jaeger exporter)
  • [ ] Configurability of the bits hard-coded in Trace.h
  • [ ] Always creates a lot of spans. This should be configurable and maybe sample-able.
  • [ ] Always creates some low-value spans in general use (e.g.: EventMgr::Enqueue and Event::Dispatch).
  • [ ] Log span events for zeek logging directly in the function that created them.
  • [ ] When tracing is disabled, we also lose logs. It would be nice to find a better way to handle conditional tracing and be able to (optionally?) let things like logged events bubble up to higher-level spans.
  • [ ] It would be nice to (optionally) see the entire packet.
  • [ ] If you leave it running for a while: [Error] File: /Users/dgregor/git/opentelemetry-cpp/exporters/jaeger/src/thrift_sender.cc:48[JAEGER TRACE Exporter] Append() failed: too large span
  • [ ] This very likely has memory issues because I really don't know C++. ;-)
  • [ ] Plenty more ... please add things here or note in comments, code notes, etc. on this PR.
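
For the "configurable and maybe sample-able" item above, a rough sketch of one possible approach is a ratio sampler that decides per trace ID, similar in spirit to the trace-ID-ratio sampling described in the OpenTelemetry specification. The `RatioSampler` class below is hypothetical and stdlib-only, and it assumes 64-bit, uniformly distributed trace IDs for simplicity (real OpenTelemetry trace IDs are 128-bit); it is not part of this PR.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical sampler for illustration only (assumption: not the PR's code).
// Keeps roughly `ratio` of all traces, assuming trace IDs are uniformly random.
class RatioSampler {
public:
    explicit RatioSampler(double ratio)
        : always_(ratio >= 1.0), threshold_(ToThreshold(ratio)) {}

    // Deterministic per trace ID, so every span in a trace gets the same answer.
    bool ShouldSample(std::uint64_t trace_id) const {
        return always_ || trace_id < threshold_;
    }

private:
    static std::uint64_t ToThreshold(double ratio) {
        if (!(ratio > 0.0))
            return 0;  // zero, negative, or NaN: sample nothing
        if (ratio >= 1.0)
            return std::numeric_limits<std::uint64_t>::max();
        // For ratio in (0, 1) the product is strictly below 2^64, so it fits.
        return static_cast<std::uint64_t>(
            ratio *
            static_cast<double>(std::numeric_limits<std::uint64_t>::max()));
    }

    bool always_;
    std::uint64_t threshold_;
};
```

Deciding on the trace ID rather than rolling a random number per span means a trace is either recorded in full or not at all, which avoids traces with missing parents.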

Other/Future considerations

  1. Zeek has some other pieces for tracing/debugging/etc. Maybe some of these frameworks can be used to improve OpenTelemetry instrumentation and/or be replaced with OpenTelemetry? (Note: although I'm using the Jaeger exporter here at this point, it doesn't need to be compiled in; the base OpenTelemetry libraries have almost no external dependencies.)
  2. Possibly using the OpenTelemetry metrics/logs infrastructure as they mature?

deejgregor avatar Mar 09 '22 20:03 deejgregor

@sethhall @ckreibich @awelzel since I've mentioned this to all of you, I thought you might be curious....

deejgregor avatar Mar 09 '22 20:03 deejgregor

Just to comment on this: we have seen this, it's great work! We will have to figure out more broadly how to move forward with telemetry, we have a few loose ends there currently in terms of APIs and areas of visibility. This will definitely be part of that.

rsmmr avatar Mar 29 '22 11:03 rsmmr

@rsmmr (and others) I'm happy to chat about this work and the broader sense of observability whenever it makes sense.

This has been a play project for me to learn more about Zeek and integrating OpenTelemetry into a different type of application. It's been pretty useful for me to learn on. And I think I see some ways that the OpenTelemetry API and/or library could be improved to make things like selective tracing easier for application developers.

If there are any pieces that it would help for me to work on sooner rather than later, please let me know. I think the biggest thing I'm most excited about is adding inter-process tracing via broker.

deejgregor avatar Apr 05 '22 20:04 deejgregor

Not quite sure how to proceed here, seems this will need more discussion/work to get it ready. Thinking to move into a discussion for now until somebody can pick it up. Thoughts?

rsmmr avatar Aug 11 '22 10:08 rsmmr

Closing this in favor of https://github.com/zeek/zeek/discussions/2351

timwoj avatar Aug 18 '22 18:08 timwoj