Add vector package

Open europaul opened this issue 6 months ago • 11 comments

Description

🚀 What is Vector and why we're adding it

Vector is an observability data pipeline built in Rust. It provides a modular architecture to collect, transform, and route logs and metrics efficiently. We're adopting Vector to:

  • Preprocess log streams (e.g., filter out noisy logs, deduplicate repeated messages)
  • Throttle excessive log traffic
  • Enable dynamic and runtime-adjustable pipelines
  • Monitor pipeline performance via Prometheus metrics
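
To make the first two goals concrete, here is a toy sketch of what deduplication and throttling mean for a log stream. It is not Vector's implementation or configuration: the function names, the event shape (plain strings and `(timestamp, message)` pairs), and the cache, limit, and window parameters are illustrative assumptions.

```python
from collections import OrderedDict
from typing import Iterable, Iterator, Optional, Tuple


def dedupe(messages: Iterable[str], cache_size: int = 1000) -> Iterator[str]:
    """Drop messages already seen among the last `cache_size` distinct ones."""
    seen: "OrderedDict[str, None]" = OrderedDict()
    for msg in messages:
        if msg in seen:
            seen.move_to_end(msg)  # refresh recency of the repeated message
            continue
        seen[msg] = None
        if len(seen) > cache_size:
            seen.popitem(last=False)  # evict the least recently seen entry
        yield msg


def throttle(events: Iterable[Tuple[float, str]], limit: int,
             window_secs: float) -> Iterator[str]:
    """Pass at most `limit` messages per fixed window of `window_secs` seconds.

    Each event is a (timestamp, message) pair; excess messages are dropped.
    """
    window_start: Optional[float] = None
    count = 0
    for ts, msg in events:
        if window_start is None or ts - window_start >= window_secs:
            window_start, count = ts, 0  # start a new window
        if count < limit:
            count += 1
            yield msg
```

Chaining the two, with a filter in front, gives the general shape of the preprocessing described above.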

🔌 Integration with newlogd

Vector is integrated as a socket-based middleware layer between newlogd and its sinks. Logs are passed through Vector before being written to disk or uploaded. This setup allows us to keep newlogd as the primary logging agent while gradually introducing Vector’s advanced capabilities without disrupting existing workflows. Check out the LOGGING.md doc to learn more.


⚙️ Dynamic configuration support

To allow runtime updates of Vector’s behavior, we support user-supplied configuration via a base64-encoded payload. The flow is as follows:

  1. A new Vector config is uploaded (base64-encoded) via global options.
  2. newlogd decodes the config and writes it to /persist/vector/config/vector.yaml.new.
  3. Vector watches for changes using inotifywait:
    • If the new config is valid, it is promoted atomically to the live config.
    • If invalid, it is discarded, and the existing config remains active.

This ensures safe, crash-free config updates without requiring a container restart.
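
The flow above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `apply_new_config`, the `validate` callback, and the live file name `vector.yaml` are assumptions made for the example.

```python
import base64
import os
from pathlib import Path
from typing import Callable


def apply_new_config(payload_b64: str, config_dir: Path,
                     validate: Callable[[Path], bool]) -> bool:
    """Decode a base64 config, stage it as vector.yaml.new, and promote it
    to vector.yaml only if `validate` accepts it. Returns True if promoted."""
    staged = config_dir / "vector.yaml.new"
    live = config_dir / "vector.yaml"  # assumed name of the live config

    # Step 2 of the flow: decode the payload and write the candidate config.
    staged.write_bytes(base64.b64decode(payload_b64, validate=True))

    # Step 3: promote on success, discard on failure.
    if validate(staged):
        os.replace(staged, live)  # atomic rename on the same filesystem
        return True
    staged.unlink()  # the existing live config stays untouched
    return False
```

In a real deployment, `validate` would presumably wrap Vector's own check (for instance the `vector validate` subcommand); that wiring is an assumption and is not shown in this thread.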


📦 Other

  • I looked at vector's memory usage during my tests and it was never higher than 3MB
  • I had to increase ROOTFS_MAXSIZE_MB to 280MB (+10MB) to fit vector in. However, we can decrease it in the future if we remove components or functions that vector replaces
  • Right now we build our fork of the Vector project separately with only the necessary features https://github.com/rucoder/eve-mini-vector/

🧪 Next steps

  • Create a repo on Docker Hub for the new package
  • Integrate the package build process into its own Dockerfile instead of using a separate repo
  • Remove the log filtering and deduplication mechanisms. They are marked as deprecated for now.

PR dependencies

Depends on https://github.com/lf-edge/eve/pull/5009

How to test and validate this PR

The old tests should be used, since the overall functionality remains the same.

I will also provide more Eden tests to test the new vector.config parameter.

Changelog notes

Added Vector as a tool to transform our logs and metrics.

PR Backports

None

Checklist

  • [x] I've provided a proper description
  • [x] I've added the proper documentation
  • [x] I've tested my PR on amd64 device
  • [ ] I've tested my PR on arm64 device
  • [x] I've written the test verification instructions
  • [x] I've set the proper labels to this PR

europaul avatar Jun 25 '25 16:06 europaul

I suggest adding an AppArmor profile for vector.

shjala avatar Jun 26 '25 15:06 shjala

I apologize for coming to this 2 days late; I had kept it right in front of me and still got delayed.

I didn't quite get where vector fits into the pipeline, what part it either is replacing, or coming in between two (or more) existing parts. There is the doc and especially the diagram, but which of those parts is performed by which component?

deitch avatar Jun 27 '25 09:06 deitch

The old tests should be used, since the overall functionality remains the same.

I would also add test scenarios to verify at least the following:

  1. turn the vector filtering on/off
  2. change config
  3. some new transformation

OhmSpectator avatar Jun 27 '25 09:06 OhmSpectator

I apologize for coming to this 2 days late; I had kept it right in front of me and still got delayed.

I didn't quite get where vector fits into the pipeline, what part it either is replacing, or coming in between two (or more) existing parts. There is the doc and especially the diagram, but which of those parts is performed by which component?

the top part of the diagram is vector and the bottom one is newlogd. I'll make the titles a little bigger :)

europaul avatar Jun 27 '25 09:06 europaul

I apologize for coming to this 2 days late; I had kept it right in front of me and still got delayed.

I didn't quite get where vector fits into the pipeline, what part it either is replacing, or coming in between two (or more) existing parts. There is the doc and especially the diagram, but which of those parts is performed by which component?

Nice diagram here: https://github.com/lf-edge/eve/pull/5008/commits/e82f0e541fd230468106676d1c14e515abfff69b

A perfect commit where the vector is connected is here: https://github.com/lf-edge/eve/pull/5008/commits/debe6d7ecb76d8ee58d656749a7e51abc502ccc7

If the hashes are changed, just look through the list of commits and find those ones:

  1. docs: add Vector logging documentation
  2. connect vector to newlog through sockets

OhmSpectator avatar Jun 27 '25 09:06 OhmSpectator

There is that really good "EVE Logging Flows" diagram about ⅓ of the way through LOGGING.md. Where does vector fit within that diagram? Is it a subcomponent of one of them or a new one? If new, can we modify it to add so it is clear?

deitch avatar Jun 27 '25 09:06 deitch

There is that really good "EVE Logging Flows" diagram about ⅓ of the way through LOGGING.md. Where does vector fit within that diagram? Is it a subcomponent of one of them or a new one? If new, can we modify it to add so it is clear?

Oh... Good point. LOGGING.md should be updated within this PR, definitely. The diagram, I mean.

OhmSpectator avatar Jun 27 '25 09:06 OhmSpectator

@deitch @OhmSpectator there are no sources for the diagram that you mention, that's why I created a new diagram in the Vector section of the LOGGING.md document

europaul avatar Jun 27 '25 10:06 europaul

no sources

Ah, good point. The original commit was @naiming-zededa ; Naiming, do you have the source to the newlog diagram?

I also tried asking AI to convert it to mermaid. Here is the best I got so far. If I can improve it, I will (look at the source of the comment to see the original mermaid text):

flowchart LR
    A[containerd Processes] -->|logs| B
    C["Pillar & other services"] -->|logs| B
    B -->|logs| C
    B -->|log query| D

    E["Kernel messages (/dev/kmsg)"] -->|logs| D
    F["Syslog messages (/dev/log)"] -->|logs| D

    D -->|formatted logs| G[Temp log files\\in /persist/newlog/collect]
    D -->|gzipped logs| H[gzip log files\\in /persist/newlog/\\keepSentQueue\\devUpload\\appUpload]

    H -->|gzipped logs| I["loguploader service (https)"]
    I -->|API| J[Cloud Logging Services]

    %% Group memlogd and newlogd
    subgraph "Core Logging"
        B[memlogd Ringbuffer]
        D[newlogd container]
    end

    %% Group pillar components
    subgraph "Pillar Container"
        C
        I
    end

    %% Group volume storage
    subgraph "Pillar Volumes"
        G
        H
    end

    %% Force layout: stack memlogd above newlogd
    B -.-> D

deitch avatar Jun 27 '25 10:06 deitch

Try this mermaid one:

graph TD
    %% LEFT COLUMN (Sources)
    subgraph SR[Sources]
        A[containerd Processes]
        E["Kernel messages (/dev/kmsg)"]
        F["Syslog messages (/dev/log)"]
        C["Pillar & other services"]
    end

    B[memlogd Ringbuffer]

    %% MIDDLE COLUMN (Core Logging)
    subgraph CL[Core Logging]
        direction TB
        D[newlogd container]
    end

    %% RIGHT COLUMN (Pillar Container and Volumes)
    I["loguploader service (https)"]

    subgraph PV[Pillar Volumes]
        G[Temp log files\\in /persist/newlog/collect]
        H[gzip log files\\in /persist/newlog/\\keepSentQueue\\devUpload\\appUpload]
    end

    %% FLOW CONNECTIONS
    A -->|logs| B
    C -->|logs| B
    B -->|log query| D

    E -->|logs| D
    F -->|logs| D

    D -->|formatted logs| G
    D -->|gzipped logs| H
    H -->|gzipped logs| I
    I -->|API| J[Cloud Logging Services]

deitch avatar Jun 27 '25 10:06 deitch

@deitch isn't this diagram sufficient? docs/images/vector.drawio.png

europaul avatar Jun 27 '25 10:06 europaul

can we build it instead of pulling it?

@shjala yes, I will do it in the next iteration

Is there any way to configure it to access the API endpoint over UDS?

no, the API is just for metrics and similar stuff (and I removed it because we expose metrics through a prometheus exporter)

europaul avatar Jun 30 '25 09:06 europaul

go tests fail because the package eve-dom0-ztest is not yet published to Docker Hub and can only be found in linuxkit cache, while make test builds using docker, so it's looking for eve-dom0-ztest in docker images

europaul avatar Jun 30 '25 18:06 europaul

go tests fail because the package eve-dom0-ztest is not yet published to Docker Hub and can only be found in linuxkit cache, while make test builds using docker, so it's looking for eve-dom0-ztest in docker images

I hope @christoph-zededa has an idea about it

OhmSpectator avatar Jun 30 '25 18:06 OhmSpectator

go tests fail because the package eve-dom0-ztest is not yet published to Docker Hub and can only be found in linuxkit cache, while make test builds using docker, so it's looking for eve-dom0-ztest in docker images

I hope @christoph-zededa has an idea about it

With the help from @europaul I have this: https://github.com/lf-edge/eve/pull/5027/commits/06f12f3045d4253c879fba1dd19185f238db669b

christoph-zededa avatar Jul 01 '25 09:07 christoph-zededa

I managed to build both x86 and arm64 versions of mini-vector. Here is how I did it:

  1. Clone the original vector repo
  2. Install the cross tool like it's done in Vector's workflows
  3. Change Cargo.toml file to use the following compile flags:
[profile.release]
opt-level = "z"
debug = false
strip = true
lto = true
codegen-units = 1

and only the necessary features:

target-aarch64-unknown-linux-musl = [
  "sources-socket",
  "sources-internal_metrics",
  "transforms-logs",
  "sinks-socket",
  "sources-prometheus-scrape",
  "sinks-prometheus",
]
target-x86_64-unknown-linux-musl = [
  "sources-socket",
  "sources-internal_metrics",
  "transforms-logs",
  "sinks-socket",
  "sources-prometheus-scrape",
  "sinks-prometheus",
]
  4. Run make build-x86_64-unknown-linux-musl and make build-aarch64-unknown-linux-musl to generate vector binaries
  5. Add this Dockerfile:
FROM scratch AS target-amd64
ENV CARGO_BUILD_TARGET="x86_64-unknown-linux-musl"

FROM scratch AS target-arm64
ENV CARGO_BUILD_TARGET="aarch64-unknown-linux-musl"

FROM scratch AS target-riscv64
ENV CARGO_BUILD_TARGET="riscv64gc-unknown-linux-gnu"

FROM target-$TARGETARCH AS toolchain
COPY target/$CARGO_BUILD_TARGET/release/vector /usr/bin/vector

FROM alpine:3.21 AS runtime

COPY --from=toolchain /usr/bin/vector /usr/bin/vector
  6. Run docker buildx build --platform=linux/amd64,linux/arm64 -t paulzededa/eve-vector:0.0.5 --push . to build and push the vector base images to docker hub
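
The Dockerfile above picks the binary by mapping Docker's TARGETARCH to a Rust target triple. The same mapping, written out as a small helper purely for illustration (the function name is hypothetical; the triples and the target/<triple>/release/vector layout are taken from the Dockerfile):

```python
from pathlib import PurePosixPath

# Docker TARGETARCH -> Rust target triple, mirroring the Dockerfile stages.
TRIPLES = {
    "amd64": "x86_64-unknown-linux-musl",
    "arm64": "aarch64-unknown-linux-musl",
    "riscv64": "riscv64gc-unknown-linux-gnu",
}


def vector_binary_path(target_arch: str) -> PurePosixPath:
    """Return the path of the cross-compiled vector binary for a Docker arch."""
    try:
        triple = TRIPLES[target_arch]
    except KeyError:
        raise ValueError(f"unsupported TARGETARCH: {target_arch!r}") from None
    return PurePosixPath("target") / triple / "release" / "vector"
```

Note that only the amd64 and arm64 binaries are built in the steps above, so the riscv64 entry would point at a file that does not exist yet.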

Big thanks to @rene for helping figure out how to do the cross-compilation!

This is of course very hacky, but I haven't found a way yet to dockerize the build. cross needs to have access to docker and I don't think we can use docker in docker in our CI.

@rene I think the best way for now would be to really fork vector's repo, do the patching like described above and produce vector base images, that we'll later use from this vector package.

europaul avatar Jul 02 '25 14:07 europaul

@OhmSpectator I think this PR is ready to merge. I noted the following action items to be addressed in a follow-up PR:

  • find a way to dockerize the build or create a fork repo for vector-base
  • address the Fatals - vector shouldn't fail like this and bring the system down (probably a good way is to auto-restart vector from the container's entrypoint)
  • add AppArmor profile
  • see if something can be done about the image size
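
As a rough illustration of the auto-restart idea in the second item, an entrypoint could supervise the process along these lines. This is a sketch under assumptions, not the planned implementation: the function name, the restart limit, and the fixed delay are made up for the example, and a real container entrypoint would more likely be a shell script.

```python
import subprocess
import sys
import time
from typing import Sequence


def supervise(cmd: Sequence[str], max_restarts: int = 5,
              delay_secs: float = 1.0) -> int:
    """Run `cmd`, restarting it whenever it exits with a non-zero status.

    Returns the last exit code once the command exits cleanly or the
    restart budget is used up.
    """
    restarts = 0
    while True:
        code = subprocess.call(list(cmd))
        if code == 0 or restarts >= max_restarts:
            return code
        restarts += 1
        print(f"process exited with {code}, restart {restarts}/{max_restarts}",
              file=sys.stderr)
        time.sleep(delay_secs)  # avoid a tight crash loop
```

Capping the number of restarts keeps a persistently broken config from turning into an endless crash loop; what should happen after the cap is reached is left open here.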

The ARM builds keep failing due to 429 Too Many Requests...

europaul avatar Jul 02 '25 15:07 europaul

  • find a way to dockerize the build or create a fork repo for vector-base
  • address the Fatals - vector shouldn't fail like this and bring the system down (probably a good way is to auto-restart vector from the container's entrypoint)
  • add AppArmor profile
  • see if something can be done about the image size

I like the plan. Did you store it somewhere else? It would be useful to have it in our backlog so that others can observe it.

OhmSpectator avatar Jul 02 '25 16:07 OhmSpectator

Can power failure result in truncated/corrupted files for vector? One possible place is the config file, but I don't know if vector writes and reads to other files in /persist.

@eriknordmark I think you got the only one, thanks! The other ones are just copying files around at the startup, so in case of power failure they will just try again at the next startup - no info lost, the corrupt files will be overwritten.
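
For the config file specifically, one common way to avoid a truncated file after power loss is to write to a temporary file, flush it to disk, and rename it into place. A minimal sketch of that pattern follows. It is an assumption about how such a write could be hardened, not a description of what newlogd does in this PR, and the function name and the ".tmp" suffix are hypothetical.

```python
import os
from pathlib import Path


def write_file_durably(path: Path, data: bytes) -> None:
    """Write `data` to `path` so that a crash leaves either the old or the
    new content, never a partial file (POSIX semantics assumed)."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # force file contents to stable storage
    os.replace(tmp, path)  # atomic rename within the same directory

    # Persist the rename itself by syncing the containing directory.
    dir_fd = os.open(path.parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```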

europaul avatar Jul 11 '25 16:07 europaul

@eriknordmark @OhmSpectator please have another look, I think I addressed all the comments

europaul avatar Jul 14 '25 19:07 europaul

Could you please rebase it on master?

OhmSpectator avatar Jul 14 '25 21:07 OhmSpectator

Could you please provide end-to-end instructions for the verification team on how users are supposed to use it? Or have we already added Eden tests for that? I want to see a scenario where a user provides a custom transformation, uploads it, and then sees the results. End to end.

OhmSpectator avatar Jul 15 '25 12:07 OhmSpectator

Could you please provide end-to-end instructions for the verification team on how users are supposed to use it? Or have we already added Eden tests for that? I want to see a scenario where a user provides a custom transformation, uploads it, and then sees the results. End to end.

@OhmSpectator I added end-to-end integration tests in Eden like you requested. https://github.com/lf-edge/eden/pull/1083

that was actually very useful since it helped me discover and fix a couple of bugs in this PR :) thank you very much for being persistent ❤️

europaul avatar Jul 18 '25 14:07 europaul

Could you please provide end-to-end instructions for the verification team on how users are supposed to use it? Or have we already added Eden tests for that? I want to see a scenario where a user provides a custom transformation, uploads it, and then sees the results. End to end.

@OhmSpectator I added end-to-end integration tests in Eden like you requested. lf-edge/eden#1083

that was actually very useful since it helped me discover and fix a couple of bugs in this PR :) thank you very much for being persistent ❤️

I'm glad it helps =) Question. Regarding the PR into Eden. Will the new Eden tests start automatically as part of our Eden workflow? Or should we add them manually?

OhmSpectator avatar Jul 18 '25 15:07 OhmSpectator

Question. Regarding the PR into Eden. Will the new Eden tests start automatically as part of our Eden workflow? Or should we add them manually?

I'd say they should run automatically.

europaul avatar Jul 18 '25 15:07 europaul

I cannot get the Nvidia build done for this PR... I tried a lot... Do we have to address it, @rene, @rucoder?...

OhmSpectator avatar Jul 21 '25 10:07 OhmSpectator

I cannot get the Nvidia build done for this PR... I tried a lot... Do we have to address it, @rene, @rucoder?...

I guess the problem is with the cross-compiler setup for those platforms. @europaul did you consider them in the Dockerfile?

rucoder avatar Jul 21 '25 10:07 rucoder

I cannot get the Nvidia build done for this PR... I tried a lot... Do we have to address it, @rene, @rucoder?...

I guess the problem is with the cross-compiler setup for those platforms. @europaul did you consider them in the Dockerfile?

I only built vector for x86_64-unknown-linux-musl and aarch64-unknown-linux-musl. Do I need to build for another triple as well?

europaul avatar Jul 21 '25 11:07 europaul

I can just say, that it's not a problem of runners: The same runner-16 works here: https://github.com/lf-edge/eve/actions/runs/16373320810/job/46376212308?pr=5008 and does not work here: https://github.com/lf-edge/eve/actions/runs/16373320810/job/46318963950?pr=5008

OhmSpectator avatar Jul 21 '25 11:07 OhmSpectator

When it gets stuck, I see it gets stuck here:

#10 [build 5/6] RUN GO111MODULE=on CGO_ENABLED=0 go build -ldflags "-s -w -X=main.Version=v0.0.0-20250718143818-f69f29f8851a
" -mod=vendor -o /out/usr/bin/newlogd ./cmd
Error: The operation was canceled.

So, it's newlogd build.

OhmSpectator avatar Jul 21 '25 11:07 OhmSpectator