falco icon indicating copy to clipboard operation
falco copied to clipboard

feat(webserver): implement metrics endpoint

Open sgaist opened this issue 1 year ago • 21 comments

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

/kind release

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area build

/area engine

/area tests

/area proposals

/area CI

What this PR does / why we need it:

This PR implements the /metrics endpoint in the webserver providing currently prometheus metrics.

Which issue(s) this PR fixes:

Fixes #3131 Fixes #1772

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

feat(webserver): a metrics endpoint has been added providing prometheus metrics.

It can be optionally enabled using the new `metrics.prometheus_enabled` configuration option.
It will only be activated if the `metrics.enabled` is true as well.

sgaist avatar Mar 15 '24 21:03 sgaist

/milestone 0.38.0

incertum avatar Mar 17 '24 06:03 incertum

There's something that is escaping me with regard to the sinsp object initialization. It seems that when the webserver is started, which happens very late in the process, said sinsp objects are not fully initialized.

sgaist avatar Mar 26 '24 22:03 sgaist

Samuel, pulled your branch and wanted to test drive it, but it doesn't compile. Mind fixing the build? Thank you! I'll review the changes then. In addition, we still need to agree on the design wrt refreshing the metrics, @leogr to compile a final proposal based on suggestions we have so far. Thanks.

incertum avatar Mar 27 '24 18:03 incertum

Samuel, pulled your branch and wanted to test drive it, but it doesn't compile. Mind fixing the build? Thank you! I'll review the changes then. In addition, we still need to agree on the design wrt refreshing the metrics, @leogr to compile a final proposal based on suggestions we have so far. Thanks.

I knew I forgot to commit the cmake changes when I read your message 😅 However I missed the test build failure as I was sure that they were enabled.

All fixed now

sgaist avatar Mar 28 '24 09:03 sgaist

Great, if you look at the existing stats code you see we need a bunch of Ifdefs because of wasm, macos, windows ... etc that should fix those builds, but we can do it later as well.

incertum avatar Mar 28 '24 09:03 incertum

Builds are fixed however I am wondering which configurations are officially supported on non Linux platforms.

From the looks of it, on macOS, the non minimal build won't build.

sgaist avatar Mar 28 '24 12:03 sgaist

Builds are fixed however I am wondering which configurations are officially supported on non Linux platforms.

From the looks of it, on macOS, the non minimal build won't build.

Agreed it's not very clear ... I believe this should be made more transparent in the future. Out of scope for this PR.

incertum avatar Mar 28 '24 21:03 incertum

cc @falcosecurity/falco-maintainers please review the changes and let's try to focus on the general design. If we want this to fully land for 0.38.0 (consider the grafana chart part), we should review (and perhaps have a first merge) asap IMHO; then we can build more improvements on top of this, but we need to agree and merge the general design.

FedeDP avatar Apr 02 '24 07:04 FedeDP

The general design seems ok to me!

Andreagit97 avatar Apr 02 '24 08:04 Andreagit97

cc @falcosecurity/falco-maintainers please review the changes and let's try to focus on the general design. If we want this to fully land for 0.38.0 (consider the grafana chart part), we should review (and perhaps have a first merge) asap IMHO; then we can build more improvements on top of this, but we need to agree and merge the general design.

From my perspective what is missing for a PR merge:

  • Implement the outstanding metrics, see https://github.com/falcosecurity/falco/pull/3140/files#r1543632512
  • Update the comments of metrics config as well and point to the webserver for Prometheus support. Plus there are currently also no explanations under the webserver config. On that note metrics_enabled seems not a good name, instead proposing to use prometheus_metrics_enabled.
  • Address https://github.com/falcosecurity/falco/pull/3140/files#r1543745560

Anything else, like possibly also supporting fixed interval refreshes of the metrics can be deferred to a follow up PR.

incertum avatar Apr 02 '24 10:04 incertum

@sgaist

I'm just double-checking to help you get to the finish line.

I've just tracked these pending items:

  • a few metrics are still missing https://github.com/falcosecurity/falco/pull/3140/files#r1543632512
  • awaiting consensus here (impl. detail) https://github.com/falcosecurity/falco/pull/3140#discussion_r1559288321
  • awaiting consensus for the config option name https://github.com/falcosecurity/falco/pull/3140#discussion_r1559285656
  • some CI tests are failing

Is there anything else missing? Or other blockers? :thinking:

leogr avatar Apr 11 '24 13:04 leogr

I think there's one last issue with this, and it's in falco::app::run() in app.cpp. I mentioned earlier that I did not remember how the restarting worked and I did go check. Essentially, it waits for all "running" steps to complete and then executes some teardown steps. Problem is, the return of falco::app::actions::process_events is reached before falco::app::actions::stop_webserver. Before completing, process_events() calls inspector->close() for all registered data sources (syscalls, plugins etc) which would not completely free the inspector (so it's not a pointer ownership issue) but it will definitely invalidate many of the pointers inside, and at that point the server is still serving requests and accessing data, meaning that it may trigger an use after free when trying to collect stats and data by accessing some invalid pointer within the closed inspector.

I see two potential ways of solving this:

  • maybe we could, instead of calling inspector->close(), call inspector->stop_capture() and then close it only as a last teardown step; I'm not 100% sure about the consequences but it should be OK I think.
  • libs could return null pointers for every invalid piece of data on a closed inspector and the stats webserver will need to check for it, but it will spew garbage stats at that time (missing data essentially). Not a fan.

@jasondellaluce @FedeDP what do you think?

LucaGuerra avatar Apr 16 '24 14:04 LucaGuerra

I briefly spoke with @jasondellaluce and @FedeDP about the issue highlighted above. They agree a good solution is closing inspectors as a last step. I'm proposing a patch (https://github.com/falcosecurity/falco/pull/3169) so you can rebase on top of that once it's merged. We also started discussing about the general thread safety of the stats collector but that will probably need a bit more investigation.

LucaGuerra avatar Apr 17 '24 10:04 LucaGuerra

@sgaist few super minor change requests:

  • falcosecurity_falco_falco_version_info -> correct to raw metrics name "version" as we do it for the metrics output rule, that way it should resolve to the final name falcosecurity_falco_version_info
  • falcosecurity_falco_duration_timestamp_nanoseconds -> use metrics type METRIC_VALUE_UNIT_TIME_S_COUNT so it should resolve to the final name falcosecurity_falco_duration_seconds_total

incertum avatar Apr 23 '24 11:04 incertum

@sgaist few super minor change requests:

* falcosecurity_falco_falco_version_info -> correct to raw metrics name "version" as we do it for the metrics output rule, that way it should resolve to the final name `falcosecurity_falco_version_info`

* falcosecurity_falco_duration_timestamp_nanoseconds -> use metrics type `METRIC_VALUE_UNIT_TIME_S_COUNT` so it should resolve to the final name `falcosecurity_falco_duration_seconds_total`

All fixed, good catch !

sgaist avatar Apr 24 '24 07:04 sgaist

@sgaist few super minor change requests:

* falcosecurity_falco_falco_version_info -> correct to raw metrics name "version" as we do it for the metrics output rule, that way it should resolve to the final name `falcosecurity_falco_version_info`

* falcosecurity_falco_duration_timestamp_nanoseconds -> use metrics type `METRIC_VALUE_UNIT_TIME_S_COUNT` so it should resolve to the final name `falcosecurity_falco_duration_seconds_total`

All fixed, good catch !

Great job @sgaist !

I want to thank you for all the effort you're putting into this, we really appreciated! :pray:

leogr avatar Apr 24 '24 07:04 leogr

LGTM label has been added.

Git tree hash: 7266865d0b743e48936d17fda8c04dabcbaf4973

poiana avatar Apr 24 '24 09:04 poiana

Now just need to check CI fixes.

incertum avatar Apr 24 '24 09:04 incertum

LGTM label has been added.

Git tree hash: b0bf6a5ca9cc69fc00fd9bae353af6d4009cb5fd

poiana avatar Apr 29 '24 08:04 poiana

/assign

leogr avatar Apr 30 '24 16:04 leogr

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: FedeDP, incertum, sgaist

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [FedeDP,incertum]

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

poiana avatar May 03 '24 09:05 poiana

Thank you very much Samuel! You did a terrific job!

FedeDP avatar May 03 '24 09:05 FedeDP

:clap:

leogr avatar May 08 '24 08:05 leogr