falco
falco copied to clipboard
feat(webserver): implement metrics endpoint
What type of PR is this?
Uncomment one (or more)
/kind <>lines:
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind release
Any specific area of the project related to this PR?
Uncomment one (or more)
/area <>lines:
/area build
/area engine
/area tests
/area proposals
/area CI
What this PR does / why we need it:
This PR implements the /metrics endpoint in the webserver providing currently prometheus metrics.
Which issue(s) this PR fixes:
Fixes #3131 Fixes #1772
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
feat(webserver): a metrics endpoint has been added providing prometheus metrics.
It can be optionally enabled using the new `metrics.prometheus_enabled` configuration option.
It will only be activated if the `metrics.enabled` is true as well.
/milestone 0.38.0
There's something that is escaping me with regard to the sinsp object initialization. It seems that when the webserver is started, which happens very late in the process, said sinsp objects are not fully initialized.
Samuel, pulled your branch and wanted to test drive it, but it doesn't compile. Mind fixing the build? Thank you! I'll review the changes then. In addition, we still need to agree on the design wrt refreshing the metrics, @leogr to compile a final proposal based on suggestions we have so far. Thanks.
Samuel, pulled your branch and wanted to test drive it, but it doesn't compile. Mind fixing the build? Thank you! I'll review the changes then. In addition, we still need to agree on the design wrt refreshing the metrics, @leogr to compile a final proposal based on suggestions we have so far. Thanks.
I knew I forgot to commit the cmake changes when I read your message 😅 However I missed the test build failure as I was sure that they were enabled.
All fixed now
Great, if you look at the existing stats code you see we need a bunch of Ifdefs because of wasm, macos, windows ... etc that should fix those builds, but we can do it later as well.
Builds are fixed however I am wondering which configurations are officially supported on non Linux platforms.
From the looks of it, on macOS, the non minimal build won't build.
Builds are fixed however I am wondering which configurations are officially supported on non Linux platforms.
From the looks of it, on macOS, the non minimal build won't build.
Agreed it's not very clear ... I believe this should be made more transparent in the future. Out of scope for this PR.
cc @falcosecurity/falco-maintainers please review the changes and let's try to focus on the general design. If we want this to fully land for 0.38.0 (consider the grafana chart part), we should review (and perhaps have a first merge) asap IMHO; then we can build more improvements on top of this, but we need to agree and merge the general design.
The general design seems ok to me!
cc @falcosecurity/falco-maintainers please review the changes and let's try to focus on the general design. If we want this to fully land for 0.38.0 (consider the grafana chart part), we should review (and perhaps have a first merge) asap IMHO; then we can build more improvements on top of this, but we need to agree and merge the general design.
From my perspective what is missing for a PR merge:
- Implement the outstanding metrics, see https://github.com/falcosecurity/falco/pull/3140/files#r1543632512
- Update the comments of
metricsconfig as well and point to the webserver for Prometheus support. Plus there are currently also no explanations under thewebserverconfig. On that notemetrics_enabledseems not a good name, instead proposing to useprometheus_metrics_enabled. - Address https://github.com/falcosecurity/falco/pull/3140/files#r1543745560
Anything else, like possibly also supporting fixed interval refreshes of the metrics can be deferred to a follow up PR.
@sgaist
I'm just double-checking to help you get to the finish line.
I've just tracked these pending items:
- a few metrics are still missing https://github.com/falcosecurity/falco/pull/3140/files#r1543632512
- awaiting consensus here (impl. detail) https://github.com/falcosecurity/falco/pull/3140#discussion_r1559288321
- awaiting consensus for the config option name https://github.com/falcosecurity/falco/pull/3140#discussion_r1559285656
- some CI tests are failing
Is there anything else missing? Or other blockers? :thinking:
I think there's one last issue with this, and it's in falco::app::run() in app.cpp. I mentioned earlier that I did not remember how the restarting worked and I did go check. Essentially, it waits for all "running" steps to complete and then executes some teardown steps. Problem is, the return of falco::app::actions::process_events is reached before falco::app::actions::stop_webserver. Before completing, process_events() calls inspector->close() for all registered data sources (syscalls, plugins etc) which would not completely free the inspector (so it's not a pointer ownership issue) but it will definitely invalidate many of the pointers inside, and at that point the server is still serving requests and accessing data, meaning that it may trigger an use after free when trying to collect stats and data by accessing some invalid pointer within the closed inspector.
I see two potential ways of solving this:
- maybe we could, instead of calling
inspector->close(), callinspector->stop_capture()and then close it only as a last teardown step; I'm not 100% sure about the consequences but it should be OK I think. - libs could return null pointers for every invalid piece of data on a closed inspector and the stats webserver will need to check for it, but it will spew garbage stats at that time (missing data essentially). Not a fan.
@jasondellaluce @FedeDP what do you think?
I briefly spoke with @jasondellaluce and @FedeDP about the issue highlighted above. They agree a good solution is closing inspectors as a last step. I'm proposing a patch (https://github.com/falcosecurity/falco/pull/3169) so you can rebase on top of that once it's merged. We also started discussing about the general thread safety of the stats collector but that will probably need a bit more investigation.
@sgaist few super minor change requests:
- falcosecurity_falco_falco_version_info -> correct to raw metrics name "version" as we do it for the metrics output rule, that way it should resolve to the final name
falcosecurity_falco_version_info - falcosecurity_falco_duration_timestamp_nanoseconds -> use metrics type
METRIC_VALUE_UNIT_TIME_S_COUNTso it should resolve to the final namefalcosecurity_falco_duration_seconds_total
@sgaist few super minor change requests:
* falcosecurity_falco_falco_version_info -> correct to raw metrics name "version" as we do it for the metrics output rule, that way it should resolve to the final name `falcosecurity_falco_version_info` * falcosecurity_falco_duration_timestamp_nanoseconds -> use metrics type `METRIC_VALUE_UNIT_TIME_S_COUNT` so it should resolve to the final name `falcosecurity_falco_duration_seconds_total`
All fixed, good catch !
@sgaist few super minor change requests:
* falcosecurity_falco_falco_version_info -> correct to raw metrics name "version" as we do it for the metrics output rule, that way it should resolve to the final name `falcosecurity_falco_version_info` * falcosecurity_falco_duration_timestamp_nanoseconds -> use metrics type `METRIC_VALUE_UNIT_TIME_S_COUNT` so it should resolve to the final name `falcosecurity_falco_duration_seconds_total`All fixed, good catch !
Great job @sgaist !
I want to thank you for all the effort you're putting into this, we really appreciated! :pray:
LGTM label has been added.
Now just need to check CI fixes.
LGTM label has been added.
/assign
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: FedeDP, incertum, sgaist
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [FedeDP,incertum]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Thank you very much Samuel! You did a terrific job!
:clap: