
[Tracking] Opening multiple event sources in the same Falco instance

Open jasondellaluce opened this issue 2 years ago • 19 comments

Motivation

The plugin system allows Falco to open new kinds of event sources that go beyond the historical syscall use case. Recently, this has been leveraged to port the k8s audit log event source to a plugin (see: https://github.com/falcosecurity/plugins/tree/master/plugins/k8saudit, and https://github.com/falcosecurity/falco/pull/1952). One of the core limitations that comes from the plugin system implementation in the libraries is that a given Falco instance is capable of opening only one event source. In the example above, this implies that Falco instances are not able to ingest both syscalls and k8s audit logs together. This can instead be accomplished by deploying two distinct Falco instances, one for each event source.

Feature Requirements

  • (R1) A single Falco instance should be able to open more than one event source at once, in parallel
  • (R2) There should be feature parity and performance parity between having 2+ sources active in parallel in a single Falco instance and having 2+ single-source Falco instances with the same event sources

Proposed Solution

Release Goals

To be defined. This is out of reach for Falco 0.32.1.

Terminology

  • Capture Mode: A configuration of sinsp inspectors that reads events from a trace file
  • Live Mode: A configuration of sinsp inspectors that reads events from one of the supported live sources (kmod, ebpf, gvisor, plugin)

Design

  • (D1) The feature is implemented in Falco only, and mostly only affects the codebase of falcosecurity/falco. Both libsinsp and libscap will keep working in single-source mode
  • (D2) Falco manages multiple sinsp instances, one in each thread
  • (D3) Falco manages one or more instances of sinsp inspectors
    • If the # of inspectors is 1, everything runs in the main thread just like now
    • If the # of inspectors is 2+, each inspector runs in its own separate thread (see (R1)). The whole event data path happens in parallel within each thread (event production, data enrichment, event-rule matching, and output formatting). A rough sketch of this model is shown right after this list
  • (D4) If in capture mode, Falco runs only 1 inspector, configured to read events from a trace file
  • (D5) If in live mode, Falco runs 1 inspector for each active event source
    • If an event source terminates due to EOF being reached, Falco waits for the other event sources to terminate too
    • If an event source terminates with an error, Falco forces the termination of all the other event sources
  • (D6) There is 1 instance of the Falco Rule Engine (just like now), and we leverage/enforce thread-safety guarantees to make sure it is safe and non-blocking for different threads to perform event-rule matching
  • (D7) There is 1 instance of the Falco Output Engine (just like now), and we leverage/enforce thread-safety guarantees to make sure it is safe for different threads to send alerts when an event-rule match is found
    • Non-blocking guarantees are less of a concern here, because the number of alerts is orders of magnitude lower than the number of events
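
To make the live-mode execution model above more concrete, here is a minimal, self-contained sketch of (D3)/(D5)/(D6)/(D7). Every type in it is a placeholder invented for illustration, not the actual Falco or libsinsp code: it only shows the "one thread per event source, shared rule/output engines" shape.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Placeholder types invented for illustration; not the real Falco/libsinsp classes.
struct event { int source_idx; };

struct inspector                  // stands in for a single-source sinsp inspector
{
    int source_idx;
    bool next(event& evt)         // stands in for event production; false on EOF
    {
        evt.source_idx = source_idx;
        return false;             // pretend the source is immediately at EOF
    }
};

struct rule_engine                // one shared instance, partitioned by source (D6)
{
    bool match(const event& evt) { return evt.source_idx >= 0; }
};

struct output_engine              // one shared, thread-safe instance (D7)
{
    void emit(const event& evt) { std::printf("alert from source %d\n", evt.source_idx); }
};

int main()
{
    rule_engine rules;
    output_engine outputs;
    std::vector<inspector> inspectors = {{0}, {1}};   // e.g. syscall + k8saudit

    std::vector<std::thread> workers;
    for (auto& insp : inspectors)
    {
        // one thread per event source (D3): production, enrichment, rule
        // matching, and output formatting all happen inside this thread
        workers.emplace_back([&insp, &rules, &outputs] {
            event evt{};
            while (insp.next(evt))
            {
                if (rules.match(evt))
                {
                    outputs.emit(evt);
                }
            }
        });
    }
    for (auto& w : workers)
    {
        w.join();   // on EOF, wait for the other sources to terminate too (D5)
    }
    return 0;
}
```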

Technical Limitations of the Design

  • (L1) There cannot be 2+ event sources with the same name active at the same time
    • This would defeat the thread-safety guarantees of the Rule Engine, which are based on the notion of event source partitioning
    • Potential Workarounds (for the future, just in case):
      • Have more than one instance of the Rule Engine to handle the increased event source cardinality. For example, the second Rule Engine instance would cover all the second event source replicas, the third Rule Engine instance would handle the third replicas, and so on
      • We make the Rule Engine thread safe without the event source <-> thread 1-1 mapping assumptions. This is hardly achievable, because this would imply making the whole filtercheck system of libsinsp thread-safe too. Another naive solution would be to create one mutex for each event source to protect the access to the Rule Engine. In both scenarios, this would be hard to manage and performance would be sub-optimal
      • We have one Rule Engine for each source, which could become harder to manage. For example, rule files would need to be loaded by all the rule engines, which makes the initialization phase and hot-reloading slower too. However, this is something we can consider for the future.
  • (L2) Filterchecks cannot be shared across different event sources to guarantee thread-safety in the Rule Engine. The direct implication is that if a plugin with extractor capability is compatible with 2+ active event sources (e.g. json can extract from both aws_cloudtrail and k8s_audit), we need to create and initialize two different instances of the plugin (1 for each inspector)
    • Practically, this means that a given plugin instance will always extract fields coming from the same event source (a.k.a. subsequent calls to plugin_extract_fields will never receive events from two distinct event sources for the same initialized plugin state)
    • This limitation can actually be turned as a by-design feature, because doing the contrary would violate (R2)
    • Potential Workarounds (for the future, just in case):
      • Make field extraction thread-safe (hardly doable, see the points in (L1))

Technical Blockers

This is the list of things we must necessarily work on to make this initiative happen.

  • [x] (B1) The rule engine <-> inspector source index mapping needs to be handled in different ways for capture mode and live mode
    • In capture mode, the rule engine source index is the same as the source index in the plugin manager of the single inspector used in capture mode (with the exception of the syscall source, which is by convention the last source index after all the plugin ones)
    • In live mode, each rule engine source index should be uniquely assigned to the inspector that runs that event source in its own thread (see the toy sketch after this list)
    • https://github.com/falcosecurity/falco/pull/2182
  • [x] (B2) Plugins can potentially be loaded multiple times in order to be registered to each live mode inspector
    • In live mode, a single plugin with field extraction capability can be registered to all the inspectors configured with an event source it is compatible with
      • Note: This also applies to plugins with both field extraction and event sourcing capabilities. In this case, the plugin is registered and used with both its capabilities only in the inspector in which its event source is active, whereas it is registered only for its field extraction capability in all other event-source-compatible inspectors.
    • https://github.com/falcosecurity/falco/pull/2182
  • [x] (B3) The plugin API and the Plugin SDK Go should be revised to support multi-thread and concurrency assumptions (will likely be the only change needed outside of falcosecurity/falco)
    • [x] Plugin API:
      • Most API symbols will need to support being called concurrently, with every call having distinct ss_plugin_t*s
      • https://github.com/falcosecurity/libs/pull/547
    • [x] Plugin SDK Go:
      • Tracking the discussion in another issue: https://github.com/falcosecurity/plugin-sdk-go/issues/62
        • https://github.com/falcosecurity/plugin-sdk-go/pull/65
  • [x] (B4) Print-only Falco actions (e.g. list fields, list events, etc...) are dependent on the app state inspector. These need to be stateless and can be implemented by allocating a sinsp inspector on-the-fly, because they just access static information
    • https://github.com/falcosecurity/falco/pull/2097
  • [x] (B5) The Falco StatsWriter (-s option) is not thread-safe
    • https://github.com/falcosecurity/falco/pull/2109
  • [x] (B6) The Falco Rule Engine does not provide any thread-safety guarantees
    • https://github.com/falcosecurity/falco/pull/2081 (non-blocker)
    • https://github.com/falcosecurity/falco/pull/2082
  • [x] (B7) Signal-based actions (termination, restart, and output reopening) are not thread safe
    • https://github.com/falcosecurity/falco/pull/2091
  • [x] (B8) The Falco Output framework is not entirely thread safe
    • https://github.com/falcosecurity/falco/pull/2080
    • https://github.com/falcosecurity/falco/pull/2139
  • [x] (B9) Libsinsp and libscap have a bunch of global and static variables that limit our freedom of having multiple inspectors running in parallel (g_infotables, g_logger, g_initializer, g_filterlist, g_decoderlist, s_cri_xxx, libsinsp::grpc_channel_registry::s_channels, s_callback, g_event_info, g_ppm_events, g_chisel_dirs, g_chisel_initializer, g_syscall_code_routing_table, g_syscall_table, g_syscall_info_table, g_json_error_log)
    • This may seem like a lot, but we should be good to go as-is, because most of these are read-only tables or objects. The ones that actually bundle some logic are either thread-safe (g_logger), or used only by inspectors running the syscall event source. Since (L1) prevents two inspectors from running the syscall source at the same time, it should be safe to assume that no concurrent access will happen to the syscall-related globals.
  • [x] (B10) The whole Falco application logic should be revised to support multiple inspectors, multiple filtercheck factories, and to distinguish the capture-mode and live-mode use cases as defined in the Design section. This will require all previous (BXX) points to be satisfied first
    • https://github.com/falcosecurity/falco/pull/2182
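
As a concrete toy example of the live-mode part of (B1) (hypothetical names; the real logic lives in https://github.com/falcosecurity/falco/pull/2182): since event source names are unique across live inspectors (see (L1)), each source can simply be mapped to its own rule engine source index.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Conceptual sketch only, not the actual Falco implementation: in live mode,
// each inspector handles exactly one event source, so every source name can
// be assigned a unique rule engine source index.
struct live_inspector { std::string source_name; };

std::unordered_map<std::string, std::size_t>
assign_engine_source_indexes(const std::vector<live_inspector>& inspectors)
{
    std::unordered_map<std::string, std::size_t> index_of;
    for (const auto& insp : inspectors)
    {
        // source names are unique across live inspectors (see (L1)),
        // so each source maps to exactly one engine-side index
        index_of.emplace(insp.source_name, index_of.size());
    }
    return index_of;
}
```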

Nice to Have

  • [x] (N1) Add a new --enable-source=xxx option as a dual to --disable-source=xxx. This design implies that the active event sources are chosen in an opt-out fashion: every loaded source gets activated, with the exception of the disabled ones. The --enable-source option will make the UX better for defining the only sources users want to activate
    • https://github.com/falcosecurity/falco/pull/2085
  • [x] (N2) Improve the regression testing framework falco_test.py to support selecting the active event source. Without it, all non-syscall tests will hang or fail, because the syscall event source is implicitly activated along with the testing-subject one and will cause Falco to not terminate (example: k8s audit tests)
    • https://github.com/falcosecurity/falco/pull/2085
  • [x] (N3) Reduce the threadiness of the /healthz webserver (based on cpp-httplib). The webserver library documents that the default threadiness is 8 or std::thread::hardware_concurrency(). This is OK, but since we're moving to a multi-threaded model we should consider limiting the number of threads spawned by Falco. In my 8-core setup, Falco spawned 30 threads (only a few were active, luckily) in a simple test with syscalls and 2 Go plugins loaded (a snippet showing how the cap can be set follows below).
    • https://github.com/falcosecurity/falco/pull/2090
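
For reference on (N3), cpp-httplib allows overriding the worker pool of the embedded server. This standalone sketch (not Falco's actual webserver code; the pool size of 2 and port 8765 are arbitrary values chosen for illustration) shows the knob:

```cpp
#include <httplib.h>

int main()
{
    httplib::Server srv;

    // Override the server's task queue: by default cpp-httplib spawns
    // 8 threads or std::thread::hardware_concurrency(). Capping the pool
    // keeps the health endpoint from inflating the overall thread count.
    srv.new_task_queue = [] { return new httplib::ThreadPool(2); };

    srv.Get("/healthz", [](const httplib::Request&, httplib::Response& res) {
        res.set_content("{\"status\": \"ok\"}", "application/json");
    });

    srv.listen("0.0.0.0", 8765);
    return 0;
}
```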

Linked Discussions

  • https://kubernetes.slack.com/archives/CMWH3EH32/p1655762074442099
  • https://kubernetes.slack.com/archives/CMWH3EH32/p1655209558876649
  • https://kubernetes.slack.com/archives/CMWH3EH32/p1647969834426869
  • https://kubernetes.slack.com/archives/CMWH3EH32/p1646148516502339?thread_ts=1646061690.259599&cid=CMWH3EH32
  • https://kubernetes.slack.com/archives/CMWH3EH32/p1645610199932789?thread_ts=1645396079.322259&cid=CMWH3EH32
  • https://kubernetes.slack.com/archives/CMWH3EH32/p1645038583559959?thread_ts=1645034669.067289&cid=CMWH3EH32
  • https://github.com/falcosecurity/falco/issues/2110

jasondellaluce avatar Jun 20 '22 08:06 jasondellaluce

I think this is something we should aim for over the next few months. So, I propose to turn this issue into a tracking issue where I'll share the thought process behind my proposed solution and all the steps and PRs involved. I expect this to be the place in which we can have discussions about this solution, and perhaps converge to something we all agree on.

/milestone 0.33.0

jasondellaluce avatar Jun 22 '22 08:06 jasondellaluce

I'm gonna share the thought process that led me to the proposed solution documented in the issue description. @leogr has been involved in the early discussion of this topic.

Since the end goal is to run multiple sources at the same time, many execution models have been hypothesized, but each of them has been discarded in favor of the proposed solution. I'm gonna proceed and briefly document their design and cost/benefit tradeoffs.

  • (H1) Single-threaded model, with multiple event sources opened by libscap
    • This would be simple to implement, but would require lots of refactorings of the libraries
    • This would not work in the general case, because performance would be tremendously degraded depending on the event consumption policy of libscap (e.g. round-robin)
    • Each event source would be blocked by the other ones. Since there is no buffering or multi-threading, every event production request is sequential. This means that "slow" event sources would become the bottleneck for all the other ones. In the case of syscalls, this would be totally unacceptable in terms of event drop metrics
  • (H2) Multi-threaded model, with parallelization at the libscap level: libscap would become able to open multiple event sources in parallel and consume all of them from a single thread
    • Not too hard to implement
    • "Slow" event sources would not impact the others anymore
    • Producing events in parallel would require a queue-like synchronization system, similar to our syscall ringbuffer, which would also require backpressure control support and memory copies. Both of these would degrade performance to an unacceptable level
  • (H3) Multi-threaded model, with parallelization at the libsinsp level: libsinsp manages multiple scap instances in parallel
    • Costs and benefits are totally analogous to (H2). In this case, libsinsp would become the sequential bottleneck

Reasonably, going up in the stack the next synchronization point would be Falco itself, which is exactly the solution we proposed. This is the only design we came up with that ensures total and non-blocking parallelization of each event source's processing routine. The only sequential bottleneck becomes the Falco Output Engine, which is natively designed to be thread-safe(~ish), and has the benefit that the number of alerts is orders of magnitude lower than the number of events produced by each event source (syscalls in particular), which ensures acceptable performance and negligible synchronization overhead.

jasondellaluce avatar Jun 22 '22 10:06 jasondellaluce

Nice feature and write-up, @jasondellaluce!

For D6, have you thought about issues of fairness, etc. when the rate of system calls far exceeds the rate of other log sources? I think that some queue management will likely be necessary to avoid contention/blocking. A thread pool in the rules engine could also help with throughput, at the expense of allowing events to be exported out of order.

araujof avatar Jun 22 '22 16:06 araujof

@araujof thanks for reviewing, those are all good points!

For D6, we can achieve good thread-safety guarantees from the Rule Engine with the assumption that every thread gets assigned a different event source. This holds because the Rule Engine is entirely partitioned by event source. See this for more details: https://github.com/falcosecurity/falco/pull/2082. Assuming that the Rule Engine level can provide wait-free concurrent access, I'd say fairness should be guaranteed at that level.

Then, once alerts are triggered they go through the Output Engine, where fairness and ordering are a big topic instead. Luckily, thanks to @leogr we already have a concurrent ordered queue there. There may be a fairness issue if the queue gets filled up with alerts coming from one event source, but that's an unlikely scenario (if Falco fills up the queue, it means that either the outputs are all blocked, or that some rule is really noisy). A simple workaround would be to increase the queue size. We only have one worker on the other side of the queue though, and we might consider using a thread pool in the future.
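
Just to illustrate the model (this is not the actual Falco outputs code, which uses a concurrent bounded ordered queue): the event-source threads act as producers calling push(), and a single worker thread repeatedly calls pop() and forwards alerts to the configured outputs.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

// Minimal single-worker queue sketch: many producer threads, one consumer.
class alert_queue
{
public:
    void push(std::string alert)
    {
        {
            std::lock_guard<std::mutex> lk(m_mtx);
            m_queue.push(std::move(alert));
        }
        m_cv.notify_one();
    }

    std::string pop()  // called by the single output worker thread
    {
        std::unique_lock<std::mutex> lk(m_mtx);
        m_cv.wait(lk, [this] { return !m_queue.empty(); });
        std::string alert = std::move(m_queue.front());
        m_queue.pop();
        return alert;
    }

private:
    std::mutex m_mtx;
    std::condition_variable m_cv;
    std::queue<std::string> m_queue;
};
```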

What do you think? Does this address your concerns?

jasondellaluce avatar Jun 22 '22 19:06 jasondellaluce

Hi @jasondellaluce,

What do you think? Does this address your concerns?

Sounds good to me! Thanks for the pointer to #2082.

araujof avatar Jun 23 '22 13:06 araujof

Thanks for the very thorough write up! I completely agree that this is the best/least-bad approach to take. Here are some more specific comments:

(D1): you're right that only falco will be opening multiple inspectors. But (B9) is going to result in a lot of changes to make opening multiple inspectors possible.

(D4): outside of test cases/ad-hoc testing, I don't think anyone would run falco on a trace file, right? Might want to treat this as a special case and not try too hard to support multiple inspectors.

(B3): I don't recall the implementation, but why can't you partition around handles or inspectors to avoid making the async readers concurrent?

(B6): Any concurrency improvements will be limited to partitioning by event source, right?

(B8): Instead of removing the token bucket, you could move it to the other side of the outputs queue, right?

(B9): You're right that these might be safe because they are read-only, but I think this would be a great time to move them into the inspector, simply as a cleanup step. I think there are some other potential globals like the dns manager singleton that need to be cleaned up as well.

mstemm avatar Jun 29 '22 22:06 mstemm

Hey Mark, responding point by point:

  • I did some deep research on this, and my opinion is that (B9) is not a blocker for (D1), with the assumption that only one inspector opens the syscall event source. With this assumption, all the statics and globals in libsinsp are either read-only (e.g. all the tables), used only for syscalls (e.g. container engine), or thread-safe (e.g. g_logger). So as long as we have (L1) we should be good to go for this, I think
  • I agree, capture files should not be the core focus of this, but we still need to support the feature. This case is quite easy to manage as it only has to use one inspector; it's just handled differently than the live mode case
  • The issue with the async extractor is that we have only 1 async worker running for extraction. So far, we only had one ss_plugin_t* active at a time for each plugin, so this was a single-consumer-single-producer model. Now that a given plugin can be initialized 2+ times at the same time (e.g. json extracts fields from 2 different event sources), this can in general become a multiple-consumer-single-producer model, which the code is not ready to handle. The approach you're proposing here seems like the Hard proposed solution that I listed (1 worker for each plugin instance), which is OK but requires lots of testing to make sure all those workers don't end up consuming tons of CPU time. I think option 1 or 2 of the proposed solutions might be a good start for the first release of this feature
  • Right, partitioning the engine by event source like we already do should be enough. The only exception is the stats_manager internal class, which would be shared across all the sources. But that's an easy fix, because we can turn each counter into an atomic one with relaxed memory order guarantees (see the sketch after this list). Moreover, this should not impact performance because it gets into play only when a rule is triggered. See #2082 for more context.
  • Yes, I could move the token bucket to the other side of the queue, but this would mean sharing it across all the active event sources, which opens the door to many security concerns. Let's say that we have a very noisy event source+ruleset: we risk that alerts coming from other event sources get discarded due to the shared rate limiter. I strongly advocate that event sources should not interfere with each other as much as possible, also because sharing the limiter would introduce different behavior compared to running the same event sources in distinct Falco instances (and we want to achieve feature parity). Besides, the rate limiter has been indicated many times as risky for the scope of security, so I advocate that dropping it is the cleanest solution here.
  • This also connects with your first point. I totally agree with this: removing static and global variables from libsinsp is a cleanup we should really aim for. However, I expect it to be a sizable cleanup effort, because many of them are deeply rooted in many parts of the code (e.g. g_filterchecklist). That said, the way this feature is designed aims to make its development orthogonal to this kind of cleanup. In this way, both efforts can proceed without blocking each other.
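
A minimal sketch of the stats_manager point in the list above (hypothetical names, not the real falco_engine internals): each per-rule counter becomes an atomic bumped with relaxed ordering, which is enough because we only need eventually-consistent totals.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Illustrative sketch: per-rule counters as relaxed atomics, so threads
// handling different event sources can update them without locking.
class rule_stats
{
public:
    explicit rule_stats(std::size_t num_rules) : m_counters(num_rules) {}

    void on_rule_triggered(std::size_t rule_idx)
    {
        // relaxed ordering: we only need an eventually-consistent total,
        // not synchronization with other memory operations
        m_counters[rule_idx].fetch_add(1, std::memory_order_relaxed);
    }

    std::uint64_t count(std::size_t rule_idx) const
    {
        return m_counters[rule_idx].load(std::memory_order_relaxed);
    }

private:
    std::vector<std::atomic<std::uint64_t>> m_counters;
};
```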

jasondellaluce avatar Jun 30 '22 08:06 jasondellaluce

Hi Jason, sorry, I'm late to the party, so take these comments with a grain of salt. Since the falco rules engine is pretty stateless, would it be easier to just build one engine per source? One could mitigate some of the hot loading issues by compiling the rules once, and cloning new engines in the background and only switching over when the new ones are ready. One engine per source may also help avoid contention issues, lessen drops, and make it easier to debug. Do you want to be able to overlay events from multiple sources on top of one another?

terylt avatar Jul 14 '22 14:07 terylt

Thanks for reviewing @terylt! This is actually a very valid suggestion, and it's something we should consider. The clear benefit of this is that multiple event sources with the same name could be loaded (with the exception of syscall, which carries other technical constraints), and that falco_engine would not need to be thread-safe anymore.

The downside of this is that rule files would need to be loaded multiple times (slower initialization time). Cloning the engine would not work, because each engine would be configured with a given event source, so they would all be different and rely on distinct sets of available rule fields. Also, it would make hot-reload features like the one described in https://github.com/falcosecurity/falco/issues/1692#issuecomment-1110951413 harder to implement.

At the same time, this is the one change we can consider making even later into the process. So since we don't have data races in our current Engine implementation, my call would be to have a first feature release that uses only one engine, and eventually switch to multiple ones if we find limitations while experimenting. What do you think?

jasondellaluce avatar Jul 14 '22 15:07 jasondellaluce

@jasondellaluce What takes the longest for initialization of the engine? Is it compiling the rules into a datastructure? If so, is that data structure read-only? I.e., could it be shared across multiple engines? That way you'd only have to compile it once, and share it. If it's not being modified by any of the engines, you could get away with one. If we have a list of major initialization costs, we could figure out ways to reduce each one.

In terms of multi engine vs. single engine, I don't have a strong opinion and am happy with whatever route you want to take. Just sharing from my past experience on this topic: I find pushing all events into a single pipe can be harder to get right coding-wise and can introduce overheads in context-switching and lock contention, so I tend to gravitate towards avoiding locking and contention issues when I can. I'm happy to help or support in any way I can during the building of this feature, whichever approach you decide.

terylt avatar Jul 14 '22 15:07 terylt

@terylt the initialization cost of the rule engine is related to loading the rule files themselves. If we decide to have one engine for each event source, each engine has to load all the configured rule files, because the way rule files are loaded is influenced by the event sources known by the rule engine. For example, the engine throws errors if a rule uses a field that is not defined for the rule's event source, and skips rules of unknown event sources. There would be nothing to share between engines, other than lists and the abstract rule definitions. Plus, rule file reading and compilation into an evaluable data structure currently happen in one step, so this approach would require relevant refactoring at that level.

The reason why I haven't yet done deeper research on this route is that having 1 shared rule engine still does not require any locking mechanism in our case. The rule engine is currently totally partitioned by event source, so as long as we assign 1 event source to 1 thread, everything will work as-is. The only synchronization point is the rule stats manager inside the rule engine, which can just be turned into an atomic counter.

EDIT: another point for having 1 shared engine is that we have aggregated stats by design. If we do otherwise, we'd need some way to merge the results of stats_managers coming from different engines.

jasondellaluce avatar Jul 14 '22 15:07 jasondellaluce

(D1) The feature is implemented in Falco only, and mostly only affects the codebase of falcosecurity/falco. Both libsinsp and libscap will keep working in single-source mode

Also late to the party, but we have a use case for wanting support for this at the libs level (using the libs as the basis for a collector/sensor, but none of the other bits). Might not be a common use case, but figured I'd mention it 😀.

Looking at the discussion around (B9) my vote would be to go with the @mstemm's suggestion and try to find a way to cleanly handle the globals, and (perhaps?) be able to go the route of adding libs-level support?

(B9): You're right that these might be safe because they are read-only, but I think this would be a great time to move them into the inspector, simply as a cleanup step. I think there are some other potential globals like the dns manager singleton that need to be cleaned up as well.

dwindsor avatar Jul 14 '22 17:07 dwindsor

@dwindsor thanks for reviewing!

Also late to the party, but we have a use case for wanting support for this at the libs level (using the libs as the basis for a collector/sensor, but none of the other bits). Might not be a common use case, but figured I'd mention it 😀.

No worries, we're still in the process of designing and developing this. As per the analysis in https://github.com/falcosecurity/falco/issues/2074#issuecomment-1162922363, I couldn't find an optimal way to support event source parallelism at either the libscap or the libsinsp level.

I think your point is very valid: supporting this in libs could be valuable for all the libs stakeholders. Do you have something specific in mind? Given the analysis linked above, here's my proposal. We could maintain the assumption that each sinsp inspector is single-source and single-thread, and then create another library inside libs that implements the inspector parallelization logic. Such a library would sit inside something like userspace/libsinsp-multi-source, and would become optionally usable for consumers that wish to run more than one inspector, with different event sources, and eventually with multi-threading. This would make the Falco code easier to write, would have no effect on the current libsinsp/libscap implementations, and would also enable libs consumers to benefit from this new feature. Let me know if this would work for you, so that I can proceed with a POC.
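
For the sake of discussion, here is a very rough sketch of what the surface of such a library could look like (every name is made up for illustration; this is not an API proposal yet): each registered inspector stays single-source and single-thread, and the wrapper just owns one consumption thread per inspector.

```cpp
#include <functional>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical interface sketch of a "libsinsp-multi-source" helper:
// the consumer registers already-configured single-source inspectors and
// provides the per-source consumption loop; the runner owns the threads.
class multi_source_runner
{
public:
    using inspector_handle = std::shared_ptr<void>;  // stands in for sinsp* or similar
    // the consumer's per-source loop; runs in a dedicated thread until the source ends
    using source_loop = std::function<void(const inspector_handle&)>;

    void add(inspector_handle insp) { m_inspectors.push_back(std::move(insp)); }

    void run(const source_loop& loop)
    {
        std::vector<std::thread> workers;
        for (auto& insp : m_inspectors)
        {
            workers.emplace_back([&insp, &loop] { loop(insp); });
        }
        for (auto& w : workers)
        {
            w.join();
        }
    }

private:
    std::vector<inspector_handle> m_inspectors;
};
```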

Looking at the discussion around (B9) my vote would be to go with the @mstemm's suggestion and try to find a way to cleanly handle the globals, and (perhaps?) be able to go the route of adding libs-level support?

I agree with both you and @mstemm. My opinion here is that these (Falco multi-evt source and libs globals cleanup) should follow two separate workstreams and efforts. The former could be properly implemented even without the latter, so I'd propose to work on both, even in parallel if there is enough developer capacity. My fear is that removing sinsp's globals will be a bit tedious.

jasondellaluce avatar Jul 15 '22 10:07 jasondellaluce

FYI, I moved the discussion of point (B3) about the Plugin SDK Go into another dedicated issue. There are some interesting challenges involved, and I felt like it was worth documenting the thought process so that everyone could jump in and share some feedback.

👉🏼 https://github.com/falcosecurity/plugin-sdk-go/issues/62

jasondellaluce avatar Jul 16 '22 16:07 jasondellaluce

Just my 2c here! I agree: we can provide an abstraction library on top of libsinsp; at the very same time, I don't really think it matters: since "opening a sinsp inspector" won't require any synchronization, clients of the libs can just open multiple inspectors, binding each of them to a new thread. Therefore, I can't really see the practicality of yet another library just to abstract multi-libsinsp support. In any case, supporting something like that should not be a big deal in the end.

My opinion here is that these (Falco multi-evt source and libs globals cleanup) should follow two separate workstreams and efforts. The former could be properly implemented even without the latter, so I'd propose to work on both, even in parallel if there is enough developer capacity. My fear is that removing sinsp's globals will be a bit tedious.

Agree with this also! We would be thrilled to see contributions in this sense too :)

Btw this is a huge effort and will be the feature of Falco 0.33 IMHO, and this discussion seems to confirm that! So glad @jasondellaluce started this thread :)

FedeDP avatar Jul 16 '22 17:07 FedeDP

Do you have something specific in mind?

@jasondellaluce thanks for the detailed analysis 🙏. Thinking about this more, I'm not sure our use case should be addressed by adding libs-level support for ingesting plugin data.

We currently are collecting syscall telemetry using libsinsp, serializing the telemetry and dispatching it to an analysis service, which then generates alerts, etc based upon state collected.

The use case I was thinking of was to write a plugin, e.g. a uprobe that attaches to libssl and returns connection-related data, and have that plugin's data routed through libsinsp so we can receive that telemetry without installing full Falco.

But, I feel that libsinsp might not be the appropriate place for handling plugin logic. sinsp, to me, is for low-level system inspection (syscalls, etc) - highly granular data that comes in at a very high rate. SSL connection data doesn't feel like it belongs in a system inspection library.

We could maintain the assumption that each sinsp inspector is single-source and single-thread, and then created another library inside libs that implements the inspector parallelization logic.

Yeah, we'd need another level of indirection here, and also account for the hugely different speeds of events coming in... I agree with @FedeDP that it wouldn't be worth it.

We're already able to ingest other forms of telemetry alongside libsinsp syscall telemetry, so no big deal! 😀

dwindsor avatar Jul 17 '22 19:07 dwindsor

The use case I was thinking of was to write a plugin e.g. a uprobe that attaches to libssl and returns connection-related data, and having that plugin's data routed through libsinsp so we can receive that telemetry without installing full Falco.

@dwindsor I think your use case would be well suited to be implemented as a plugin. Falco plugins are actually "libs" plugins: the plugin framework is implemented at the libscap and libsinsp level. In fact, the goal of the plugin system was to allow collecting events and extracting fields from them from any kind of event source, just like we traditionally did for syscalls.

Yeah, we'd need another level of indirection here, and also account for the hugely different speeds of events coming in... I agree with @FedeDP that it wouldn't be worth it.

Agreed on avoiding another level of indirection. The "different speeds of events coming in" sort of confirms the decision of having single-source-single-thread sinsp inspectors. With this assumption, the different event rates are not relevant, as you can have multiple inspectors running in parallel, each with a dedicated thread and event source, and then collect events from each of them whenever available. In this model, event sources do not influence each other and the only "sequential bottleneck" is the final event collection point, which will likely happen in a shared place. This is what I plan to do with Falco, with the extra step of evaluating event rules in parallel too and then collecting only matching rule alerts, which have a fairly lower rate.

In this perspective, I think the use case you described could be achieved by:

  • Developing a plugin in either C/C++ or Go to collect SSL connection-related data and defining fields to be extracted from that kind of event
  • In your application, spawning two distinct sinsp inspectors in two threads, one configured with your new plugin and one for syscall collection
  • Collecting and serializing the telemetry from both inspectors and dispatching it to your analysis service

In this way, you'd collect both types of events in the same way. I think this could be valuable for the community too, I would be happy to help if you think this route makes sense to you.
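
To make the suggested structure concrete, here is a minimal sketch of the consumption side. The plugin itself and the inspectors' open/registration calls are intentionally omitted, since their exact form depends on the libsinsp version and on the plugin; the per-thread next() loop is the part to take away.

```cpp
#include <sinsp.h>

#include <cstdint>
#include <functional>
#include <thread>

// One consumption loop per inspector, each running in its own thread.
static void consume(sinsp& inspector)
{
    while (true)
    {
        sinsp_evt* evt = nullptr;
        int32_t rc = inspector.next(&evt);
        if (rc == SCAP_TIMEOUT)
        {
            continue;   // no event available yet
        }
        if (rc != SCAP_SUCCESS)
        {
            break;      // EOF or error: this source is done
        }
        // serialize evt and dispatch it to the analysis service here
    }
}

int main()
{
    sinsp syscall_inspector;
    sinsp ssl_plugin_inspector;
    // ... configure and open both inspectors here: one with the syscall
    // source, one with the hypothetical SSL plugin registered ...

    std::thread t1(consume, std::ref(syscall_inspector));
    std::thread t2(consume, std::ref(ssl_plugin_inspector));
    t1.join();
    t2.join();
    return 0;
}
```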

jasondellaluce avatar Jul 18 '22 08:07 jasondellaluce

Things are moving forward! Just opened https://github.com/falcosecurity/falco/pull/2182, which should be the final piece of glue code that uses all the preliminary work to implement the multi-source feature.

jasondellaluce avatar Aug 30 '22 14:08 jasondellaluce

https://github.com/falcosecurity/falco/pull/2182 just got merged! 🥳

Will keep this open until the testing phase is over, just in case some extra changes are needed!

jasondellaluce avatar Sep 12 '22 14:09 jasondellaluce