
Generalized node enrichment

Open achantavy opened this issue 3 months ago • 0 comments

Enriching existing nodes in Cartography (e.g., marking an asset as internet_exposed) requires writing custom analysis jobs with raw Cypher and manual cleanup logic. This is error-prone and hard to test.

We propose a structured, reusable interface - similar in spirit to matchlinks - that allows module authors to declaratively match nodes and set properties, with automatic cleanup and core testing support.

Description

Describe your idea. Please be detailed. If a feature request, please describe the desired behavior, what scenario it enables, and how it would be used.

We should create an interface that lets module authors set properties on nodes of a given label without running raw Cypher queries or writing analysis jobs (which require careful, matching cleanup jobs).

As an analogy, with matchlinks, we create relationships between existing nodes in the graph using a structured, well-tested interface. With this node enrichment feature, we will set node properties for existing nodes in the graph.

Rough idea

We would accept as inputs:

  • node label
  • matcher (the criteria used to find which nodes to change)
    • node property key
    • node property value
  • setter (a mapping of the keys and values to set on the nodes that were matched)
    • node property key
    • node property value

Output/effect:

  1. Match on all nodes of the given label which meet the criteria specified by the matcher
  2. Set their attributes according to the setter

The setter could be used to perform data enrichment. For example, we could define an internet_exposed field in the setter that marks a node's internet_exposed attribute as True.
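A minimal sketch of what this could look like, assuming a dataclass-based spec similar in spirit to matchlinks. None of these names (NodeMatcher, NodeSetter, NodeEnrichmentSpec) exist in cartography today; the label and property names are illustrative only:

```python
# Hypothetical sketch only; these classes do not exist in cartography today.
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class NodeMatcher:
    """Criteria used to find which nodes to enrich."""
    property_key: str
    property_value: Any


@dataclass(frozen=True)
class NodeSetter:
    """Property keys and values to set on every matched node."""
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class NodeEnrichmentSpec:
    node_label: str
    matcher: NodeMatcher
    setter: NodeSetter


# Example: mark every node of a (hypothetical) EC2Instance label that is
# publicly accessible as internet exposed.
internet_exposure = NodeEnrichmentSpec(
    node_label="EC2Instance",
    matcher=NodeMatcher(property_key="publicly_accessible", property_value=True),
    setter=NodeSetter(properties={"internet_exposed": True}),
)
```

A core entry point (hypothetically something like `enrich_nodes(neo4j_session, spec, update_tag)`) would then translate the spec into a parameterized MATCH ... SET query, stamp the touched nodes with the update tag, and register the corresponding cleanup.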

Motivation

Why is this feature needed? What problem does it solve or opportunity does it unlock?

Setting node properties based on other conditions is currently done with analysis jobs. Analysis jobs let the author run arbitrary Cypher queries, which makes it (1) tricky to ensure proper cleanup and (2) difficult to test.

An example of (1): if a given asset is no longer internet exposed, the query author must write something like `MATCH (n:SomeLabel)--(r:SubResource) WHERE n.lastupdated <> $UPDATE_TAG REMOVE n.internet_exposed`. This is cumbersome and error-prone.
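For illustration, here is roughly what that hand-written cleanup looks like when run through the neo4j Python driver. The label, relationship, and property names are the hypothetical ones from the query above, and `neo4j_session` and `update_tag` are assumed to already be in scope:

```python
# Hand-written cleanup an analysis-job author must remember to ship today:
# strip the enrichment property from nodes that were not touched in the
# current sync run (identified via the update tag).
cleanup_query = """
MATCH (n:SomeLabel)--(r:SubResource)
WHERE n.lastupdated <> $UPDATE_TAG
REMOVE n.internet_exposed
"""
neo4j_session.run(cleanup_query, UPDATE_TAG=update_tag)
```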

For (2), getting analysis jobs wrong can result in bad data in the graph (e.g. how exposed_internet_type was a list that grew without bound in the past: https://github.com/cartography-cncf/cartography/issues/386), so one solution is to write tests. But expecting analysis job writers to create integration tests for every single new analysis job is not feasible.

It would be helpful to have core, well-tested functionality in cartography that sets node attributes based on conditions a module author specifies and cleans them up automatically. The core would be tested once, similar to how cleanup jobs derived from node schema objects are well tested for a few scenarios without needing explicit tests in every intel module.
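As a sketch of what that core could look like (building on the hypothetical NodeEnrichmentSpec above, and following the same lastupdated convention as the manual example), the enrichment query and its cleanup could both be derived from one spec so that authors never write cleanup by hand:

```python
def build_queries(spec: NodeEnrichmentSpec) -> tuple[str, str]:
    """Derive both the enrichment query and its cleanup from a single spec.

    Hypothetical core helper; not part of cartography today.
    """
    set_clause = ", ".join(f"n.{key} = ${key}" for key in spec.setter.properties)
    enrich_query = (
        f"MATCH (n:{spec.node_label}) "
        f"WHERE n.{spec.matcher.property_key} = $matcher_value "
        f"SET {set_clause}, n.lastupdated = $UPDATE_TAG"
    )
    # Cleanup mirrors the manual pattern above: remove the enriched properties
    # from nodes of this label that were not re-matched in the current sync run.
    remove_clause = ", ".join(f"n.{key}" for key in spec.setter.properties)
    cleanup_query = (
        f"MATCH (n:{spec.node_label}) "
        f"WHERE n.lastupdated <> $UPDATE_TAG "
        f"REMOVE {remove_clause}"
    )
    return enrich_query, cleanup_query
```

Because the query generation would live in core, it could be integration-tested once against a handful of scenarios, just like the schema-derived cleanup jobs are today.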

I'd like to make cartography analysis more predictable and more robust. There are cases where raw Cypher is likely better than any abstraction we build, but I'd like to explore ways to let others cleanly and quickly enrich the graph.

Alternatives Considered

List other approaches or ideas considered, and why they were not chosen.

We could continue writing analysis jobs and start expecting module authors to add integration tests to their analysis jobs.

Thoughts?

achantavy · Sep 09 '25