
Use new data model for Kubernetes intel module

ishaanverma opened this issue 8 months ago

Summary

Describe your changes.

Change the Kubernetes intel module to use the cartography data model and introduce a new schema for Kubernetes. This is a potentially breaking change, since it changes the relationships between the nodes and a few of their properties.

The old schema looked like this: [diagram: current schema]

The new schema looks like this (the pink relations show sub-resource relationships): [diagram: proposed schema]

The goal of the new schema is to mirror the Kubernetes object model more closely and to make it easy to add more resource types in the future.

Related issues or links

Include links to relevant issues or other pages.

  • https://github.com/lyft/cartography/issues/...

Checklist

Provide proof that this works (this makes reviews move faster). Please perform one or more of the following:

  • [ ] Update/add unit or integration tests.
  • [ ] Include a screenshot showing what the graph looked like before and after your changes.
  • [ ] Include console log trace showing what happened before and after your changes.

If you are changing a node or relationship:

If you are implementing a new intel module:

ishaanverma avatar Apr 18 '25 20:04 ishaanverma

Re: the extra_index properties on the nodes. Even though the node is connected to another node that carries the same value, the extra index helps with queries like:

MATCH (n:KubernetesPod {cluster_name: 'xyz'}) RETURN n
MATCH (n:KubernetesPod {namespace: 'abc'}) RETURN n

I also think the extra index should help with ingestion for cases like:

@dataclass(frozen=True)
# (:KubernetesContainer)<-[:CONTAINS]-(:KubernetesPod)
class KubernetesContainerToKubernetesPodRel(CartographyRelSchema):
    target_node_label: str = "KubernetesPod"
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher(
        {
            "cluster_name": PropertyRef("CLUSTER_NAME", set_in_kwargs=True),
            "namespace": PropertyRef("namespace"),
            "id": PropertyRef("pod_id"),
        }
    )
    direction: LinkDirection = LinkDirection.INWARD
    rel_label: str = "CONTAINS"
    properties: KubernetesContainerToKubernetesPodRelProperties = (
        KubernetesContainerToKubernetesPodRelProperties()
    )

where the target node matcher relies on matching the namespace property of a node.
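
For context, this is roughly how those extra-indexed properties could be declared on the pod node so that cluster_name and namespace get their own indexes. A minimal sketch based on cartography's data model, not the PR's actual pod schema; the import paths and field names here are assumptions:

from dataclasses import dataclass

from cartography.models.core.common import PropertyRef
from cartography.models.core.nodes import CartographyNodeProperties


@dataclass(frozen=True)
class KubernetesPodNodeProperties(CartographyNodeProperties):
    id: PropertyRef = PropertyRef("id")
    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)
    # extra_index=True asks cartography to create an index on these fields,
    # which is what makes lookups by cluster_name or namespace cheap.
    cluster_name: PropertyRef = PropertyRef("CLUSTER_NAME", set_in_kwargs=True, extra_index=True)
    namespace: PropertyRef = PropertyRef("namespace", extra_index=True)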

ishaanverma avatar Jun 25 '25 07:06 ishaanverma

The cleanup jobs are still something I need to look into. In some cases they don't seem to work correctly... maybe I'm missing something.

ishaanverma avatar Jun 25 '25 07:06 ishaanverma

Had a chance to look at the scoped_cleanup flag. In this case, setting scoped_cleanup to False will not trigger a cleanup because of this condition in cartography/jobs/cleanupbuilder.py -> build_cleanup_queries:

if (
    not node_schema.sub_resource_relationship
    and not node_schema.other_relationships
):
    return []

The KubernetesClusterSchema does not have any relationships defined, so the automatic cleanup job just returns an empty list of queries.
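
For illustration, a node schema with no relationships at all, i.e. the shape that hits the early return above, would look roughly like this (a sketch with assumed imports and property names, not the PR's actual class):

from dataclasses import dataclass

from cartography.models.core.common import PropertyRef
from cartography.models.core.nodes import CartographyNodeProperties, CartographyNodeSchema


@dataclass(frozen=True)
class KubernetesClusterNodeProperties(CartographyNodeProperties):
    id: PropertyRef = PropertyRef("id")
    name: PropertyRef = PropertyRef("name", extra_index=True)
    lastupdated: PropertyRef = PropertyRef("lastupdated", set_in_kwargs=True)


@dataclass(frozen=True)
class KubernetesClusterSchema(CartographyNodeSchema):
    label: str = "KubernetesCluster"
    properties: KubernetesClusterNodeProperties = KubernetesClusterNodeProperties()
    # No sub_resource_relationship and no other_relationships defined, so
    # build_cleanup_queries() returns an empty list for this schema.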

ishaanverma avatar Jun 27 '25 05:06 ishaanverma

Also wanted to point out another scenario that took me a while to understand.

Let's say you have two clusters, A and B. If you run the cartography intel job only on cluster A first and then run it only on cluster B, the second run will not clean up any nodes in cluster A, because the cleanup queries rely on a CLUSTER_ID parameter. For example, the cleanup queries generated by the cleanup job will be:

MATCH (n:KubernetesNamespace)<-[s:RESOURCE]-(:KubernetesCluster{id: $CLUSTER_ID})
WHERE n.lastupdated <> $UPDATE_TAG
WITH n LIMIT $LIMIT_SIZE
DETACH DELETE n;

MATCH (n:KubernetesNamespace)<-[s:RESOURCE]-(:KubernetesCluster{id: $CLUSTER_ID})
WHERE s.lastupdated <> $UPDATE_TAG
WITH s LIMIT $LIMIT_SIZE
DELETE s;

So the cleanup is scoped to the cluster currently being ingested, which makes sense: you can run multiple cluster ingestion jobs in parallel without them interfering with each other.
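
For reference, the $CLUSTER_ID scoping comes from the sub-resource relationship declared on the namespace schema. A rough sketch in the same style as the rel above (class names like KubernetesNamespaceToKubernetesClusterRelProperties are assumed for illustration, not necessarily the PR's exact code):

@dataclass(frozen=True)
# (:KubernetesNamespace)<-[:RESOURCE]-(:KubernetesCluster)
class KubernetesNamespaceToKubernetesClusterRel(CartographyRelSchema):
    target_node_label: str = "KubernetesCluster"
    target_node_matcher: TargetNodeMatcher = make_target_node_matcher(
        # CLUSTER_ID is supplied as a job parameter (set_in_kwargs=True), which is
        # why the generated cleanup queries only touch the cluster being ingested.
        {"id": PropertyRef("CLUSTER_ID", set_in_kwargs=True)}
    )
    direction: LinkDirection = LinkDirection.INWARD
    rel_label: str = "RESOURCE"
    properties: KubernetesNamespaceToKubernetesClusterRelProperties = (
        KubernetesNamespaceToKubernetesClusterRelProperties()
    )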

ishaanverma avatar Jun 27 '25 06:06 ishaanverma

Regarding the cleanup, you're right, my bad; I didn't check whether your node had any relationships.

As for the extra index, I usually go with a query like (Pod)<-[:HAS]-(Namespace {id: x}), but if you prefer the solution with the extra index, I’m totally fine with that.
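
Spelled out as full queries, the two options are roughly the following (using HAS as an example relationship label; the actual label in the PR may differ):

MATCH (p:KubernetesPod)<-[:HAS]-(:KubernetesNamespace {id: $namespace_id}) RETURN p
MATCH (p:KubernetesPod {namespace: $namespace_name}) RETURN p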

jychp avatar Jun 27 '25 06:06 jychp