cartography Introduce cartography SyncMetadata node to surface data freshness

Description:


We would like to enable cartography platform owners to quickly know how fresh the data in their graph is. Each node already has a lastupdated field, but we currently have no mechanism that tells us whether an entire sync job has finished. As discussed at the cartography open source meetings, this could be done by introducing a "SyncMetadata" node (https://lyftoss.slack.com/archives/CTZUQL0KX/p1643307663254809?thread_ts=1643307352.391599&cid=CTZUQL0KX).

Plan

One proposal for SyncMetadata nodes could look like this:

Field	Type	Description
lastcompleted	datetime	The update tag of the last successful run
resource_type	str	The name of the sync, e.g. aws.s3, aws.ec2.instance, github, etc.
grouping_id (I'm bad at names, please come up with something else)	str	E.g. the AWS account ID, the GitHub organization ID, etc.
id	str	Concatenation of the sync name and the sub-resource ID
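
To make the proposed fields concrete, here is a minimal Python sketch of how one SyncMetadata record and its MERGE statement might be built. The function names (sync_metadata_id, build_sync_metadata_merge), the "_" separator in the id, and the parameter names are assumptions made for illustration; none of them are cartography's actual API. The query uses the $name parameter syntax of newer Neo4j versions.

```python
from typing import Any, Dict, Tuple


def sync_metadata_id(resource_type: str, grouping_id: str) -> str:
    """Concatenate the sync name and the sub-resource ID.

    The "_" separator is an assumption; the proposal does not specify one.
    """
    return f"{resource_type}_{grouping_id}"


def build_sync_metadata_merge(
    resource_type: str,
    grouping_id: str,
    update_tag: int,
) -> Tuple[str, Dict[str, Any]]:
    """Return a Cypher MERGE statement and its parameters (hypothetical helper)."""
    query = (
        "MERGE (sm:SyncMetadata {id: $id}) "
        "SET sm.resource_type = $resource_type, "
        "sm.grouping_id = $grouping_id, "
        "sm.lastcompleted = $update_tag"
    )
    params = {
        "id": sync_metadata_id(resource_type, grouping_id),
        "resource_type": resource_type,
        "grouping_id": grouping_id,
        "update_tag": update_tag,
    }
    return query, params


if __name__ == "__main__":
    query, params = build_sync_metadata_merge("aws.s3", "1234", 1644019200)
    print(query)
    print(params)
```

The idea would be to run such a statement only after a sync for the given resource type and grouping has finished successfully, so that lastcompleted never advances for a partial run.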

Intended behavior

I don't know how to explain this generically, so I will use an example. Ideally we could run a query like

MATCH (sm:SyncMetadata)
return sm.lastcompleted, sm.resource_type, sm.grouping_id

and get back

lastcompleted	resource_type	grouping_id
2022-02-05	aws.s3	accountid=1234
2022-02-06	aws.ec2	accountid=1234
2022-02-06	aws.s3	accountid=5678
2022-02-07	aws.ec2	accountid=5678
2022-02-08	github	orgid=myorg

This tells us the completion times of specific AWS syncs and the owning account that was synced, as well as the completion time of the GitHub sync and the owning organization.
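
As a rough illustration of how a platform owner might use this result, the sketch below flags syncs whose last completion is older than a threshold. It assumes the rows have already been fetched as dictionaries and that lastcompleted is an ISO date string as in the example table; a real update tag may well be an integer timestamp instead. The function name stale_syncs is hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

# Rows copied from the example result above.
ROWS: List[Dict[str, str]] = [
    {"lastcompleted": "2022-02-05", "resource_type": "aws.s3", "grouping_id": "accountid=1234"},
    {"lastcompleted": "2022-02-06", "resource_type": "aws.ec2", "grouping_id": "accountid=1234"},
    {"lastcompleted": "2022-02-06", "resource_type": "aws.s3", "grouping_id": "accountid=5678"},
    {"lastcompleted": "2022-02-07", "resource_type": "aws.ec2", "grouping_id": "accountid=5678"},
    {"lastcompleted": "2022-02-08", "resource_type": "github", "grouping_id": "orgid=myorg"},
]


def stale_syncs(
    rows: List[Dict[str, str]],
    today: date,
    max_age: timedelta,
) -> List[Dict[str, str]]:
    """Return the rows whose last completed sync is older than max_age."""
    return [
        row
        for row in rows
        if today - date.fromisoformat(row["lastcompleted"]) > max_age
    ]


if __name__ == "__main__":
    for row in stale_syncs(ROWS, today=date(2022, 2, 8), max_age=timedelta(days=2)):
        print(row["resource_type"], row["grouping_id"], row["lastcompleted"])
```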

There are obviously many ways to do this, and this schema is only one of them. I am open to ideas and am opening this issue to start a discussion.

Feb 01 '22 06:02 achantavy

@ryan-lane What do you think? Mind sharing your own schema design for SyncMetadata? :-)

Feb 01 '22 06:02 achantavy

cc: @ramonpetgrave64

Feb 01 '22 06:02 achantavy

One thing we should keep in mind is designing for multiple accounts/organisations. In setups with multiple orgs of the same type (e.g., AWS or GCP), it is important to be able to distinguish between them (they are usually ingested by different jobs).

Feb 01 '22 11:02 marco-lancini

One thing we should keep in mind is designing for multiple accounts/organisations.

100%; see the "grouping_id" column above - please come up with a better name than that 😅

Feb 01 '22 17:02 achantavy

Somehow I completely glossed over the fact that Ryan shared his analysis job already (d'oh).

It's here: https://lyftoss.slack.com/archives/CTZUQL0KX/p1643307663254809?thread_ts=1643307352.391599&cid=CTZUQL0KX

{
  "statements": [
    {
      "query": "MATCH (c:KubernetesCluster) MERGE (n:SyncMetadata{id: 'k8s-' + c.name}) SET n.type = 'k8s', n.resource_lastupdated = c.lastupdated, n.name = c.name, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'k8s' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (a:AWSAccount) MERGE (n:SyncMetadata{id: 'aws-' + a.id}) SET n.type = 'aws', n.resource_lastupdated = a.lastupdated, n.name = a.name, n.account_id = a.id, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'aws' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (g:GitHubOrganization) MERGE (n:SyncMetadata{id: 'github-' + g.username}) SET n.type = 'github', n.resource_lastupdated = g.lastupdated, n.name = g.username, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'github' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (p:PagerDutyTeam) WITH p LIMIT 1 MERGE (n:SyncMetadata{id: 'pagerduty'}) SET n.type = 'pagerduty', n.resource_lastupdated = p.lastupdated, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'pagerduty' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (o:OktaOrganization) MERGE (n:SyncMetadata{id: 'okta-' + o.id}) SET n.type = 'okta', n.resource_lastupdated = o.lastupdated, n.organization_id = o.id, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'okta' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    }
  ],
  "name": "Keep track of when resources were last updated"
}
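
The job above repeats one pattern per data source: an upsert statement followed by a cleanup statement. As a sketch only, the Python below generates that pair from a small table of sources. The names SOURCES, build_statements, and build_job are hypothetical, and this simplified version leaves out the per-source extras (n.name, n.account_id, n.organization_id) and the single-node PagerDuty case.

```python
import json
from typing import Any, Dict, List, Tuple

# (sync type, node label, property used in the id), taken from the job above.
SOURCES: List[Tuple[str, str, str]] = [
    ("k8s", "KubernetesCluster", "name"),
    ("aws", "AWSAccount", "id"),
    ("github", "GitHubOrganization", "username"),
    ("okta", "OktaOrganization", "id"),
]


def build_statements(sync_type: str, label: str, key: str) -> List[Dict[str, Any]]:
    """Build the upsert and cleanup statements for one data source."""
    upsert = (
        f"MATCH (r:{label}) MERGE (n:SyncMetadata{{id: '{sync_type}-' + r.{key}}}) "
        f"SET n.type = '{sync_type}', n.resource_lastupdated = r.lastupdated, "
        "n.lastupdated = {UPDATE_TAG}"
    )
    # Remove SyncMetadata nodes of this type that were not touched in this run.
    cleanup = (
        f"MATCH (n:SyncMetadata) WHERE n.type = '{sync_type}' "
        "AND n.lastupdated <> {UPDATE_TAG} DELETE (n)"
    )
    return [
        {"query": upsert, "iterative": False},
        {"query": cleanup, "iterative": False},
    ]


def build_job() -> Dict[str, Any]:
    """Assemble the full analysis job as a dictionary."""
    statements: List[Dict[str, Any]] = []
    for sync_type, label, key in SOURCES:
        statements.extend(build_statements(sync_type, label, key))
    return {
        "statements": statements,
        "name": "Keep track of when resources were last updated",
    }


if __name__ == "__main__":
    print(json.dumps(build_job(), indent=2))
```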

Will think more on this a bit.

Feb 01 '22 17:02 achantavy

I'm not totally tied to my schema. It's very similar to what you're proposing, though.

Feb 01 '22 17:02 ryan-lane

In #763, I'm trying a model where we can invoke a utility function update_module_sync_metadata_node to create the SyncMetadata nodes we're interested in. The schema is very similar to what @achantavy has, but I'm also very open to suggestions.

Feb 10 '22 21:02 ramonpetgrave64

I realize the new schema is not discoverable from https://lyft.github.io/cartography/usage/schema.html. I'll have to fix that and find a way to draft these documentation updates.

Feb 14 '22 17:02 ramonpetgrave64

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

Apr 17 '22 04:04 stale[bot]

implemented already

Jul 06 '23 18:07 ramonpetgrave64
