Introduce cartography SyncMetadata node to surface data freshness

Open achantavy opened this issue 3 years ago • 9 comments

Description:

Describe your idea. Please be detailed. If a feature request, please describe the desired behavior, what scenario it enables, and how it would be used.

We would like to enable cartography platform owners to quickly know how fresh the data in their graph is. Each node already has a lastupdated field, but we currently do not have a mechanism that tells us whether an entire sync job has finished. As discussed at the cartography open source meetings, this could be possible by introducing a "SyncMetadata" node (https://lyftoss.slack.com/archives/CTZUQL0KX/p1643307663254809?thread_ts=1643307352.391599&cid=CTZUQL0KX).

Plan

One proposal for SyncMetadata nodes could look like this:

| Field | Type | Description |
|---|---|---|
| lastcompleted | datetime | The update tag of the last successful run |
| resource_type | str | The name of the sync, e.g. aws.s3, aws.ec2.instance, github, etc. |
| grouping_id | str | e.g. the AWS account ID, or the GitHub organization ID, etc. (I'm bad at names, please come up with something else) |
| id | str | Concatenation of the sync name and the sub-resource ID |
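
To make the proposed fields concrete, here is a minimal Python sketch of one such record. It is only an illustration of the table above, not cartography code: the `SyncMetadata` class is hypothetical, and the `_` separator used to build `id` is an assumption, since the proposal only says "concatenation".

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SyncMetadata:
    """Hypothetical in-memory mirror of the proposed SyncMetadata node."""

    resource_type: str       # name of the sync, e.g. "aws.s3"
    grouping_id: str         # e.g. "accountid=1234"
    lastcompleted: datetime  # when the last successful run completed

    @property
    def id(self) -> str:
        # Concatenation of sync name and sub-resource ID.
        # The "_" separator is an assumption made for this sketch.
        return f"{self.resource_type}_{self.grouping_id}"


node = SyncMetadata(
    resource_type="aws.s3",
    grouping_id="accountid=1234",
    lastcompleted=datetime(2022, 2, 5, tzinfo=timezone.utc),
)
print(node.id)
```

Deriving `id` from the other two fields keeps one node per (sync, account or org) pair, which is what lets a later run overwrite the previous completion time instead of creating a duplicate.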

Intended behavior

I don't know how to explain this in a generic way so I will use an example. Ideally we could run a query like

MATCH (sm:SyncMetadata)
return sm.lastcompleted, sm.resource_type, sm.grouping_id

and get back

| lastcompleted | resource_type | grouping_id |
|---|---|---|
| 2022-02-05 | aws.s3 | accountid=1234 |
| 2022-02-06 | aws.ec2 | accountid=1234 |
| 2022-02-06 | aws.s3 | accountid=5678 |
| 2022-02-07 | aws.ec2 | accountid=5678 |
| 2022-02-08 | github | orgid=myorg |

This tells us the completion times of specific AWS syncs and the owning account that was synced, as well as the completion time of the GitHub sync and the owning organization.
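
As a rough illustration of how a platform owner might consume those rows, the sketch below flags syncs whose last completion is older than a threshold. The rows are the example results above hard-coded as tuples; `stale_syncs`, the one-day threshold, and the chosen "today" are assumptions for the example, not part of the proposal.

```python
from datetime import date, timedelta

# Example query results from above: (lastcompleted, resource_type, grouping_id)
rows = [
    (date(2022, 2, 5), "aws.s3", "accountid=1234"),
    (date(2022, 2, 6), "aws.ec2", "accountid=1234"),
    (date(2022, 2, 6), "aws.s3", "accountid=5678"),
    (date(2022, 2, 7), "aws.ec2", "accountid=5678"),
    (date(2022, 2, 8), "github", "orgid=myorg"),
]


def stale_syncs(rows, today, max_age):
    """Return (resource_type, grouping_id) pairs whose last completed
    sync is strictly older than max_age relative to today."""
    return [
        (resource_type, grouping_id)
        for completed, resource_type, grouping_id in rows
        if today - completed > max_age
    ]


# Hypothetical check: anything not completed within the last day is stale.
stale = stale_syncs(rows, today=date(2022, 2, 8), max_age=timedelta(days=1))
for resource_type, grouping_id in stale:
    print(f"stale: {resource_type} ({grouping_id})")
```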

Obviously there are many ways to do this, and this schema design is only one of them. I am open to ideas and am opening this issue to start a discussion.

achantavy avatar Feb 01 '22 06:02 achantavy

@ryan-lane What do you think? Mind sharing your own schema design for SyncMetadata? :-)

achantavy avatar Feb 01 '22 06:02 achantavy

cc: @ramonpetgrave64

achantavy avatar Feb 01 '22 06:02 achantavy

One thing we should keep in mind is designing for multiple accounts/organisations. In setups with multiple Orgs of the same type (e.g., AWS or GCP), it is important to be able to distinguish between them (they are usually ingested by different jobs).

marco-lancini avatar Feb 01 '22 11:02 marco-lancini

> One thing we should keep in mind is designing for multiple accounts/organisations.

100%; see the "grouping_id" column above - please come up with a better name than that 😅

achantavy avatar Feb 01 '22 17:02 achantavy

Somehow I completely glossed over the fact that Ryan shared his analysis job already (d'oh).

It's here: https://lyftoss.slack.com/archives/CTZUQL0KX/p1643307663254809?thread_ts=1643307352.391599&cid=CTZUQL0KX

{
  "statements": [
    {
      "query": "MATCH (c:KubernetesCluster) MERGE (n:SyncMetadata{id: 'k8s-' + c.name}) SET n.type = 'k8s', n.resource_lastupdated = c.lastupdated, n.name = c.name, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'k8s' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (a:AWSAccount) MERGE (n:SyncMetadata{id: 'aws-' + a.id}) SET n.type = 'aws', n.resource_lastupdated = a.lastupdated, n.name = a.name, n.account_id = a.id, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'aws' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (g:GitHubOrganization) MERGE (n:SyncMetadata{id: 'github-' + g.username}) SET n.type = 'github', n.resource_lastupdated = g.lastupdated, n.name = g.username, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'github' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (p:PagerDutyTeam) WITH p LIMIT 1 MERGE (n:SyncMetadata{id: 'pagerduty'}) SET n.type = 'pagerduty', n.resource_lastupdated = p.lastupdated, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'pagerduty' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (o:OktaOrganization) MERGE (n:SyncMetadata{id: 'okta-' + o.id}) SET n.type = 'okta', n.resource_lastupdated = o.lastupdated, n.organization_id = o.id, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'okta' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    }
  ],
  "name": "Keep track of when resources were last updated"
}
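
Each sync type in the job above follows the same two-step pattern: MERGE a SyncMetadata node per parent resource stamped with the current update tag, then DELETE nodes of that type whose tag was not refreshed. A minimal Python sketch of that pattern, assuming a hypothetical helper `sync_metadata_statements` (it is not part of cartography, and it omits the per-type extras such as `n.name` or `n.account_id`):

```python
import json


def sync_metadata_statements(label, sync_type, id_property):
    """Build the merge + cleanup statement pair for one sync type.

    Hypothetical helper for illustration only; it mirrors the shape of
    the hand-written statements above.
    """
    merge = (
        f"MATCH (r:{label}) "
        f"MERGE (n:SyncMetadata{{id: '{sync_type}-' + r.{id_property}}}) "
        f"SET n.type = '{sync_type}', "
        f"n.resource_lastupdated = r.lastupdated, "
        f"n.lastupdated = {{UPDATE_TAG}}"
    )
    # Remove nodes of this type that the current run did not touch.
    cleanup = (
        f"MATCH (n:SyncMetadata) WHERE n.type = '{sync_type}' "
        f"AND n.lastupdated <> {{UPDATE_TAG}} DELETE (n)"
    )
    return [
        {"query": merge, "iterative": False},
        {"query": cleanup, "iterative": False},
    ]


job = {
    "statements": (
        sync_metadata_statements("KubernetesCluster", "k8s", "name")
        + sync_metadata_statements("AWSAccount", "aws", "id")
    ),
    "name": "Keep track of when resources were last updated",
}
print(json.dumps(job, indent=2))
```

Scoping the cleanup by `n.type` means a job that only covers some sync types does not delete the metadata nodes of the others.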

Will think more on this a bit.

achantavy avatar Feb 01 '22 17:02 achantavy

I'm not totally tied to my schema. It's very similar to what you're proposing, though.

ryan-lane avatar Feb 01 '22 17:02 ryan-lane

In #763, I'm trying a model where we can invoke a utility function update_module_sync_metadata_node to create the SyncMetadata nodes we're interested in. The schema is very similar to what @achantavy has, but I'm also very open to suggestions.

ramonpetgrave64 avatar Feb 10 '22 21:02 ramonpetgrave64

I realize the new schema is not discoverable from https://lyft.github.io/cartography/usage/schema.html. I'll have to fix that and find a way to draft these documentation updates.

ramonpetgrave64 avatar Feb 14 '22 17:02 ramonpetgrave64

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Apr 17 '22 04:04 stale[bot]

implemented already

ramonpetgrave64 avatar Jul 06 '23 18:07 ramonpetgrave64