cartography
Introduce cartography SyncMetadata node to surface data freshness
Description:
We would like to enable cartography platform owners to quickly know how fresh the data in their graph is. Each node already has a lastupdated field, but we currently have no mechanism that tells us whether an entire sync job has finished. As discussed at the cartography open source meetings, this could be addressed by introducing a "SyncMetadata" node (https://lyftoss.slack.com/archives/CTZUQL0KX/p1643307663254809?thread_ts=1643307352.391599&cid=CTZUQL0KX ).
Plan
One proposal for SyncMetadata nodes could look like this:
Field | Type | Description |
---|---|---|
lastcompleted | datetime | The update tag of the last successful run |
resource_type | str | The name of the sync, e.g. aws.s3, aws.ec2.instance, github |
grouping_id (I'm bad at names, please come up with something else) | str | The scope that was synced, e.g. the AWS account ID or the GitHub organization ID |
id | str | Concatenation of the sync name and the grouping ID |
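To make the proposed fields concrete, here is a minimal Python sketch of one such record. The class name and the ":" separator used to build the id are illustrative assumptions, not something cartography defines.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SyncMetadata:
    """Hypothetical in-memory mirror of the proposed SyncMetadata node."""

    lastcompleted: datetime  # update tag of the last successful run
    resource_type: str       # name of the sync, e.g. "aws.s3"
    grouping_id: str         # e.g. an AWS account ID or GitHub organization ID

    @property
    def id(self) -> str:
        # Assumed format: sync name and grouping ID joined with ":".
        return f"{self.resource_type}:{self.grouping_id}"


record = SyncMetadata(datetime(2022, 2, 5), "aws.s3", "1234")
print(record.id)
```

Deriving the id from the other two fields keeps one node per (sync, group) pair, which is what lets a later run overwrite the same node instead of creating a new one.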
Intended behavior
I don't know how to explain this in a generic way so I will use an example. Ideally we could run a query like
MATCH (sm:SyncMetadata)
return sm.lastcompleted, sm.resource_type, sm.grouping_id
and get back
lastcompleted | resource_type | grouping_id |
---|---|---|
2022-02-05 | aws.s3 | accountid=1234 |
2022-02-06 | aws.ec2 | accountid=1234 |
2022-02-06 | aws.s3 | accountid=5678 |
2022-02-07 | aws.ec2 | accountid=5678 |
2022-02-08 | github | orgid=myorg |
This tells us the completion times of specific AWS syncs and the owning account that was synced, as well as the completion time of the GitHub sync and the owning organization.
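As a rough illustration of how a platform owner might use such a result, the sketch below flags syncs that have not completed recently. The rows are copied from the example table; the function name and the two-day threshold are assumptions made up for this example.

```python
from datetime import date, timedelta

# Rows from the example result: (lastcompleted, resource_type, grouping_id).
rows = [
    (date(2022, 2, 5), "aws.s3", "accountid=1234"),
    (date(2022, 2, 6), "aws.ec2", "accountid=1234"),
    (date(2022, 2, 6), "aws.s3", "accountid=5678"),
    (date(2022, 2, 7), "aws.ec2", "accountid=5678"),
    (date(2022, 2, 8), "github", "orgid=myorg"),
]


def stale_syncs(rows, today, max_age):
    """Return (resource_type, grouping_id) pairs last completed more than max_age ago."""
    return [
        (resource_type, grouping_id)
        for completed, resource_type, grouping_id in rows
        if today - completed > max_age
    ]


# Hypothetical policy: flag anything that has not completed within the last two days.
stale = stale_syncs(rows, today=date(2022, 2, 8), max_age=timedelta(days=2))
print(stale)
```

With these inputs, only the aws.s3 sync for account 1234 is more than two days old, so it is the only entry flagged.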
There are many ways to do this, and this schema is only one of them. I am open to ideas and am opening this issue to start a discussion.
@ryan-lane What do you think? Mind sharing your own schema design for SyncMetadata? :-)
cc: @ramonpetgrave64
One thing we should keep in mind is designing for multiple accounts/organisations. In setups with multiple orgs of the same type (e.g., AWS or GCP), it is important to be able to distinguish between them, since they are usually ingested by different jobs.
> One thing we should keep in mind is designing for multiple accounts/organisations.
100%; see the "grouping_id" column above - please come up with a better name than that 😅
Somehow I completely glossed over the fact that Ryan shared his analysis job already (d'oh).
It's here: https://lyftoss.slack.com/archives/CTZUQL0KX/p1643307663254809?thread_ts=1643307352.391599&cid=CTZUQL0KX
{
  "statements": [
    {
      "query": "MATCH (c:KubernetesCluster) MERGE (n:SyncMetadata{id: 'k8s-' + c.name}) SET n.type = 'k8s', n.resource_lastupdated = c.lastupdated, n.name = c.name, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'k8s' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (a:AWSAccount) MERGE (n:SyncMetadata{id: 'aws-' + a.id}) SET n.type = 'aws', n.resource_lastupdated = a.lastupdated, n.name = a.name, n.account_id = a.id, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'aws' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (g:GitHubOrganization) MERGE (n:SyncMetadata{id: 'github-' + g.username}) SET n.type = 'github', n.resource_lastupdated = g.lastupdated, n.name = g.username, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'github' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (p:PagerDutyTeam) WITH p LIMIT 1 MERGE (n:SyncMetadata{id: 'pagerduty'}) SET n.type = 'pagerduty', n.resource_lastupdated = p.lastupdated, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    },
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'pagerduty' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    },
    {
      "query": "MATCH (o:OktaOrganization) MERGE (n:SyncMetadata{id: 'okta-' + o.id}) SET n.type = 'okta', n.resource_lastupdated = o.lastupdated, n.organization_id = o.id, n.lastupdated = {UPDATE_TAG}",
      "iterative": false
    }
    ,
    {
      "query": "MATCH (n:SyncMetadata) WHERE n.type = 'okta' AND n.lastupdated <> {UPDATE_TAG} DELETE (n)",
      "iterative": false
    }
  ],
  "name": "Keep track of when resources were last updated"
}
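Each sync type in this job follows the same two-step pattern: a MERGE that stamps the SyncMetadata node with the current update tag, then a cleanup that deletes nodes of that type still carrying an older tag. The cleanup statements differ only in the type string, so a minimal Python sketch could generate them as below. The helper name is hypothetical; the query text is copied from the job.

```python
import json


def cleanup_statement(sync_type: str) -> dict:
    """Build the stale-node cleanup statement for one sync type (hypothetical helper)."""
    query = (
        "MATCH (n:SyncMetadata) "
        # Doubled braces emit the literal {UPDATE_TAG} placeholder used by the job.
        f"WHERE n.type = '{sync_type}' AND n.lastupdated <> {{UPDATE_TAG}} "
        "DELETE (n)"
    )
    return {"query": query, "iterative": False}


statements = [cleanup_statement(t) for t in ("k8s", "aws", "github", "pagerduty", "okta")]
print(json.dumps(statements[0], indent=2))
```

The MERGE statements are left out of the sketch because each one reads different properties from its source node (for example c.name for clusters and a.id for AWS accounts), so they do not reduce to a single template as cleanly.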
Will think more on this a bit.
I'm not totally tied to my schema. It's very similar to what you're proposing, though.
In #763, I'm trying a model where we can invoke a utility function, update_module_sync_metadata_node, to create the SyncMetadata nodes we're interested in. The schema is very similar to what @achantavy has, but I'm also very open to suggestions.
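For readers unfamiliar with the idea, the sketch below shows roughly what a helper of this kind might do, using a plain dict in place of the graph. It is not the code from #763: the function name, parameters, and id format are all assumptions chosen for illustration.

```python
from typing import Dict


def upsert_sync_metadata(
    store: Dict[str, dict],
    resource_type: str,
    grouping_id: str,
    update_tag: int,
) -> dict:
    """Create or refresh one sync-metadata record, mimicking a MERGE keyed on id."""
    node_id = f"{resource_type}:{grouping_id}"  # assumed id format
    node = store.setdefault(
        node_id,
        {"id": node_id, "resource_type": resource_type, "grouping_id": grouping_id},
    )
    # Stamp the record only once the sync for this group has finished.
    node["lastcompleted"] = update_tag
    return node


graph: Dict[str, dict] = {}
upsert_sync_metadata(graph, "aws.s3", "1234", update_tag=1644019200)
upsert_sync_metadata(graph, "aws.s3", "1234", update_tag=1644105600)
print(len(graph), graph["aws.s3:1234"]["lastcompleted"])
```

Calling the helper at the end of each module's sync, rather than from a separate analysis job, would tie the timestamp to that module actually finishing.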
I realize the new schema is not discoverable from https://lyft.github.io/cartography/usage/schema.html . I'll have to fix that and find a way to draft these documentation updates.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
implemented already