Add a Sync Metadata Table
Is your feature request related to a problem? Please describe.
Presently, to observe a sync's operational health, I have to parse the logs. At scale across multiple workers, I end up building out a dashboard that turns that log parsing into data tables and charts.
Describe the solution you'd like
The summary data that already appears in the logs for each resource tuple, such as (account, region, resource), could be written to the destination as a SyncMetadata artifact.
Recording sync starts and stops would also be valuable.
Describe alternatives you've considered
Presently I parse the logs for an AWS sync as below.
parse @message "table sync finished client=*:* errors=* module=* resources=* table=*" as Account, Region, Errors, tableModule, Total, tableName
It provides a reasonable basis for a data table like the one below.
| timestamp | Account | Region | tableName | Total | Errors |
|---|---|---|---|---|---|
| v1 | v2 | v3 | v4 | v5 | v6 |
Additional context
I've built some data tables around the errors, such as:
```
parse @message "error=\"operation error *: *, *\" client=*:* module=* table=*" as Service, Action, errorMsg, Account, tableClientRegion, tableModule, tableName
```
And for sync-level errors that cause the process to entirely bail out:
```
parse @message "Error: *" as errorMsg
```
But IMO error lines in the destination are less valuable than the summary output - knowing that errors occurred and that the logs should be examined is enough.
Thanks for reporting this issue 👍
You can reach us via Discord too.
If you enjoy using this project, please consider starring it for support
I think the issue https://github.com/cloudquery/cloudquery/issues/8146 is related
Hi 👋 For those watching this issue, we're working on a guide for parsing and observing CloudQuery logs using Datadog. You can see it here https://cloudquery-r4i7sxw4c-cloudquery.vercel.app/how-to-guides/datadog-observability (PR https://github.com/cloudquery/cloudquery/pull/10285), and we would love your feedback.
I also opened https://github.com/cloudquery/cloudquery/issues/10266 as a result, and we might add more fields to the logs based on use cases so they can be easily queried.
Please let us know if this could work for you to solve the same use case a sync metadata table solves. I would also point out that a sync metadata table might not work when CloudQuery fails to access the destination or encounters an unexpected error during the sync.
Sorry for the late follow-up, the official guide is here https://www.cloudquery.io/how-to-guides/datadog-observability
Hi 👋 Another update on this. We've launched a new preview feature to support OpenTelemetry, see the docs in https://www.cloudquery.io/docs/advanced-topics/monitoring#opentelemetry-preview and https://www.cloudquery.io/docs/reference/source-spec#otel_endpoint-preview.
All the latest sources support this feature.
Also this issue seems to overlap quite a bit with https://github.com/cloudquery/cloudquery/issues/4718, but I'll keep it open for comments.
Please respond with any feedback you have 🚀
Thanks @erezrokah - this OpenTelemetry support seems interesting. Will be taking a look.
To reiterate on the original ask - I do still think there is a lot of value in having CloudQuery submit the stats to the destination. Ergonomically, I dislike the requirement of having to source a separate system to know the stats of a particular run.
In the case of CQ destinations that can support query operations, my ideal state would be to aggregate a stats collection, or cross reference a CQ config id with another table.
In all cases, it would provide a static manifest of jobs that have been run, vs parsing logs on systems that are likely to purge logs after a period of time. I also think there is strong value in having CloudQuery be a source of truth for these stats, vs trusting that we are parsing logs correctly.
In general, there is a lot to like with the system, but this is a pain point.
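To make that cross-reference idea concrete, here's a minimal sketch of the kind of query a sync metadata table would enable, using Python's sqlite3 with entirely hypothetical table and column names (`sync_stats`, `sync_id`, and the schema are illustrative assumptions, not anything CloudQuery ships):

```python
# Illustrative sketch: cross-reference a hypothetical sync stats table with a
# synced data table via a shared sync_id, to answer "how fresh is this data,
# and did its sync error?". All names here are assumptions for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sync_stats (
        sync_id TEXT, table_name TEXT, resources INTEGER,
        errors INTEGER, finished_at TEXT
    );
    CREATE TABLE aws_ec2_instances (
        sync_id TEXT, account TEXT, region TEXT, instance_id TEXT
    );
    INSERT INTO sync_stats VALUES
        ('run-1', 'aws_ec2_instances', 2, 0, '2024-01-01T00:00:00Z');
    INSERT INTO aws_ec2_instances VALUES
        ('run-1', '111111111111', 'us-east-1', 'i-abc'),
        ('run-1', '111111111111', 'us-east-1', 'i-def');
""")

# Join the data table to the stats of the run that produced it.
row = conn.execute("""
    SELECT i.instance_id, s.finished_at, s.errors
    FROM aws_ec2_instances i
    JOIN sync_stats s ON s.sync_id = i.sync_id
    LIMIT 1
""").fetchone()
print(f"{row[0]} synced at {row[1]} with {row[2]} table errors")
conn.close()
```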
Hey @getglad! I'll weigh in here as I was researching the stats/workflow area recently. It's a tricky area because it's starting to overlap with other orchestration tools, so we need to be careful not to reimplement them on one hand (those features tend to grow very fast) while still providing good out-of-the-box capabilities on the other.
How are you running CloudQuery right now? Have you tried things like Airflow, Kestra, or Argo Workflows (if you're using k8s)? That could solve the whole summary thing and kind of take it to the next level with history/UI/logs/duration/failed vs passed runs, and so on. What do you think?
We'd also find a feature like this useful!
Currently we run CloudQuery on AWS Fargate, and we have a little script that runs before each CloudQuery container starts and records metadata about the task to our Destination. It gets us some of the way there, but it's not a great solution; there's a lot of information we want that's not readily accessible, for example which tables were synced, how many resources, and how many errors.
I think the most common use case, at least for us, would be to have CloudQuery metadata automatically written to the Destination so that it can be joined with regular data. For example, for a vital dashboard I might want a field saying "Hey, the data for this dashboard was last updated at X, and Y% of resources had errors when syncing."
The new `cloudquery-sync-summary.json` file gets us a lot closer to where we want to be, but it still requires us to write a lot of custom code to ingest that data into our Destination.
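A minimal sketch of what that custom ingestion can look like, assuming the summary file holds one JSON object per line with fields mirroring the summary columns (both the layout and the field names are assumptions, not a documented schema):

```python
# Minimal sketch: load a CloudQuery sync summary JSON file into a local
# SQLite table so it can be queried. The one-object-per-line layout and the
# field names are assumptions; adjust to whatever your CLI version emits.
import json
import sqlite3

conn = sqlite3.connect("sync_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sync_summaries (
        sync_id TEXT, sync_time TEXT, resources INTEGER,
        source_errors INTEGER, destination_errors INTEGER
    )
""")

with open("cloudquery-sync-summary.json") as f:
    for line in f:
        if not line.strip():
            continue
        s = json.loads(line)
        conn.execute(
            "INSERT INTO sync_summaries VALUES (?, ?, ?, ?, ?)",
            (s.get("sync_id"), s.get("sync_time"), s.get("resources"),
             s.get("source_errors"), s.get("destination_errors")),
        )
conn.commit()
conn.close()
```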
In v5.19.0 we have added support for a new `cloudquery_sync_summaries` table. This new table enables you to store metadata about a sync in the destination you are using. It includes the following columns: `cli_version`, `destination_errors`, `destination_name`, `destination_path`, `destination_version`, `destination_warnings`, `resources`, `source_errors`, `source_name`, `source_path`, `source_version`, `source_warnings`, `sync_id`, `sync_time`. You can enable it by updating your spec to something like this:
```yaml
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  registry: "cloudquery"
  version: "<PLUGIN_VERSION>"
  send_sync_summary: true
```
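As an illustration of what this enables, here's a hedged sketch of querying the new table for the latest run, assuming a destination that supports SQL queries such as PostgreSQL (the DSN below is a placeholder; the column names come from the list above):

```python
# Hedged sketch: read the most recent sync's summary from a PostgreSQL
# destination. The connection string is a placeholder, and this assumes your
# destination can be queried with SQL.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/cloudquery")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT sync_id, sync_time, resources, source_errors, destination_errors
        FROM cloudquery_sync_summaries
        ORDER BY sync_time DESC
        LIMIT 1
    """)
    sync_id, sync_time, resources, src_err, dst_err = cur.fetchone()
    print(f"sync {sync_id} at {sync_time}: {resources} resources, "
          f"{src_err} source errors, {dst_err} destination errors")
conn.close()
```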
A couple of updates regarding this issue:
- Starting from CLI version 5.25.0 and all plugins released in the past week, you can pass the sync command a `--tables-metrics-location` flag that will periodically print a per-table summary to a file. Usage via `cloudquery sync spec.yml --tables-metrics-location metrics.txt`, for example.
- Starting from CLI version 5.25.0 and all plugins released in the past week, we've improved the OpenTelemetry traces and metrics we send. There's an updated guide in https://docs.cloudquery.io/docs/advanced-topics/monitoring, also with a Datadog integration and dashboard we provide, image below ⬇️. The dashboard supports filters by plugin, table and client ID (usually AWS account/region, or GCP project, depending on the plugins you use).