Add a Sync Metadata Table
Is your feature request related to a problem? Please describe.
Presently, to observe a sync's operational health, I have to parse the logs. At scale across multiple workers, I end up building out a dashboard that turns that log parsing into data tables and charts.
Describe the solution you'd like
The summary data that already appears in the logs for each resource tuple, such as (account, region, resource), could be written to the destination as a SyncMetadata artifact.
Recording sync starts and stops would also be valuable.
Describe alternatives you've considered
Presently I parse the logs for an AWS sync as below.
parse @message "table sync finished client=*:* errors=* module=* resources=* table=*" as Account, Region, Errors, tableModule, Total, tableName
It provides a reasonable basis for a data table like the one below.
| timestamp | Account | Region | tableName | Total | Errors |
|---|---|---|---|---|---|
| v1 | v2 | v3 | v4 | v5 | v6 |
Additional context
I've built some data tables around the errors, such as:
```
parse @message "error=\"operation error *: *, *\" client=*:* module=* table=*" as Service, Action, errorMsg, Account, tableClientRegion, tableModule, tableName
```
And for sync-level errors that cause the process to entirely bail out:
```
parse @message "Error: *" as errorMsg
```
But IMO error lines in the destination are less valuable than the summary output - knowing that errors occurred and that the logs should be examined is enough.
Thanks for reporting this issue 👍
You can reach us via Discord too.
If you enjoy using this project, please consider starring it for support
I think the issue https://github.com/cloudquery/cloudquery/issues/8146 is related
Hi 👋 For those watching this issue, we're working on a guide for parsing and observing CloudQuery logs using Datadog. You can see it here https://cloudquery-r4i7sxw4c-cloudquery.vercel.app/how-to-guides/datadog-observability (PR https://github.com/cloudquery/cloudquery/pull/10285), and we would love your feedback.
I also opened https://github.com/cloudquery/cloudquery/issues/10266 as a result, and we might add more fields to the logs based on use cases so they can be easily queried.
Please let us know if this could work for you to solve the same use case a sync metadata table solves. I would also point out that a sync metadata table might not work when CloudQuery fails to access the destination or encounters an unexpected error during the sync.
Sorry for the late follow-up, the official guide is here https://www.cloudquery.io/how-to-guides/datadog-observability
Hi 👋 Another update on this. We've launched a new preview feature to support OpenTelemetry, see the docs in https://www.cloudquery.io/docs/advanced-topics/monitoring#opentelemetry-preview and https://www.cloudquery.io/docs/reference/source-spec#otel_endpoint-preview.
All the latest sources support this feature.
Also this issue seems to overlap quite a bit with https://github.com/cloudquery/cloudquery/issues/4718, but I'll keep it open for comments.
Please respond with any feedback you have 🚀
Thanks @erezrokah - this OpenTelemetry support seems interesting. Will be taking a look.
To reiterate on the original ask - I do still think there is a lot of value in having CloudQuery submit the stats to the destination. Ergonomically, I dislike the requirement of having to source a separate system to know the stats of a particular run.
In the case of CQ destinations that can support query operations, my ideal state would be to aggregate a stats collection, or cross reference a CQ config id with another table.
In all cases, it would provide a static manifest of jobs that have been run, vs parsing logs on systems that are likely to purge logs after a period of time. I also think there is strong value in having CloudQuery be a source of truth for these stats, vs trusting that we are parsing logs correctly.
In general, there is a lot to like with the system, but this is a pain point.
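To make that cross-reference idea concrete, here's a minimal sketch of the kind of query a sync metadata table would enable, using Python's sqlite3 with entirely hypothetical table and column names (`sync_stats`, `sync_id`, and the schema are illustrative assumptions, not anything CloudQuery ships):

```python
# Illustrative sketch: cross-reference a hypothetical sync stats table with a
# synced data table via a shared sync_id, to answer "how fresh is this data,
# and did its sync error?". All names here are assumptions for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sync_stats (
        sync_id TEXT, table_name TEXT, resources INTEGER,
        errors INTEGER, finished_at TEXT
    );
    CREATE TABLE aws_ec2_instances (
        sync_id TEXT, account TEXT, region TEXT, instance_id TEXT
    );
    INSERT INTO sync_stats VALUES
        ('run-1', 'aws_ec2_instances', 2, 0, '2024-01-01T00:00:00Z');
    INSERT INTO aws_ec2_instances VALUES
        ('run-1', '111111111111', 'us-east-1', 'i-abc'),
        ('run-1', '111111111111', 'us-east-1', 'i-def');
""")

# Join the data table to the stats of the run that produced it.
row = conn.execute("""
    SELECT i.instance_id, s.finished_at, s.errors
    FROM aws_ec2_instances i
    JOIN sync_stats s ON s.sync_id = i.sync_id
    LIMIT 1
""").fetchone()
print(f"{row[0]} synced at {row[1]} with {row[2]} table errors")
conn.close()
```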
Hey @getglad! I'll weigh in here as I was researching the stats/workflow area recently. It's a tricky area because it's starting to overlap with other orchestration tools, so we need to be careful not to reimplement them on one hand (those features tend to grow very fast) while still providing good out-of-the-box capabilities on the other.
How are you running CloudQuery right now? Have you tried things like Airflow, Kestra, or Argo Workflows (if you're using k8s)? That could solve the whole summary thing and kind of take it to the next level with history/UI/logs/duration/failed vs passed runs, and so on. What do you think?
We'd also find a feature like this useful!
Currently we run CloudQuery on AWS Fargate, and we have a little script that runs before each CloudQuery container starts and records metadata about the task to our Destination. It gets us some of the way there, but it's not a great solution; there's a lot of information we want that's not readily accessible, for example which tables were synced, how many resources, and how many errors.
I think the most common use case, at least for us, would be to have CloudQuery metadata automatically written to the Destination so that it can be joined with regular data. For example, for a vital dashboard I might want a field saying "Hey, the data for this dashboard was last updated at X, and Y% of resources had errors when syncing."
The new `cloudquery-sync-summary.json` file gets us a lot closer to where we want to be, but it still requires us to write a lot of custom code to ingest that data into our Destination.
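A minimal sketch of what that custom ingestion can look like, assuming the summary file holds one JSON object per line with fields mirroring the summary columns (both the layout and the field names are assumptions, not a documented schema):

```python
# Minimal sketch: load a CloudQuery sync summary JSON file into a local
# SQLite table so it can be queried. The one-object-per-line layout and the
# field names are assumptions; adjust to whatever your CLI version emits.
import json
import sqlite3

conn = sqlite3.connect("sync_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS sync_summaries (
        sync_id TEXT, sync_time TEXT, resources INTEGER,
        source_errors INTEGER, destination_errors INTEGER
    )
""")

with open("cloudquery-sync-summary.json") as f:
    for line in f:
        if not line.strip():
            continue
        s = json.loads(line)
        conn.execute(
            "INSERT INTO sync_summaries VALUES (?, ?, ?, ?, ?)",
            (s.get("sync_id"), s.get("sync_time"), s.get("resources"),
             s.get("source_errors"), s.get("destination_errors")),
        )
conn.commit()
conn.close()
```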
In v5.19.0 we have added support for a new `cloudquery_sync_summaries` table. This new table enables you to store metadata about a sync in the destination you are using. It includes the following columns: `cli_version`, `destination_errors`, `destination_name`, `destination_path`, `destination_version`, `destination_warnings`, `resources`, `source_errors`, `source_name`, `source_path`, `source_version`, `source_warnings`, `sync_id`, `sync_time`. You can enable it by updating your spec to something like this:
```yaml
kind: destination
spec:
  name: "s3"
  path: "cloudquery/s3"
  registry: "cloudquery"
  version: "<PLUGIN_VERSION>"
  send_sync_summary: true
```
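As an illustration of what this enables, here's a hedged sketch of querying the new table for the latest run, assuming a destination that supports SQL queries such as PostgreSQL (the DSN below is a placeholder; the column names come from the list above):

```python
# Hedged sketch: read the most recent sync's summary from a PostgreSQL
# destination. The connection string is a placeholder, and this assumes your
# destination can be queried with SQL.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/cloudquery")  # placeholder DSN
with conn.cursor() as cur:
    cur.execute("""
        SELECT sync_id, sync_time, resources, source_errors, destination_errors
        FROM cloudquery_sync_summaries
        ORDER BY sync_time DESC
        LIMIT 1
    """)
    sync_id, sync_time, resources, src_err, dst_err = cur.fetchone()
    print(f"sync {sync_id} at {sync_time}: {resources} resources, "
          f"{src_err} source errors, {dst_err} destination errors")
conn.close()
```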
A couple of updates regarding this issue:
- Starting from CLI version 5.25.0 and all plugins released in the past week, you can pass the sync command a `--tables-metrics-location` flag that will periodically print a per-table summary to a file. Usage via `cloudquery sync spec.yml --tables-metrics-location metrics.txt`, for example.
- Starting from CLI version 5.25.0 and all plugins released in the past week, we've improved the OpenTelemetry traces and metrics we send. There's an updated guide in https://docs.cloudquery.io/docs/advanced-topics/monitoring, also with a Datadog integration and dashboard we provide, image below ⬇️. The dashboard supports filters by plugin, table and client ID (usually AWS account/region, or GCP project, depending on the plugins you use).