
Data Quality Tolerance?

Open FStephenQuaratiello opened this issue 3 years ago • 3 comments

Hi,

I've been noticing a slight (~1%) discrepancy between the number of records imported to BigQuery with this tool, and the number of requests reported by the Cloudflare GraphQL API for a given time period. For example, the GraphQL API reports 46,532 requests in a given hour, but in BigQuery, there are only 45,736 records with an EdgeStartTimestamp in that hour. A small difference, to be sure, but a noticeable one.

Is this within expectations? And is there a better way to measure the health/quality of data imported by this tool?
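For reference, the gap in the example hour can be worked out directly from the two counts quoted above. This is only a small arithmetic sketch; the helper name `relative_discrepancy` is made up for illustration and is not part of the tool.

```python
def relative_discrepancy(reference_count: int, imported_count: int) -> float:
    """Fraction of the reference count that is missing from the import."""
    return (reference_count - imported_count) / reference_count


# Counts quoted in this issue for a single hour.
graphql_requests = 46_532  # reported by the GraphQL API
bigquery_rows = 45_736     # rows with EdgeStartTimestamp in that hour

missing = graphql_requests - bigquery_rows
ratio = relative_discrepancy(graphql_requests, bigquery_rows)

print(f"missing records: {missing}")
print(f"relative discrepancy: {ratio:.2%}")
```

For these two numbers the difference is 796 records, which is closer to 1.7% than to 1%.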

FStephenQuaratiello • Jan 24 '22 19:01

Hey @FStephenQuaratiello would you mind providing the GraphQL query and the BigQuery SQL query that you ran so we can investigate further?

shagamemnon • Jan 25 '22 22:01

Sure thing:

BigQuery query:
'''
SELECT COUNT(*), EXTRACT(HOUR FROM EdgeStartTimestamp) AS hour
FROM [TABLE]
WHERE EdgeStartTimestamp > TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND EdgeStartTimestamp < TIMESTAMP(CURRENT_DATE())
  AND ARRAY_TO_STRING(ARRAY_REVERSE([
        ARRAY_REVERSE(SPLIT(ClientRequestHost, "."))[ORDINAL(1)],
        ARRAY_REVERSE(SPLIT(ClientRequestHost, "."))[ORDINAL(2)]
      ]), ".") = '%s'
GROUP BY hour
ORDER BY hour
'''
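In case the host filter in that query is hard to read: it keeps the last two dot-separated labels of ClientRequestHost and compares them to the zone name. A rough Python equivalent, written only to illustrate the intent (the function name `last_two_labels` is hypothetical):

```python
def last_two_labels(host: str) -> str:
    """Return the last two dot-separated labels of a hostname.

    Mirrors the ARRAY_REVERSE / SPLIT / ORDINAL expression in the SQL above.
    """
    return ".".join(host.split(".")[-2:])


print(last_two_labels("www.example.com"))      # example.com
print(last_two_labels("a.b.api.example.com"))  # example.com
print(last_two_labels("example.com"))          # example.com
```

One difference from the SQL: for a host with a single label, the Python slice simply returns that label, whereas `[ORDINAL(2)]` in BigQuery raises an out-of-bounds error. Also note that taking two labels assumes a single-label public suffix such as `.com`; a zone like `example.co.uk` would not match this way.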

GraphQL query:
"""
query {
  viewer {
    zones(filter: {zoneTag: "%s"}) {
      httpRequests1hGroups(limit: 24, filter: {date: "%s"}) {
        sum { requests }
        dimensions { datetime }
      }
    }
  }
}
"""

FStephenQuaratiello • Jan 26 '22 18:01

Hey @FStephenQuaratiello! Thanks for sharing your queries. I'm not particularly familiar with this tool, but I can help with the GraphQL part.

httpRequests1hGroups represents hourly aggregated eyeball requests. So if you want to compare it with other sources, make sure you're counting only eyeball requests there as well. On top of that, if your query covers "today", I would expect the last hour not to be "full", because aggregation involves buffering.
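As a rough illustration of the second point, here is a minimal sketch of an hour-by-hour comparison that leaves out the most recent hour, since it may still be incomplete. It assumes you have already loaded both result sets into dictionaries keyed by hour; the function name and the sample numbers are made up for the example and do not come from either API.

```python
def compare_hourly(reference: dict, imported: dict, drop_last: bool = True) -> dict:
    """Compare two {hour: count} mappings hour by hour.

    Returns {hour: (reference_count, imported_count, relative_gap)}.
    If drop_last is True, the latest hour in `reference` is skipped because
    an aggregated source may not have finished filling it in yet.
    """
    hours = sorted(reference)
    if drop_last and hours:
        hours = hours[:-1]

    report = {}
    for hour in hours:
        ref = reference[hour]
        imp = imported.get(hour, 0)
        # Avoid dividing by zero for hours with no reference traffic.
        gap = (ref - imp) / ref if ref else 0.0
        report[hour] = (ref, imp, gap)
    return report


# Hypothetical counts, for illustration only.
graphql_counts = {0: 1000, 1: 1200, 2: 300}   # hour 2 is still being aggregated
bigquery_counts = {0: 990, 1: 1200, 2: 800}

for hour, (ref, imp, gap) in compare_hourly(graphql_counts, bigquery_counts).items():
    print(f"hour {hour}: reference={ref} imported={imp} gap={gap:.2%}")
```

The comparison is only meaningful once both sides count the same kind of requests, so any filtering to eyeball requests has to happen before the counts reach this function.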

victor-perov • Jan 31 '22 15:01