Google Search Console connector not fetching more than 50K rows on a given day

Open dancook-doxo opened this issue 2 years ago • 2 comments

Environment

  • Airbyte version: Airbyte OS (0.40.15)
  • OS Version / Instance: AWS EC2
  • Deployment: Docker
  • Source Connector and version: Google Search Console v0.1.18
  • Destination Connector and version: Snowflake v0.4.42
  • Step where error happened: Sync job

Current Behavior

The Google Search Console data that Airbyte inserts into our warehouse, for the custom report whose JSON appears below, has never amounted to more than 50,484 records in a single day. But we know from Google's API that our daily total for this set of dimensions has gone as high as 155K records. So I suspect there is a limitation in the Airbyte GSC connector that stops it from issuing new API requests for a given day once 50K rows have been reached. I've checked the source code for the connector and nothing obvious stands out.

50K rows in a day should be well under the QPD quotas: https://developers.google.com/webmaster-tools/limits

The log always ends with something like "Read 100169 records from keyword_page_report stream" (50K x 2 because of SCD records).

{
  "name": "KEYWORD_PAGE_REPORT",
  "dimensions": [
     "date",
     "country",
     "device",
     "query",
     "page"
  ]
}

Expected Behavior

I expect to get more than 50K rows on a given day if our site's usage has that much data to provide. In fact I expect to get 150K on most days.

Logs

3804963a_8f8c_4a0b_91ae_d85d1d37caa4_logs_3110_txt.txt

Steps to Reproduce

  1. Set up the GSC connector: configure it for the relevant Website URL, pick a start date, etc.
  2. Provide the custom report JSON.
  3. Begin syncing.
  4. On a day when the site is known to have more than 50K rows to fetch from the GSC API, check how many were actually laid down by Airbyte. In our case the number has never been more than 50,986. Out of 580 total dates we've had more than 50K rows on 440 of them, and that includes almost a full year of consecutive days at more than 50K.
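
One way to run the check in step 4 is sketched below. It is only an illustration: it assumes the per-day row counts have already been pulled from the warehouse into a dict, and the function name `capped_days` and the 2% tolerance band are made up for this example, not part of Airbyte or the GSC API.

```python
ROW_CAP = 50_000  # daily row limit under suspicion


def capped_days(rows_per_day, cap=ROW_CAP, tolerance=0.02):
    """Return the dates whose row count sits at or just above the cap.

    rows_per_day: dict mapping a date string to the number of rows synced
    for that date. A count between `cap` and `cap * (1 + tolerance)` is
    treated as a sign that the day was truncated (assumed heuristic).
    """
    upper = cap * (1 + tolerance)
    return sorted(day for day, n in rows_per_day.items() if cap <= n <= upper)


if __name__ == "__main__":
    # Illustrative counts only; the middle date is clearly under the cap.
    sample = {"2023-01-15": 50_484, "2023-01-16": 31_200, "2023-01-17": 50_986}
    print(capped_days(sample))
```

If nearly every busy day lands in that narrow band, as it does for us, the data is most likely being cut off rather than naturally ending there.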

Are you willing to submit a PR?

No

dancook-doxo avatar Jan 18 '23 16:01 dancook-doxo

Note also that a different report, called KEYWORD_SITE_REPORT_BY_PAGE, has never returned fewer than 50,033 and never more than 50,901 records.

dancook-doxo avatar Jan 18 '23 16:01 dancook-doxo

As documented here, this is theoretically a fundamental limitation enforced by Google themselves. But the curious thing is that when using Fivetran to fetch the same set of dimensions and metrics, it's possible to get more than 50K daily rows. In fact we've seen days where the number of rows gets above 150K (most recently April of 2022). Further, we haven't seen a day with fewer than 55K rows in over two months.

So the $64K question is this: how does Fivetran work around the 50K limitation, and can the Airbyte connector do the same?

dancook-doxo avatar Feb 01 '23 16:02 dancook-doxo

From the Google docs:

The maximum you can export through the Search Console user interface is 1,000 rows of data. Currently, the upper limit for the data exported through the Search Analytics API (and through the Looker Studio connector) is 50,000 rows per day per site per search type, which may not be reached in all cases ... For requests that don't involve query or URL dimensions, such as countries, devices, and Search Appearances, Search Console will display and export all the data.

Some thoughts:

  • Fivetran (FT) could be splitting out the queries by search type to maximize the available rate limit
  • FT could be using requests that don't involve query or URL dimensions
  • Special rate limits could be available to FT

Action items for us:

  • investigate if we can split our queries or make them more granular somehow in order to get more data from the connector
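
A rough sketch of what "splitting by search type" could look like, to make the idea concrete. This is not what the connector or Fivetran does today; it only builds request bodies for the Search Analytics `query` endpoint and makes no API calls. The helper name `build_requests` is hypothetical, the list of search types is an assumption taken from the API's `type` field, and whether this actually yields more than 50K rows per day is untested.

```python
# Assumed subset of values for the request's `type` field.
SEARCH_TYPES = ["web", "image", "video", "news"]
PAGE_SIZE = 25_000  # maximum rowLimit the API accepts per request


def build_requests(day, dimensions, search_types=SEARCH_TYPES,
                   max_rows=50_000, page_size=PAGE_SIZE):
    """Build one paginated series of request bodies per search type.

    Each search type gets its own pages (startRow 0, 25000, ...), so the
    50K rows-per-search-type limit is applied to each type separately.
    """
    bodies = []
    for search_type in search_types:
        for start_row in range(0, max_rows, page_size):
            bodies.append({
                "startDate": day,
                "endDate": day,
                "dimensions": list(dimensions),
                "type": search_type,
                "rowLimit": page_size,
                "startRow": start_row,
            })
    return bodies


if __name__ == "__main__":
    dims = ["date", "country", "device", "query", "page"]
    for body in build_requests("2023-01-18", dims):
        print(body["type"], body["startRow"])
```

In a real sync, paging for a given search type would stop as soon as a response comes back with fewer than `rowLimit` rows, rather than always issuing every page.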

sherifnada avatar Jun 23 '23 18:06 sherifnada

At Airbyte, we seek to be clear about the project priorities and roadmap. This issue has not had any activity for 180 days, suggesting that it's not as critical as others. It's possible it has already been fixed. It is being marked as stale and will be closed in 20 days if there is no activity. To keep it open, please comment to let us know why it is important to you and if it is still reproducible on recent versions of Airbyte.

octavia-squidington-iii avatar May 08 '24 09:05 octavia-squidington-iii

The behavior observed from Airbyte's GSC connector hasn't changed since first opening the ticket. We have come to live with the data limitation though.

dancook-doxo avatar May 08 '24 15:05 dancook-doxo