
Problem pulling more than 10,000 clientIDs

Open allen-black opened this issue 5 years ago • 37 comments

What goes wrong

I am trying to pull the number of page views each clientID has over a period of time (the longer-term goal is to get ga:clientID and ga:pagePath together, but I run into this issue in that report too), and I'm getting a result of just 10,000 rows. I get this no matter the date range, and GA tells me I have more users than that in the period. I'm able to get other dimensions to report more rows, but for some reason this one is limiting me.

I previously had max = -1 and just tried 999999 based on previous posts.

pageViews <- function (start, end) {
  data <- google_analytics(
    viewId = ga_id,
    date_range = c(start, end),
    metrics = "ga:pageviews",
    dimensions = c('ga:clientID'),
    max = 999999
  )
  return(data)
}

Steps to reproduce the problem

  • Find a View that has more than 10,000 unique clientIDs in a period.
  • Use the code above and see if you get a greater result.

Expected output

  • More than 10,000 rows of unique clientIDs

Actual output

  • Always limited to 10,000 no matter the date.

Before you run your code, please run:

2019-10-15 07:47:11> Multi-call to API
2019-10-15 07:47:12> Calling APIv4....
2019-10-15 07:47:12> Multiple v4 batch
2019-10-15 07:47:12> Fetching v4 data batch...
2019-10-15 07:47:12> Request: https://analyticsreporting.googleapis.com/v4/reports:batchGet/
2019-10-15 07:47:12> Body JSON parsed to: {"reportRequests":[{"viewId":"ga:1911XXXXX","dateRanges":[{"startDate":"2019-10-12","endDate":"2019-10-14"}],"samplingLevel":"DEFAULT","dimensions":[{"name":"ga:clientID"}],"metrics":[{"expression":"ga:pageviews","alias":"ga:pageviews","formattingType":"METRIC_TYPE_UNSPECIFIED"}],"pageToken":"0","pageSize":10000,"includeEmptyRows":true}, ... (the same request repeated with pageToken stepping by 10000 up to "990000"; the final batch has pageSize 9999) ...]}
2019-10-15 07:49:22> Downloaded [10000] rows from a total of [10001].

Session Info

Please run sessionInfo() so we can check what versions of packages you have installed.

R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
 [1] factoextra_1.0.5 cluster_2.0.7-1 skmeans_0.2-11
 [4] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.2
 [7] purrr_0.3.2 readr_1.3.1 tidyr_0.8.3
[10] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.2.1
[13] googleAnalyticsR_0.6.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.4.2 compiler_3.4.4 googleAuthR_0.8.0
 [6] tools_3.4.4 digest_0.6.20 packrat_0.5.0 clue_0.3-57 lubridate_1.7.4
[11] jsonlite_1.6 memoise_1.1.0 nlme_3.1-137 gtable_0.3.0 lattice_0.20-38
[16] pkgconfig_2.0.2 rlang_0.4.0 cli_1.1.0 rstudioapi_0.10 curl_3.3
[21] ggrepel_0.8.1 haven_2.1.1 withr_2.1.2 xml2_1.2.2 httr_1.4.0
[26] askpass_1.1 generics_0.0.2 hms_0.4.2 grid_3.4.4 tidyselect_0.2.5
[31] glue_1.3.1 R6_2.4.0 readxl_1.3.1 modelr_0.1.4 magrittr_1.5
[36] backports_1.1.4 scales_1.0.0 rvest_0.3.4 assertthat_0.2.1 colorspace_1.4-1
[41] stringi_1.4.3 openssl_1.4 lazyeval_0.2.2 munsell_0.5.0 slam_0.1-45
[46] broom_0.5.2 crayon_1.3.4

allen-black avatar Oct 15 '19 12:10 allen-black

One other note: I have another view that saves the clientID via a customTask to custom dimension 19. If I use that instead of clientID, I am able to get 51,436 results when I pull ga:dimension19 with ga:pagePath.

allen-black avatar Oct 15 '19 12:10 allen-black

I can confirm this, and it's weird. Can you see if it occurs via another tool such as the GA Query Explorer: https://ga-dev-tools.appspot.com/query-explorer/

If so, then it's an API bug to be reported to Google.

MarkEdmondson1234 avatar Oct 17 '19 13:10 MarkEdmondson1234

The Query Explorer doesn't have the clientID dimension yet either. But I also tried it with a property that captures clientId in dimension75, and it downloaded 1 million entries (the maximum total) with no issue, while the same viewId only returns 10,000 rows via the clientId dimension.

For now, stick to using the custom dimension if you have it, but I think this is a bug to report to Google.
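A minimal sketch of pulling via such a custom dimension (the helper name and the dimension index 19 here are illustrative, taken from the comment above; it assumes googleAnalyticsR is loaded so google_analytics() resolves at call time):

```r
# Hypothetical helper: query the custom dimension that stores the clientID
# (index 19 per the comment above) instead of the ga:clientID dimension.
pull_cd_clientid <- function(view_id, start, end, cd_index = 19) {
  dim_name <- sprintf("ga:dimension%d", cd_index)
  google_analytics(
    viewId     = view_id,
    date_range = c(start, end),
    metrics    = "ga:pageviews",
    dimensions = c(dim_name, "ga:pagePath"),
    max        = -1  # -1 asks for all rows
  )
}
```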

MarkEdmondson1234 avatar Oct 17 '19 13:10 MarkEdmondson1234

I put it on this issue; if you have more details, please add them there: https://issuetracker.google.com/issues/142795352

MarkEdmondson1234 avatar Oct 17 '19 13:10 MarkEdmondson1234

The second page of the API always seems to return only 1 row, and the total row count is always 10,001.

Change the viewId and try on your own viewId here:

https://developers.google.com/analytics/devguides/reporting/core/v4/rest/v4/reports/batchGet?apix_params=%7B%22resource%22%3A%7B%22reportRequests%22%3A%7B%22viewId%22%3A%22ga%3A106249469%22%2C%22dateRanges%22%3A%5B%7B%22startDate%22%3A%222018-01-02%22%2C%22endDate%22%3A%222019-10-01%22%7D%5D%2C%22samplingLevel%22%3A%22DEFAULT%22%2C%22dimensions%22%3A%5B%7B%22name%22%3A%22ga%3AclientId%22%7D%5D%2C%22metrics%22%3A%5B%7B%22expression%22%3A%22ga%3Asessions%22%2C%22alias%22%3A%22sessions%22%2C%22formattingType%22%3A%22METRIC_TYPE_UNSPECIFIED%22%7D%5D%2C%22pageToken%22%3A%229999%22%2C%22pageSize%22%3A10000%2C%22includeEmptyRows%22%3Atrue%7D%7D%7D

MarkEdmondson1234 avatar Oct 19 '19 15:10 MarkEdmondson1234

Thanks for looking into this! When I run that with a different viewId I definitely get more data, and it ends with nextPageToken: "19999".

allen-black avatar Oct 22 '19 17:10 allen-black

Hello. I have the same problem. Any updates here?

Sozonov avatar Dec 10 '19 07:12 Sozonov

No updates yet. The issue occurs within the API itself, so we need to wait for any updates there.

MarkEdmondson1234 avatar Dec 10 '19 08:12 MarkEdmondson1234

@MarkEdmondson1234 thanks. Do you have a link to the issue in the API tracker? Has work on this issue started?

Sozonov avatar Dec 10 '19 09:12 Sozonov

Not really; the thread at that link is private and just contains an example. The feature is officially in beta, so if it comes out of beta it should appear in the GA API news feed or become available in the Query Explorer online.

MarkEdmondson1234 avatar Dec 10 '19 09:12 MarkEdmondson1234

It looks like this may be a limit per API call, so if the request is broken down into calls of under 10k rows each, you can get all the data. If this is confirmed, I'll do this automatically in the function.

To do this something like below should work:

ga_call <- function(date_range, ...){
  per_day <- seq(date_range[[1]], date_range[[2]], by = 1)
  calls <- lapply(per_day, function(x){
      message("Fetching: ", x)
      google_analytics(..., date_range = c(x, x))
  })
  Reduce(rbind, calls)
}

my_date_range <- c(as.Date("2019-01-01"), as.Date("2020-01-01"))

ga_call(date_range = my_date_range, {put other google_analytics() arguments here})

It may be doable by using slow_fetch=TRUE:

google_analytics({etc}, slow_fetch=TRUE)

This won't help if you have more than 10k users a day. That will need to wait for the API to update.

MarkEdmondson1234 avatar Mar 17 '20 10:03 MarkEdmondson1234

Awesome. Thank you for continuing to look at this, @MarkEdmondson1234 ! I look forward to hearing if you get confirmation and are able to update the function as you note.

allen-black avatar Mar 17 '20 14:03 allen-black

Maciej Franas has this workaround if you need more than 10k rows a day:

For anyone else that would be looking for an answer in the future (hello, time-traveler) here's the answer


# 'output' collects one data.frame per day; 'z' and 'hours' are described below
output <- as.list(rep(0, length(z)))
for(i in 1:length(z)){
  day <- as.list(rep(0, 24))
  for(j in 1:length(hours)){
    day[[j]] <- google_analytics(view,
                                 date_range = c(z[i], z[i]),
                                 dimensions = c('ga:deviceCategory',
                                                'ga:clientId',
                                                'ga:hour'),
                                 metrics = c('ga:users'),
                                 filtersExpression = paste0('ga:users>0;ga:hour==', hours[j]),
                                 anti_sample = TRUE)
  }
  output[[i]] <- do.call('rbind', day)
}

where 'z' is a dates vector and 'hours' is a vector of two-digit hour strings (e.g. '01', '02'). This is obviously painfully slow and will most likely exhaust API limits, but nothing else works, at least for now.
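For reference, a minimal setup of the vectors the loop relies on could look like this (the view ID and date range are placeholders):

```r
view   <- 1234567                                          # placeholder viewId
z      <- seq(as.Date("2021-01-01"), as.Date("2021-01-07"), by = "day")
hours  <- sprintf("%02d", 0:23)                            # "00", "01", ..., "23"
output <- as.list(rep(0, length(z)))                       # one slot per day
```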

MarkEdmondson1234 avatar May 27 '20 10:05 MarkEdmondson1234

Thanks for the workaround on this!

mgahan-he avatar Aug 23 '20 14:08 mgahan-he

Anyone try this lately? I have tried it a couple of times tonight and I have not been running into the limit. I wonder if this has been magically fixed!!

mgahan avatar Dec 11 '20 08:12 mgahan

Anyone try this lately? I have tried it a couple of times tonight and I have not been running into the limit. I wonder if this has been magically fixed!!

Agreed, I'm no longer hitting the 10,000 row limit when pulling clientId. No official update published in the API changelog though: https://developers.google.com/analytics/devguides/changelog

RhysJackson avatar Jan 14 '21 12:01 RhysJackson

I notice that I still run into the limit at certain times - I don’t have a great feel for when it works and when it doesn’t.

mgahan-he avatar Jan 14 '21 14:01 mgahan-he

@everleazy is still getting the limit:

Hello. Sorry in advance, I am using Google Translate. =)

I've run into a problem: I can't get more than 10,000 rows when requesting clientId data. There are about 30,000 users per day in Analytics. When I query a single day, I get exactly 10,000 rows.

I tried the parameter max = -1 and anti_sample. I still only get 10,000 rows.

MarkEdmondson1234 avatar Feb 09 '21 08:02 MarkEdmondson1234

I am finding that I hit the limit if I pull today's or yesterday's client IDs.

But if I pull 2 or more days in the past, it seems to work great. Not sure why this happens.

mgahan avatar Feb 09 '21 14:02 mgahan

Hi!

I've always downloaded clientId and got unlimited results, but suddenly I just got 10,000 lines. Has anyone figured out why this happens?

DC-A avatar Feb 24 '21 10:02 DC-A

For the past 9 months or so, I have been able to pull clientId without much problem. However, starting yesterday, it appears like I am hitting the 10k limit once again.

mgahan-he avatar Aug 27 '21 13:08 mgahan-he

For the past 9 months or so, I have been able to pull clientId without much problem. However, starting yesterday, it appears like I am hitting the 10k limit once again.

I have the same problem, and one interesting thing: when I remove "campaign" from my code, it works just fine. Has anyone found what the problem is?

PedjaV avatar Sep 06 '21 14:09 PedjaV

@PedjaV Great find! I just tried without the campaign dimension and it appears to work well. A bummer that I can't get campaign data, but easier than pulling everything by the hour for now.

mgahan-he avatar Sep 06 '21 20:09 mgahan-he

I still can't get more than 10k rows when using just ga:date, ga:clientId and ga:hits with the date range set to just 2 days. Do I understand correctly that this is still a bug in the Core Reporting API and the only real workaround is hour-based filtering?

wilfreddesert avatar Nov 10 '21 08:11 wilfreddesert

Yes. It's not really a bug, more a beta feature that may or may not be supported in the future. Best is to put your clientId into a custom dimension and pull that instead.

MarkEdmondson1234 avatar Nov 10 '21 08:11 MarkEdmondson1234

Thank you very much. Can I ask you a question about GA while we're here, since you definitely have plenty of experience working with it? I am working with the Core Reporting API v4 at the moment.

The idea is to calculate a metric (CAC) where ideally you know the average time difference between the first hit and the first purchase. Registered users have a special ID assigned and logged in a custom dimension. To calculate, say, the 95th percentile of the first-touch-to-first-purchase sequence length, I need to find the first hit date for each user. Prior to registration, there is only the clientId. So the idea I had in mind was to get all user identifiers as of the purchase date and then find the first entry with that user identifier.

For this entry, we can further get the clientId and try to extend the timeline as far to the right as possible until it's either the first hit or just the last time this user's clientId changed for some reason.

Do you think this is a viable idea? I am worried that extracting all the hits might get me banned by Google pretty soon this way :)

wilfreddesert avatar Nov 10 '21 08:11 wilfreddesert

One fact that might be helpful and allow you to pull less detailed data.

GA1.1.904941809.1556093647

GA1 is a universal prefix of all cookies in this format.

The number 1 indicates the level of the domain, in this case the top-level domain. Everything after the second period is the Client ID.

Within the Client ID, the first part is a randomly generated number, and the second part (after the third period) is the creation time of the cid as a UNIX timestamp.

So within the clientId itself, you have the timestamp of when the cookie was created. If you have the registration timestamp and the clientId, you have enough info to calculate the time from cookie creation to registration.
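As a quick sketch, pulling that creation timestamp out of the example cookie value above in R:

```r
# Example _ga cookie value from above; the Client ID is everything
# after the second period ("904941809.1556093647")
cookie <- "GA1.1.904941809.1556093647"

parts     <- strsplit(cookie, ".", fixed = TRUE)[[1]]
client_id <- paste(parts[3], parts[4], sep = ".")

# The final component is the cookie creation time as a UNIX timestamp
created <- as.POSIXct(as.numeric(parts[4]), origin = "1970-01-01", tz = "UTC")
format(created, "%Y-%m-%d")  # "2019-04-24"
```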

mgahan-he avatar Nov 10 '21 08:11 mgahan-he

Yes, as @mgahan-he says, you have the timestamp of the first hit in the cookie ID itself.

I am worried that extracting all the hits might get me banned by Google pretty soon this way :)

There is no risk of being banned, just keep under the daily quotas. Look into authenticating with your own client ID if pulling a lot of data, as mentioned on the website: https://code.markedmondson.me/googleAnalyticsR/articles/setup.html

MarkEdmondson1234 avatar Nov 10 '21 08:11 MarkEdmondson1234

Sounds cool, thank you so much. So this means that whenever I have the first login, I can check the clientId for that row and essentially infer the first hit date. This doesn't help, though, with the fact that the cookie might have been deleted or the user could have used another device before, so this first hit date could be off. But I guess there is nothing reliable to account for that, right?

wilfreddesert avatar Nov 10 '21 08:11 wilfreddesert

Yep, users are actually only distinct cookies, and it's not that reliable, especially these days with browser protections. For instance, on Apple devices you may only have a 7-day window due to ITP.

MarkEdmondson1234 avatar Nov 10 '21 08:11 MarkEdmondson1234