Requesting GitHub activity for a timeframe of multiple years won't return all the data
When I called the get_activity function with, for instance, since set to 2014 and until set to 2017, the returned dataframe contained only part of the data that should have been returned for this interval.
As a workaround, I wrote some code that splits the time interval into multiple one-year intervals, passes each of those to get_activity, and merges the resulting dataframes at the end.
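A minimal sketch of that chunking idea (the helper name is hypothetical, and it assumes get_activity accepts since/until as ISO date strings and returns a pandas DataFrame; the actual code I use is linked in a later comment):

```python
import pandas as pd

from github_activity import get_activity


def get_activity_chunked(target, since, until):
    """Fetch activity in one-year chunks and merge the resulting dataframes.

    Hypothetical helper: assumes ``get_activity`` accepts ``since``/``until``
    as ISO date strings and returns a pandas DataFrame.
    """
    start = pd.Timestamp(since)
    stop = pd.Timestamp(until)
    frames = []
    while start < stop:
        # Never ask for more than one year of data in a single query.
        end = min(start + pd.DateOffset(years=1), stop)
        frames.append(
            get_activity(target, since=str(start.date()), until=str(end.date()))
        )
        start = end
    return pd.concat(frames, ignore_index=True)
```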
I do not think this is a limitation of the github-activity tool itself, but rather of the GitHub GraphQL infrastructure; unfortunately, it affects this tool too when a user tries to pull activity spanning multiple years.
If you can reproduce this issue on your own setup and think it would be a good idea to integrate this into the repo, let me know; I'd love to submit a PR.
Ahhh interesting - so does GraphQL limit the total amount of data returned, or is it specifically a cap on "one year"? I agree that it would be helpful to have a function that can manage this process. Even better if either:
- The functionality is automatically called under the hood if a user makes a query that is going to return too much data
- The default behavior raises a warning or an error when the user makes a long query, and directs them to use a different function for this (roughly as sketched below)
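A minimal sketch of what that second option might look like (the cutoff and function name here are purely hypothetical, just to illustrate the idea):

```python
import warnings

import pandas as pd

# Hypothetical cutoff beyond which a single query is assumed unreliable.
MAX_QUERY_SPAN = pd.Timedelta(days=365)


def _check_interval(since, until):
    """Warn when the requested interval likely exceeds what one query returns."""
    if pd.Timestamp(until) - pd.Timestamp(since) > MAX_QUERY_SPAN:
        warnings.warn(
            "Intervals longer than one year may be silently truncated by the "
            "GitHub GraphQL API; consider splitting the query into chunks.",
            UserWarning,
        )
```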
@choldgraf I don't know if there is a documented limit on what GraphQL wants/is able to return; I'll research more. One year is what I discovered works for the org/repo I was running queries against. If there is indeed a data limit, this one-year rule won't work well for every org/repo (they have different numbers of PRs and issues, so response size might differ from repo to repo).
I already have some code that does this behind the scenes when the user specifies a longer interval (https://github.com/robertbindar/mariadb.org-tools/blob/master/reporting/process_github_activity.py#L26)
But because I wanted to get this done quickly, I hardcoded the one-year limit, which I don't think is the best approach, even for MariaDB/server (if in 2020 we have more PRs than GraphQL can return in one chunk, the script won't work properly).
If we can find this data limit in the GraphQL docs (or maybe even come up with a way to determine the limit dynamically from code), the year-by-year ($limit-by-$limit) trick I have there can easily be adapted for integration into github-activity.
[Update] Resource throttling seems to be documented here: https://developer.github.com/v4/guides/resource-limitations/
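For reference, that page documents a rateLimit field that can be added to any query to see its computed cost and the remaining budget. A quick sketch of checking it with requests (assumes a token in the GITHUB_TOKEN environment variable):

```python
import os

import requests

# The rateLimit field is documented in the resource-limitations guide linked above.
query = """
{
  rateLimit {
    cost
    remaining
    resetAt
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},
    headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()
print(resp.json()["data"]["rateLimit"])
```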