GitHub storage fails when content is >1MB
The GitHub Contents API has a limit that causes file fetches, creations, and updates to fail when the content is larger than 1 MB. I have been hitting this recently:
```
2018/11/12 22:00:06 github: creating updates/1542060006221540759-check.json on branch 'master'
2018/11/12 22:00:07 GET https://api.github.com/repos/parkr/status/contents/updates/index.json?ref=heads%2Fmaster: 403 This API returns blobs up to 1 MB in size. The requested blob is too large to fetch via the API, but you can use the Git Data API to request blobs up to 100 MB in size. [{Resource:Blob Field:data Code:too_large Message:}]
```
The Git Data API requires creating the blob, tree, and commit objects manually, but it provides a much more robust means of dealing with larger data. We should migrate the GitHub storage backend to use this method instead.
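For reference, a rough sketch of what the write path could look like with go-github's Git Data API (blob → tree → commit → ref update). This is untested, the exact signatures and pointer-vs-value fields differ between go-github versions (the import path below assumes v28), and the function and its parameters are placeholders rather than anything that exists in this repo:

```go
package githubstorage

import (
	"context"

	"github.com/google/go-github/v28/github"
)

// writeViaGitDataAPI commits one file to a branch using the Git Data API,
// which accepts blobs well beyond the Contents API's 1 MB limit.
func writeViaGitDataAPI(ctx context.Context, client *github.Client, owner, repo, branch, path, content string) error {
	// 1. Upload the file contents as a blob.
	blob, _, err := client.Git.CreateBlob(ctx, owner, repo, &github.Blob{
		Content:  github.String(content),
		Encoding: github.String("utf-8"),
	})
	if err != nil {
		return err
	}

	// 2. Find the commit the branch currently points at, and its tree.
	ref, _, err := client.Git.GetRef(ctx, owner, repo, "heads/"+branch)
	if err != nil {
		return err
	}
	parent, _, err := client.Git.GetCommit(ctx, owner, repo, *ref.Object.SHA)
	if err != nil {
		return err
	}

	// 3. Create a tree based on the parent's tree, replacing just this path.
	tree, _, err := client.Git.CreateTree(ctx, owner, repo, *parent.Tree.SHA, []*github.TreeEntry{{
		Path: github.String(path),
		Mode: github.String("100644"),
		Type: github.String("blob"),
		SHA:  blob.SHA,
	}})
	if err != nil {
		return err
	}

	// 4. Create the commit object pointing at the new tree.
	commit, _, err := client.Git.CreateCommit(ctx, owner, repo, &github.Commit{
		Message: github.String("update " + path),
		Tree:    tree,
		Parents: []*github.Commit{parent},
	})
	if err != nil {
		return err
	}

	// 5. Move the branch ref to the new commit.
	ref.Object.SHA = commit.SHA
	_, _, err = client.Git.UpdateRef(ctx, owner, repo, ref, false)
	return err
}
```

Reads would go through `Git.GetBlob` (after resolving the path to a blob SHA via the tree), which, per the error message above, supports blobs up to 100 MB.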
Opening this in case someone else has time to do this before I do.
Seeing if the Google folks would be interested in streamlining this on their side. It's presently a significant amount of code to use the Git API! https://github.com/google/go-github/issues/1052
Any updates on this? Is this still relevant?
I still hit this occasionally and have to remove the index to fix it.
Do you know if this will happen for 3 services with a simple HTTP check every 10 minutes? Or can you say in which cases this could occur (e.g. x watched services/URLs with an interval of y)? I'd like to avoid hitting this at some point.
@DanielRuf This will eventually occur for all configurations using the GitHub storage backend, but doing fewer checks with fewer services will increase the lead time. I have 11 services checked every 30 minutes and it occurs every few months for me.
One automated way to fix this in the GitHub storage backend is to take the byte size of the index and trim it down to < 1 MB on every attempt to save.
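Roughly something like this, assuming the index is a JSON map from check filename to a Unix-nanosecond timestamp (the map type, constant, and function name here are illustrative, not the actual code in this repo):

```go
package githubstorage

import (
	"encoding/json"
	"math"
)

// maxIndexBytes keeps the serialized index safely under the Contents
// API's 1 MB blob limit.
const maxIndexBytes = 1 << 20

// trimIndexBySize drops the oldest entries until the marshaled index
// fits under maxIndexBytes. It assumes the index maps check filenames
// to Unix-nanosecond timestamps (an assumption, not a confirmed type).
func trimIndexBySize(index map[string]int64) ([]byte, error) {
	for {
		b, err := json.Marshal(index)
		if err != nil {
			return nil, err
		}
		if len(b) < maxIndexBytes || len(index) == 0 {
			return b, nil
		}
		// Find and drop the entry with the smallest (oldest) timestamp.
		var oldestName string
		oldestTS := int64(math.MaxInt64)
		for name, ts := range index {
			if ts < oldestTS {
				oldestName, oldestTS = name, ts
			}
		}
		delete(index, oldestName)
	}
}
```

Re-marshaling on every iteration is wasteful for a large index; dropping entries in batches would be cheaper, but it keeps the sketch simple.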
Sounds like we need some data rentention feature here (remove old entries / files).
This appears to only occur when the index itself gets too large, so we could certainly prune old entries in the index such that the serialized JSON is always <1MB.
The status page that I have only ever shows 24 hours, so we could also limit the index to the last 24 hours.
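A time-based variant, with the same caveats about the index type, could be as simple as:

```go
package githubstorage

import "time"

// pruneIndexByAge removes entries older than maxAge, e.g.
// pruneIndexByAge(index, 24*time.Hour) to keep only the last day.
// Again assumes Unix-nanosecond timestamps as values (illustrative).
func pruneIndexByAge(index map[string]int64, maxAge time.Duration) {
	cutoff := time.Now().Add(-maxAge).UnixNano()
	for name, ts := range index {
		if ts < cutoff {
			delete(index, name)
		}
	}
}
```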
@parkr what are your expiry settings for the GH storage? The index should be cleaned up on maintain calls, depending on what you have in check_expiry. If this is zero/unset, your index won't get cleaned up, even if it does grow to be larger than GH limits.
That being said, it's a workaround. Regardless of what you set in check_expiry, the index must be limited to ~1 MB in size, which means it would drop/delete data after some time. Are you storing your checks (historically) forever, or do you delete those too when you recreate the index?