
Consider using the Git Data API to support larger files

JamesMGreene opened this issue 3 years ago · 5 comments

During a recent workflow run, I attempted to use the checkout-files Action to check out a fairly large package-lock.json file (npm lockfile) weighing in at about 2 MB. I received this unhandled promise rejection:

Run Bhacaz/checkout-files@c8f01756bfd894ba746d5bf48205e19000b0742b
(node:1939) UnhandledPromiseRejectionWarning: HttpError: This API returns blobs up to 1 MB in size. The requested blob is too large to fetch via the API, but you can use the Git Data API to request blobs up to 100 MB in size.: {"resource":"Blob","field":"data","code":"too_large"}
    at /home/runner/work/_actions/Bhacaz/checkout-files/c8f01756bfd894ba746d5bf48205e19000b0742b/node_modules/@octokit/request/dist-node/index.js:66:23
    at processTicksAndRejections (internal/process/task_queues.js:93:5)

You should consider using the Git Data API to support downloading larger files. 📦

JamesMGreene avatar Sep 22 '21 21:09 JamesMGreene

How do we get the value of file_sha for https://docs.github.com/en/rest/reference/git#get-a-blob?

The current action uses getContent(), which accepts the file path.
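For reference, the current flow with getContent() looks roughly like this (a minimal sketch; the octokit wiring and the fetchFile/decodeContent helper names are assumptions for illustration, not the action's actual code):

```javascript
// Minimal sketch of the action's current approach: fetch a file via the
// Repository Contents API and decode the base64 payload. This is the call
// that fails with "too_large" once the blob exceeds 1 MB.

// Decode the `content` field returned by
// GET /repos/{owner}/{repo}/contents/{path}
function decodeContent(base64Content) {
  return Buffer.from(base64Content, "base64").toString("utf8");
}

async function fetchFile(octokit, owner, repo, path, ref) {
  const { data } = await octokit.rest.repos.getContent({ owner, repo, path, ref });
  return decodeContent(data.content);
}
```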

sun avatar Oct 18 '21 17:10 sun

I had to do a little digging to explore that question! I've identified at least 2 viable ways.

Given that both of these approaches result in extra API calls, it might be worthwhile to keep the current approach as the primary one and only fall back to this secondary approach when the request fails with a 403 status code. That would mean a bit more code to maintain, but it is probably the best trade-off for most use cases. 🤷🏻

Combining the Repository Contents and Git Data APIs

  1. Reduce the list of requested file paths into a unique list of file path parent directories
  2. Get the repo contents for each parent directory path (instead of for each file path)
  3. In each response, find those files with matching path values and grab their sha property (or just the full GitHub Data Blob API URL from the _links.git property if you don't want to dynamically build the URL)
  4. Get a blob for each entry, still decoding the responses from base64 as the action does today

Using only the Git Data API

  1. Get a tree for the current branch/sha
  • Would require PR #6, or analysis of the GitHub Actions event data to get the branch/sha/repository default_branch (or PR base branch, perhaps?)
  • If any of the requested file paths are not in the root directory, then you must add the ?recursive=true query param, or else make multiple queries to get individual trees based on the first response (especially if the response has truncated: true)
  2. For the entries in the response's tree array, find those with matching path values and grab their sha property (or just the full GitHub Data Blob API URL from the url property if you don't want to dynamically build the URL)
  3. Get a blob for each entry, still decoding the responses from base64 as the action does today
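A rough sketch of this tree-only variant (again an assumption-laden outline for illustration: the helper names are made up, and the tree shape matches `GET /repos/{owner}/{repo}/git/trees/{tree_sha}?recursive=true`):

```javascript
// Pick out the requested paths from the response's `tree` array.
function matchTreeEntries(tree, filePaths) {
  const wanted = new Set(filePaths);
  return tree.filter((entry) => entry.type === "blob" && wanted.has(entry.path));
}

async function fetchViaTree(octokit, owner, repo, treeSha, filePaths) {
  // Step 1: fetch the tree for the branch/sha. `recursive` is needed for any
  // non-root paths; a `truncated: true` response would require walking
  // individual subtrees instead.
  const { data } = await octokit.rest.git.getTree({
    owner, repo, tree_sha: treeSha, recursive: "true",
  });
  const blobs = {};
  for (const entry of matchTreeEntries(data.tree, filePaths)) {
    // Steps 2-3: fetch each matched blob and decode from base64, as today
    const { data: blob } = await octokit.rest.git.getBlob({
      owner, repo, file_sha: entry.sha,
    });
    blobs[entry.path] = Buffer.from(blob.content, "base64").toString("utf8");
  }
  return blobs;
}
```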

JamesMGreene avatar Nov 17 '21 21:11 JamesMGreene

Here's an updated and easier option! 🎉

https://github.blog/changelog/2022-05-03-increased-file-size-limit-when-retrieving-file-contents-via-rest-api/

TL;DR: Keep doing things as you are today, but just set this custom media type on the file retrieval request headers:

Accept: application/vnd.github.v3.raw
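In other words, the request stays the same and only the Accept header changes (a minimal sketch; `buildRawRequest` is a hypothetical helper for illustration):

```javascript
// Same Contents endpoint as before, but the Accept header asks for the raw
// file body instead of a base64-encoded JSON payload, which lifts the
// effective size limit per the linked changelog entry.
function buildRawRequest(owner, repo, filePath, ref) {
  return {
    url: `https://api.github.com/repos/${owner}/${repo}/contents/${filePath}?ref=${ref}`,
    headers: {
      accept: "application/vnd.github.v3.raw",
      // authorization: `token ${process.env.GITHUB_TOKEN}`,
    },
  };
}

// With Octokit, the equivalent is roughly:
// const { data } = await octokit.rest.repos.getContent({
//   owner, repo, path: filePath, ref,
//   mediaType: { format: "raw" }, // sets Accept: application/vnd.github.v3.raw
// });
// `data` is then the raw file body, with no base64 decoding needed.
```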

JamesMGreene avatar Jun 01 '22 15:06 JamesMGreene

Created PR: https://github.com/Bhacaz/checkout-files/pull/9

JamesMGreene avatar Jun 01 '22 15:06 JamesMGreene

@sun + @Bhacaz, I have opened a PR to utilize the raw endpoint that @JamesMGreene mentioned: https://github.com/Bhacaz/checkout-files/pull/23

Any interest in merging this into the action? If not, I will promote this to my own public action.

Thanks!

jordanmnunez avatar Feb 10 '23 02:02 jordanmnunez