Scoop icon indicating copy to clipboard operation
Scoop copied to clipboard

[Feature] useGithubAPI by default when using github checkver

Open Lutra-Fs opened this issue 1 month ago • 17 comments

Feature Request

Is your feature request related to a problem? Please describe.

Scoop's GitHub checkver currently has inconsistent behavior that relies on fragile HTML scraping. When using the GitHub matcher with additional properties like regex, it falls back to parsing HTML pages instead of using the GitHub API, leading to:

  • Brittle regex patterns that match against HTML URLs instead of clean version strings
  • Performance issues due to downloading full HTML pages vs compact JSON API responses
  • Maintenance burden as manifest authors must understand GitHub's HTML structure
  • Reliability problems when GitHub changes their HTML layout, breaking existing manifests

Describe the solution you'd like

Modify Scoop's checkver logic to use GitHub API by default when using the GitHub matcher, regardless of additional properties. The proposed solution would:

  1. Enable jsonpath with GitHub matcher: Allow jsonpath to work with github property
  2. Default to API mode: Use GitHub API for all GitHub matcher usage, not just simple cases
  3. Simplify regex patterns: Extract version directly from tag_name field, then clean with simple regex

Examples of improved manifest structure:

  // Current (HTML scraping)
  "checkver": {
      "github": "https://github.com/wez/atomicparsley",
      "regex": "/releases/tag/(\\d+\\.\\d+\\.[a-f0-9]+)"
  }

  // Proposed (API + jsonpath)
  "checkver": {
      "github": "https://github.com/wez/atomicparsley",
      "jsonpath": "$.tag_name",
      "regex": "([\\d]+\\.[\\d]+\\.[\\d]+\\.[a-f0-9]+)"
  }

  // Simple cases remains the same
  "checkver": {
      "github": "https://github.com/owner/repo"
  }

We can also share the result between checkver and hash extraction, since it is using github API for github assets now.

Describe alternatives you've considered

One option could be introducing more controls within the checkver property itself. However, in most cases, it's primarily used to match a tag, so additional flexibility may not offer substantial benefit.

Lutra-Fs avatar Nov 22 '25 13:11 Lutra-Fs

cc: @z-Fng @stevehipwell

Lutra-Fs avatar Nov 22 '25 13:11 Lutra-Fs

@Lutra-Fs I've been thinking about this same problem and came to a very similar conclusion. I think my only concern would be backwards compatibility, I was thinking of a new primary key like githubjson. But with your proposal the initial implementation could keep the existing behaviour if a regex but no jsonpath was specified?

stevehipwell avatar Nov 22 '25 13:11 stevehipwell

Scoop's GitHub checkver currently has inconsistent behavior that relies on fragile HTML scraping. When using the GitHub matcher with additional properties like regex, it falls back to parsing HTML pages instead of using the GitHub API.

According to the code, checkver uses the GitHub API in the following situations:

  1. The checkver field is set to the string "github". e.g.,
"checkver" : "github"
  1. The checkver field contains only a "github" subkey. (Restriction not working) e.g.,
"checkver" : [
    "github": "https://github.com/..."
]
  1. URL matches the GitHub API regex pattern. e.g.,
"checkver" : [
    "url": "https://api.github.com/..."
]

I'm not sure exactly which situation you are referring to. Are you referring to the second case?

If so, in my tests, it seems that the restriction -- where the checkver field contains only a "github" subkey -- does not take effect. In other words, in practice, the GitHub API is being used in most cases. And the current regex works just fine.

We may need to check with @rashil2000 (https://github.com/ScoopInstaller/Scoop/pull/4557) and remove this logic if we confirm it is unnecessary.

https://github.com/ScoopInstaller/Scoop/blob/b588a06e41d920d2123ec70aee682bae14935939/bin/checkver.ps1#L159-L163

e.g., https://github.com/ScoopInstaller/Extras/blob/master/bucket/jstock.json

    "checkver": {
        "github": "https://github.com/yccheok/jstock/",
        "regex": "release_(\\d+)-(\\d+)-(\\d+)-(\\d+)",
        "replace": "${1}.${2}.${3}.${4}"
    },
Image Image

cc: @rashil2000

z-Fng avatar Nov 22 '25 22:11 z-Fng

I think using JSON path would be easier than our approach by using regex releases/tag/?

Lutra-Fs avatar Nov 22 '25 22:11 Lutra-Fs

Error log:

    "checkver": {
        "github": "https://github.com/wez/atomicparsley",
        "jsonpath": "$.tag_name",
        "regex": "([\\d]+\\.[\\d]+\\.[\\d]+\\.[a-f0-9]+)"
    },
main on  github-checkver [✘!?]
❯ .\bin\checkver.ps1 atomicparsley -u -f
ConvertFrom-Json: D:\Applications\Scoop\apps\scoop\current\lib\json.ps1:130
Line |
 130 |      $result = $json | ConvertFrom-Json -ea stop
     |                        ~~~~~~~~~~~~~~~~~~~~~~~~~
     | Conversion from JSON failed with error: Unexpected character encountered while parsing value: <. Path '', line
     | 0, position 0.
atomicparsley: couldn't find '$.tag_name' in https://github.com/wez/atomicparsley/releases/latest

So I checked and assumed if ($json.checkver.PSObject.Properties.Count -eq 1) { $useGithubAPI = $true } is suspicious.

We can always go back to using checkver with api.github.com at url, but I think providing a unified approach should be better.

Lutra-Fs avatar Nov 22 '25 23:11 Lutra-Fs

Did you configure a GitHub token?

https://github.com/ScoopInstaller/Scoop/blob/b588a06e41d920d2123ec70aee682bae14935939/bin/checkver.ps1#L221-L224

It works fine on my side.

PS> jq '{checkver}' .\bucket\atomicparsley.json                                                                                           
{
  "checkver": {
    "github": "https://github.com/wez/atomicparsley",
    "jsonpath": "$.tag_name",
    "regex": "([\\d]+\\.[\\d]+\\.[\\d]+\\.[a-f0-9]+)"
  }
}
PS> .\bin\checkver.ps1 -App atomicparsley -f                                                                                              
https://api.github.com/repos/wez/atomicparsley/releases/latest
atomicparsley: couldn't match '([\d]+\.[\d]+\.[\d]+\.[a-f0-9]+)' in https://api.github.com/repos/wez/atomicparsley/releases/latest

z-Fng avatar Nov 22 '25 23:11 z-Fng

I think I might misconfigure it with the wrong config name ;; checking it now

Lutra-Fs avatar Nov 22 '25 23:11 Lutra-Fs

The behavior when no GitHub token is configured is inconsistent with when a token is set, and there's not even any related logs. This is very counterintuitive for contributors. Should we consider changing this behavior?

z-Fng avatar Nov 22 '25 23:11 z-Fng

We can also share the result between checkver and hash extraction, since it is using github API for github assets now.

I completely agree with this. When refactoring webpage fetching for hash extraction in #6535, I also planned to cache the results to avoid hitting the GitHub API multiple times.

z-Fng avatar Nov 22 '25 23:11 z-Fng

I think a warning might need to be added.

Lutra-Fs avatar Nov 22 '25 23:11 Lutra-Fs

I am currently debugging why the regex does not work

Lutra-Fs avatar Nov 22 '25 23:11 Lutra-Fs

Looks like the regex has one unnecessary group. It should be ([\d]+.[\d]+.~~[\d]+.~~[a-f0-9]+)

P.S, More debug logs are needed; Scoop is really hard to debug.

z-Fng avatar Nov 22 '25 23:11 z-Fng

Yeah, I really agree with that. I have fixed that either

Lutra-Fs avatar Nov 22 '25 23:11 Lutra-Fs

I am currently thinking on reuse the API result between checkver and autoupdate hash extraction, depending on how we go with https://github.com/ScoopInstaller/GithubActions/issues/64 and https://github.com/ScoopInstaller/Scoop/pull/6535. If we keep hitting the rate limit, I think a reuse here could be better, since it is containing calculated digest as well.

Lutra-Fs avatar Nov 22 '25 23:11 Lutra-Fs

I've taken a look at the current behaviour and it looks incorrect to me for a number of reasons. But the primary issue in this context is the use of the /releases/tag/ prefix meaning that if you're using the API you're matching on the html_url value which also means that some characters will be escaped (e.g. the + character in the K0s tag). This makes no logical sense as the tag_name can be directly accessed and a simple regex can then be used.

The other obvious issue with the code is that it's running code for all potential cases without short circuiting. It also runs the least specific code before looking for more specific configuration. I'd expect to see conditional branching based on the checkver input which would be clearer to read and faster.

That all said by including a regex (or jsonpath) value the code skips calling the API as there are more than 1 keys under checkver. I can't see a logical reason why this behaviour is required as if we use the GitHub API directly in the URL it would work correctly?

I'm not sure if the current implementation can be modified/fixed, but the default spec when using the API should be equivalent to the following.

{
  "checkver": {
    "github": "https://github.com/org/repo",
    "jsonpath": "$.tag_name",
    "regex": "v(.+)"
  }
}

Or if this needs to be a new behaviour, maybe something like this.

{
  "checkver": {
    "github": "https://github.com/org/repo",
    "githubapi": true,
    "jsonpath": "$.tag_name",
    "regex": "v(.+)"
  }
}

stevehipwell avatar Nov 25 '25 13:11 stevehipwell

That all said by including a regex (or jsonpath) value the code skips calling the API as there are more than 1 keys under checkver. I can't see a logical reason why this behaviour is required as if we use the GitHub API directly in the URL it would work correctly?

As I mentioned above, there's a bug here -- this limitation doesn't really exist. It will still switch to GitHub API even when there are more than 1 key under checkver.

PS> jq '{checkver}' .\bucket\k0s.jsonpath
{
  "checkver": {
    "github": "https://github.com/k0sproject/k0s",
    "jsonpath": "$.tag_name",
    "regex": "v(\\d+\\.\\d+\\.\\d+\\+k0s\\.\\d+)"
  }
}

PS> .\bin\checkver.ps1 -App k0s -f
https://api.github.com/repos/k0sproject/k0s/releases/latest
k0s: 1.34.2+k0s.0 (scoop version is 1.34.2+k0s.0)
Forcing autoupdate!
Autoupdating k0s
Could not find hash in https://api.github.com/repos/k0sproject/k0s/releases
Downloading k0s-v1.34.2 k0s.0-amd64.exe to compute hashes!
Loading k0s-v1.34.2+k0s.0-amd64.exe from cache
Computed hash: 934874c43314e96eb8bab21b40c851913e7cf4597a1632d350580ed195fd4656
Writing updated k0s manifest

In fact, it wasn't until #4557 that Scoop switched to using the GitHub API to retrieve version numbers. My guess is that it was probably added just to avoid breaking the original behavior. Originally, the GitHub mode regular expression matched version numbers by only recognizing digits and periods (/releases/tag/(?:v|V)?([\d.]+)) from Github page. These characters are not escaped even in URLs returned by the REST API, so the original regular expression was simply reused.

This is only my assumption for now -- we'll still need confirmation from @rashil2000 .

z-Fng avatar Nov 26 '25 00:11 z-Fng

I think the primary issue is that when gh_token is not configured, the checkver logic falls back to HTML parsing instead of utilizing the GitHub API. Since the GitHub API supports unauthenticated requests (albeit with lower rate limits), we should default to using the API regardless of whether gh_token is set. Aligned with @z-Fng mentioned earlier.

This inconsistency is particularly problematic because our CI environment has gh_token configured, which means the behavior contributors experience locally (without tokens) differs from what runs in CI. This creates confusion and makes it harder to debug issues, as the two environments execute different code paths.

This reminds me of the railroad gauge story: the standard railroad track width of 4 feet 8.5 inches originated from the width of old horse-drawn wagons, which in turn came from ancient Roman chariot ruts. We carried forward a measurement from Roman times simply because "that's how it's always been done."

Similarly, maintaining backward compatibility with HTML parsing just because it was the original implementation can lock us into suboptimal design decisions. While backward compatibility has value, we shouldn't let it prevent us from adopting better solutions - especially when the GitHub API is more reliable, maintainable, and already available for anonymous use.

Sometimes the right path forward means breaking from legacy approaches rather than perpetually accommodating them. The API-first approach is objectively better: it's less fragile, officially supported, and actually unifies behaviour across authenticated and unauthenticated scenarios rather than creating divergent code paths.

One other issue is that we are not sure about how and why if ($json.checkver.PSObject.Properties.Count -eq 1) { $useGithubAPI = $true } works. or maybe we should write $useGithubAPI=$true without extra checks.

I am currently busy this week and might reply later, and I will have a detailed look at the checkver module probably next week, and think on how we should touch it.

Lutra-Fs avatar Nov 26 '25 01:11 Lutra-Fs