[Feature] useGithubAPI by default when using github checkver
Feature Request
Is your feature request related to a problem? Please describe.
Scoop's GitHub checkver currently has inconsistent behavior that relies on fragile HTML scraping. When using the GitHub matcher with additional properties like regex, it falls back to parsing HTML pages instead of using the GitHub API, leading to:
- Brittle regex patterns that match against HTML URLs instead of clean version strings
- Performance issues due to downloading full HTML pages vs compact JSON API responses
- Maintenance burden as manifest authors must understand GitHub's HTML structure
- Reliability problems when GitHub changes their HTML layout, breaking existing manifests
Describe the solution you'd like
Modify Scoop's checkver logic to use GitHub API by default when using the GitHub matcher, regardless of additional properties. The proposed solution would:
- Enable
jsonpathwith GitHub matcher: Allowjsonpathto work withgithubproperty - Default to API mode: Use GitHub API for all GitHub matcher usage, not just simple cases
- Simplify regex patterns: Extract version directly from tag_name field, then clean with simple regex
Examples of improved manifest structure:
// Current (HTML scraping)
"checkver": {
"github": "https://github.com/wez/atomicparsley",
"regex": "/releases/tag/(\\d+\\.\\d+\\.[a-f0-9]+)"
}
// Proposed (API + jsonpath)
"checkver": {
"github": "https://github.com/wez/atomicparsley",
"jsonpath": "$.tag_name",
"regex": "([\\d]+\\.[\\d]+\\.[\\d]+\\.[a-f0-9]+)"
}
// Simple cases remains the same
"checkver": {
"github": "https://github.com/owner/repo"
}
We can also share the result between checkver and hash extraction, since it is using github API for github assets now.
Describe alternatives you've considered
One option could be introducing more controls within the checkver property itself. However, in most cases, it's primarily used to match a tag, so additional flexibility may not offer substantial benefit.
cc: @z-Fng @stevehipwell
@Lutra-Fs I've been thinking about this same problem and came to a very similar conclusion. I think my only concern would be backwards compatibility, I was thinking of a new primary key like githubjson. But with your proposal the initial implementation could keep the existing behaviour if a regex but no jsonpath was specified?
Scoop's GitHub checkver currently has inconsistent behavior that relies on fragile HTML scraping. When using the GitHub matcher with additional properties like regex, it falls back to parsing HTML pages instead of using the GitHub API.
According to the code, checkver uses the GitHub API in the following situations:
- The checkver field is set to the string "github". e.g.,
"checkver" : "github"
- The checkver field contains only a "github" subkey. (Restriction not working) e.g.,
"checkver" : [
"github": "https://github.com/..."
]
- URL matches the GitHub API regex pattern. e.g.,
"checkver" : [
"url": "https://api.github.com/..."
]
I'm not sure exactly which situation you are referring to. Are you referring to the second case?
If so, in my tests, it seems that the restriction -- where the checkver field contains only a "github" subkey -- does not take effect. In other words, in practice, the GitHub API is being used in most cases. And the current regex works just fine.
We may need to check with @rashil2000 (https://github.com/ScoopInstaller/Scoop/pull/4557) and remove this logic if we confirm it is unnecessary.
https://github.com/ScoopInstaller/Scoop/blob/b588a06e41d920d2123ec70aee682bae14935939/bin/checkver.ps1#L159-L163
e.g., https://github.com/ScoopInstaller/Extras/blob/master/bucket/jstock.json
"checkver": {
"github": "https://github.com/yccheok/jstock/",
"regex": "release_(\\d+)-(\\d+)-(\\d+)-(\\d+)",
"replace": "${1}.${2}.${3}.${4}"
},
cc: @rashil2000
I think using JSON path would be easier than our approach by using regex releases/tag/?
Error log:
"checkver": {
"github": "https://github.com/wez/atomicparsley",
"jsonpath": "$.tag_name",
"regex": "([\\d]+\\.[\\d]+\\.[\\d]+\\.[a-f0-9]+)"
},
main on github-checkver [✘!?]
❯ .\bin\checkver.ps1 atomicparsley -u -f
ConvertFrom-Json: D:\Applications\Scoop\apps\scoop\current\lib\json.ps1:130
Line |
130 | $result = $json | ConvertFrom-Json -ea stop
| ~~~~~~~~~~~~~~~~~~~~~~~~~
| Conversion from JSON failed with error: Unexpected character encountered while parsing value: <. Path '', line
| 0, position 0.
atomicparsley: couldn't find '$.tag_name' in https://github.com/wez/atomicparsley/releases/latest
So I checked and assumed if ($json.checkver.PSObject.Properties.Count -eq 1) { $useGithubAPI = $true } is suspicious.
We can always go back to using checkver with api.github.com at url, but I think providing a unified approach should be better.
Did you configure a GitHub token?
https://github.com/ScoopInstaller/Scoop/blob/b588a06e41d920d2123ec70aee682bae14935939/bin/checkver.ps1#L221-L224
It works fine on my side.
PS> jq '{checkver}' .\bucket\atomicparsley.json
{
"checkver": {
"github": "https://github.com/wez/atomicparsley",
"jsonpath": "$.tag_name",
"regex": "([\\d]+\\.[\\d]+\\.[\\d]+\\.[a-f0-9]+)"
}
}
PS> .\bin\checkver.ps1 -App atomicparsley -f
https://api.github.com/repos/wez/atomicparsley/releases/latest
atomicparsley: couldn't match '([\d]+\.[\d]+\.[\d]+\.[a-f0-9]+)' in https://api.github.com/repos/wez/atomicparsley/releases/latest
I think I might misconfigure it with the wrong config name ;; checking it now
The behavior when no GitHub token is configured is inconsistent with when a token is set, and there's not even any related logs. This is very counterintuitive for contributors. Should we consider changing this behavior?
We can also share the result between checkver and hash extraction, since it is using github API for github assets now.
I completely agree with this. When refactoring webpage fetching for hash extraction in #6535, I also planned to cache the results to avoid hitting the GitHub API multiple times.
I think a warning might need to be added.
I am currently debugging why the regex does not work
Looks like the regex has one unnecessary group. It should be ([\d]+.[\d]+.~~[\d]+.~~[a-f0-9]+)
P.S, More debug logs are needed; Scoop is really hard to debug.
Yeah, I really agree with that. I have fixed that either
I am currently thinking on reuse the API result between checkver and autoupdate hash extraction, depending on how we go with https://github.com/ScoopInstaller/GithubActions/issues/64 and https://github.com/ScoopInstaller/Scoop/pull/6535. If we keep hitting the rate limit, I think a reuse here could be better, since it is containing calculated digest as well.
I've taken a look at the current behaviour and it looks incorrect to me for a number of reasons. But the primary issue in this context is the use of the /releases/tag/ prefix meaning that if you're using the API you're matching on the html_url value which also means that some characters will be escaped (e.g. the + character in the K0s tag). This makes no logical sense as the tag_name can be directly accessed and a simple regex can then be used.
The other obvious issue with the code is that it's running code for all potential cases without short circuiting. It also runs the least specific code before looking for more specific configuration. I'd expect to see conditional branching based on the checkver input which would be clearer to read and faster.
That all said by including a regex (or jsonpath) value the code skips calling the API as there are more than 1 keys under checkver. I can't see a logical reason why this behaviour is required as if we use the GitHub API directly in the URL it would work correctly?
I'm not sure if the current implementation can be modified/fixed, but the default spec when using the API should be equivalent to the following.
{
"checkver": {
"github": "https://github.com/org/repo",
"jsonpath": "$.tag_name",
"regex": "v(.+)"
}
}
Or if this needs to be a new behaviour, maybe something like this.
{
"checkver": {
"github": "https://github.com/org/repo",
"githubapi": true,
"jsonpath": "$.tag_name",
"regex": "v(.+)"
}
}
That all said by including a
regex(orjsonpath) value the code skips calling the API as there are more than 1 keys undercheckver. I can't see a logical reason why this behaviour is required as if we use the GitHub API directly in the URL it would work correctly?
As I mentioned above, there's a bug here -- this limitation doesn't really exist. It will still switch to GitHub API even when there are more than 1 key under checkver.
PS> jq '{checkver}' .\bucket\k0s.jsonpath
{
"checkver": {
"github": "https://github.com/k0sproject/k0s",
"jsonpath": "$.tag_name",
"regex": "v(\\d+\\.\\d+\\.\\d+\\+k0s\\.\\d+)"
}
}
PS> .\bin\checkver.ps1 -App k0s -f
https://api.github.com/repos/k0sproject/k0s/releases/latest
k0s: 1.34.2+k0s.0 (scoop version is 1.34.2+k0s.0)
Forcing autoupdate!
Autoupdating k0s
Could not find hash in https://api.github.com/repos/k0sproject/k0s/releases
Downloading k0s-v1.34.2 k0s.0-amd64.exe to compute hashes!
Loading k0s-v1.34.2+k0s.0-amd64.exe from cache
Computed hash: 934874c43314e96eb8bab21b40c851913e7cf4597a1632d350580ed195fd4656
Writing updated k0s manifest
In fact, it wasn't until #4557 that Scoop switched to using the GitHub API to retrieve version numbers. My guess is that it was probably added just to avoid breaking the original behavior. Originally, the GitHub mode regular expression matched version numbers by only recognizing digits and periods (/releases/tag/(?:v|V)?([\d.]+)) from Github page. These characters are not escaped even in URLs returned by the REST API, so the original regular expression was simply reused.
This is only my assumption for now -- we'll still need confirmation from @rashil2000 .
I think the primary issue is that when gh_token is not configured, the checkver logic falls back to HTML parsing instead of utilizing the GitHub API. Since the GitHub API supports unauthenticated requests (albeit with lower rate limits), we should default to using the API regardless of whether gh_token is set. Aligned with @z-Fng mentioned earlier.
This inconsistency is particularly problematic because our CI environment has gh_token configured, which means the behavior contributors experience locally (without tokens) differs from what runs in CI. This creates confusion and makes it harder to debug issues, as the two environments execute different code paths.
This reminds me of the railroad gauge story: the standard railroad track width of 4 feet 8.5 inches originated from the width of old horse-drawn wagons, which in turn came from ancient Roman chariot ruts. We carried forward a measurement from Roman times simply because "that's how it's always been done."
Similarly, maintaining backward compatibility with HTML parsing just because it was the original implementation can lock us into suboptimal design decisions. While backward compatibility has value, we shouldn't let it prevent us from adopting better solutions - especially when the GitHub API is more reliable, maintainable, and already available for anonymous use.
Sometimes the right path forward means breaking from legacy approaches rather than perpetually accommodating them. The API-first approach is objectively better: it's less fragile, officially supported, and actually unifies behaviour across authenticated and unauthenticated scenarios rather than creating divergent code paths.
One other issue is that we are not sure about how and why if ($json.checkver.PSObject.Properties.Count -eq 1) { $useGithubAPI = $true } works. or maybe we should write $useGithubAPI=$true without extra checks.
I am currently busy this week and might reply later, and I will have a detailed look at the checkver module probably next week, and think on how we should touch it.