license-list-XML
Some pages have 'no longer live' next to OSI page links, but the links are not dead
Example: https://spdx.org/licenses/GPL-3.0-only.html
This has [no longer live] next to the opensource.org link for the license, but that text does not appear in the source file for the page, and the link is not dead (it works fine).
The links are automatically checked by the LicenseListPublisher. When I get some time, I'll investigate why the test is failing. The code that does this check is here: https://github.com/spdx/LicenseListPublisher/blob/fd6416916438e6a50b579ac68a313b760c8d8f4c/src/org/spdx/crossref/Live.java#L46
Update: When I run the LicenseListPublisher locally, it reports the link as live. But when it is run with the GitHub Action, it shows the link as not accessible.
From a short sampling of the licenses containing cross references with isLive = false, it looks like all OSI web pages are listed as not live.
Based on this analysis, it looks like there is something unique about the GitHub Action environment that is causing the non-response on the HTTPS request.
@kpfleming Do you happen to know anyone at OSI that may have information on any website protections that may be causing this behavior?
We could do a "hack" and allow all OSI web pages to be "live" even if they may really be dead links. I would be much more comfortable, however, if we could identify the root cause of this issue and resolve it.
Possibly @smaffulli could help out here.
@smaffulli - The code sets the following as the request header to fake a browser request:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Is there anything else the OSI website may be looking for to validate the request? It's interesting that this runs fine on my local machine but fails when running under a GitHub Action.
The test will fail if a response code other than 200, 304, 301, or 302 is returned. It will also fail if the response times out after 8.5 seconds.
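For readers following along, the pass/fail rule described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Java code in Live.java; the names `fetch_status` and `is_live` are hypothetical, and only the accepted status codes, the 8.5 second timeout, and the User-Agent string come from this thread.

```python
import urllib.error
import urllib.request

# Values taken from the discussion above.
ACCEPTED_STATUS = {200, 301, 302, 304}
TIMEOUT_SECONDS = 8.5
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
)


def fetch_status(url, timeout=TIMEOUT_SECONDS):
    """Return the HTTP status code for url, or None on timeout or connection failure.

    Hypothetical helper. Note that urllib follows redirects on its own, so
    301/302 are usually resolved before a status is returned here.
    """
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 403 or 404).
        return err.code
    except OSError:
        # Covers URLError, timeouts, and other connection-level failures.
        return None


def is_live(url, fetch=fetch_status):
    """A link counts as live only if the status code is in the accepted set."""
    return fetch(url) in ACCEPTED_STATUS
```

The `fetch` parameter is there so the rule can be exercised without network access, for example `is_live(url, fetch=lambda u: 403)`. A request blocked by a protection layer would typically surface here as a non-accepted status or a timeout, which matches the symptom reported in CI.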
thanks @kpfleming for noticing this and @goneall for looking into it. I did notice it was also doing this on GPL-3.0-or-later but didn't go far enough to realize it was all of them! Would be good to have this fixed at the OSI level for sure
argh, sorry for that @jlovejoy @goneall... I suspect this has to do with the fact that we had to put the website behind Cloudflare because the machine was getting under too much stress. We're migrating the website to a new server but that will take a few months still... Do any of you know how to whitelist requests coming from GitHub Actions on Cloudflare? Any other suggestions?
I did some looking in the Cloudflare documentation and didn't find any obvious solutions.
I wonder if the code that is validating the URLs looks like a DDoS attack to Cloudflare. There are quite a few URLs in the SPDX license list that point to the OSI website.
I thought of one other possible solution. We could retrieve the page https://opensource.org/licenses/alphabetical and if the URL shows up as an href in the list of licenses, we could call it valid.
It would depend on 2 things to work:
- the page https://opensource.org/licenses/alphabetical stays where it is
- the page is always up to date
It's a bit hacky, but much less hacky than trusting all opensource.org pages to be valid.
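A rough sketch of that idea in Python, assuming the index page lists each license as an ordinary `<a href="...">` link. The helper names (`HrefCollector`, `listed_urls`, `is_listed`) and the trailing-slash normalization are illustrative assumptions, not part of the LicenseListPublisher:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

INDEX_URL = "https://opensource.org/licenses/alphabetical"


def normalize(url):
    """Ignore a trailing slash when comparing URLs (an assumption of this sketch)."""
    return url.rstrip("/")


class HrefCollector(HTMLParser):
    """Collect every <a href="..."> target, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.add(normalize(urljoin(self.base_url, value)))


def listed_urls(index_html, base_url=INDEX_URL):
    """Return the set of link targets found in the index page's HTML."""
    collector = HrefCollector(base_url)
    collector.feed(index_html)
    return collector.hrefs


def is_listed(url, index_html):
    """Treat a license URL as valid if the index page links to it."""
    return normalize(url) in listed_urls(index_html)
```

This needs only one request for the index page instead of one request per license URL, but it inherits both dependencies listed above: the page must stay at the same address and must be kept up to date.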
@smaffulli - let me know what you think.
@goneall that approach may work for a while but... The https://opensource.org/licenses/alphabetical page will stay where it is but its structure will change soon (we're updating the website). The good news is that the licenses will be moved to a place with more structure. How is that code working now? which url is it checking? If it's api.opensource.org I may just put that out of Cloudflare and see if that's the issue.
> How is that code working now? which url is it checking?
@smaffulli The code is rather generic in that any external URL specified in the license XML is checked by pinging that specific web page URL. This code is in the LicenseListPublisher.
> If it's api.opensource.org I may just put that out of Cloudflare and see if that's the issue.
Actually, I could use the api rather than the license page - a much more maintainable approach. I'll create a PR with the change. We can see if that change alone fixes it or if you need to move that page out from under cloudflare. I'll post an update once the changes are deployed and tested in production.
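To illustrate the API-based approach, here is a minimal Python sketch that validates URLs against a single API response. It assumes the response is a JSON array of license objects, each with a `links` list whose entries carry a `url` field; that shape is an assumption of this sketch, not something confirmed in this thread, and `osi_urls` and `is_osi_url_valid` are hypothetical names.

```python
import json


def osi_urls(api_json_text):
    """Collect every link URL from the API response text.

    Assumed shape (not confirmed here):
    [{"id": "...", "links": [{"note": "...", "url": "..."}]}, ...]
    """
    urls = set()
    for license_entry in json.loads(api_json_text):
        for link in license_entry.get("links", []):
            url = link.get("url")
            if url:
                # Ignore a trailing slash when comparing URLs.
                urls.add(url.rstrip("/"))
    return urls


def is_osi_url_valid(url, api_json_text):
    """Treat an OSI URL as valid if the API response lists it."""
    return url.rstrip("/") in osi_urls(api_json_text)
```

As with the index-page idea, this replaces many page requests with a single request, so it should look far less like a burst of traffic to any protection layer in front of the site.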
@smaffulli - Could you move api.opensource.org outside of Cloudflare? I just created a PR which now uses the API to validate OSI URLs. With the new PR it is not able to access the API, with the same symptoms as when trying to access the license pages. There is only a single HTTP request, so I'm not sure why Cloudflare is blocking it unless it just blocks all requests from the GitHub CI servers.
Once you've moved it, I'll try re-running the CI.
FYI - the PR is at https://github.com/spdx/LicenseListPublisher/pull/143
It is failing a unit test I put in to specifically check if the API page can be accessed. The tests pass on my local machine; they only fail when running in the GitHub CI.
@goneall done, the api service is responding now.
Thanks @smaffulli - the unit tests passed :)
I'll close this issue once we've updated the LicenseListPublisher in the CI for this repo.
@smaffulli https://api.opensource.org/licenses/ is no longer accessible from GitHub CI servers.
Can you check and see if there have been any changes to the networking on the OSI side?
Hi @goneall! We recently migrated the API service to a new machine. The certificate has changed (api now has its own certificate, which renews independently from opensource.org) and the Cloudflare filter has been removed. Possibly the former caused a problem (a warning that the certificate has changed?). Please let us know if you are still encountering problems and we'll try to troubleshoot. Thanks!
@nickvidal I just checked and it now seems to be working.
I'll close this issue as resolved.
Thanks!