Some pages have 'no longer live' next to OSI page links, but the links are not dead

Open kpfleming opened this issue 2 years ago • 11 comments

Example: https://spdx.org/licenses/GPL-3.0-only.html

This has [no longer live] next to the opensource.org link for the license, but that text does not appear in the source file for the page, and the link is not dead (it works fine).

kpfleming avatar Aug 23 '22 15:08 kpfleming

The links are automatically checked by the licenselistpublisher. When I get some time, I'll investigate why the test is failing. The code that does this check is here: https://github.com/spdx/LicenseListPublisher/blob/fd6416916438e6a50b579ac68a313b760c8d8f4c/src/org/spdx/crossref/Live.java#L46

goneall avatar Aug 23 '22 16:08 goneall

Update: When I run the LicenseListPublisher locally, it reports the link as live. But when it is run by the GitHub Action, it shows the link as not accessible.

From a short sampling of the licenses containing cross references with isLive = false, it looks like all OSI web pages are listed as not live.

Based on this analysis, it looks like there is something unique about the Github action environment that is causing the non-response on the HTTPS request.

@kpfleming Do you happen to know anyone at OSI that may have information on any website protections that may be causing this behavior?

goneall avatar Aug 24 '22 01:08 goneall

We could do a "hack" and allow all OSI web pages to be "live" even if they may really be dead links. I would be much more comfortable, however, if we could identify the root cause of this issue and resolve it.

goneall avatar Aug 24 '22 01:08 goneall

Possibly @smaffulli could help out here.

kpfleming avatar Aug 24 '22 10:08 kpfleming

@smaffulli - The code sets the following User-Agent request header to fake a browser request:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36

Is there anything else the OSI website may be looking for to validate the request? It's interesting this runs fine on my local machine but fails when running under a Github action.

The test will fail if a response code other than 200, 304, 301, or 302 is returned. It will also fail if the response times out after 8.5 seconds.
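For readers following along, the check described above can be approximated with a minimal Python sketch. This is not the LicenseListPublisher code (that lives in the Java file linked earlier); only the accepted status codes, the 8.5 second timeout, and the User-Agent string come from this comment. The names `is_live_status`, `is_live`, and `_NoRedirect` are hypothetical.

```python
import urllib.error
import urllib.request

# Values quoted in the comment above; everything else here is illustrative.
LIVE_STATUS_CODES = {200, 301, 302, 304}
TIMEOUT_SECONDS = 8.5
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Keep urllib from following redirects so 301/302 can be observed."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def is_live_status(code):
    """Return True if the status code counts as 'live' under the rule above."""
    return code in LIVE_STATUS_CODES


def is_live(url):
    """Request the URL once and apply the status-code and timeout rule."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener.open(request, timeout=TIMEOUT_SECONDS) as response:
            return is_live_status(response.status)
    except urllib.error.HTTPError as err:
        # Raised for non-2xx responses, including the unfollowed redirects.
        return is_live_status(err.code)
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return False
```

A blocked request (for example a 403 from a proxy in front of the site) or a dropped connection would make `is_live` return False, which matches the symptom seen in CI.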

goneall avatar Aug 24 '22 18:08 goneall

thanks @kpfleming for noticing this and @goneall for looking into it. I did notice it was also doing this on GPL-3.0-or-later but didn't go far enough to realize it was all of them! Would be good to have this fixed at the OSI level for sure

jlovejoy avatar Aug 24 '22 19:08 jlovejoy

argh, sorry for that @jlovejoy @goneall... I suspect this has to do with the fact that we had to put the website behind Cloudflare because the machine was under too much stress. We're migrating the website to a new server but that will still take a few months... Do any of you know how to whitelist requests coming from GitHub Actions on Cloudflare? Any other suggestions?

smaffulli avatar Aug 25 '22 03:08 smaffulli

I looked through the Cloudflare documentation and didn't find any obvious solutions.

I wonder if the code that is validating the URLs looks like a DDoS attack to Cloudflare. There are quite a few URLs in the SPDX license list that point to the OSI website.

goneall avatar Aug 27 '22 02:08 goneall

I thought of one other possible solution. We could retrieve the page https://opensource.org/licenses/alphabetical and if the URL shows up as an href in the list of licenses, we could call it valid.

It would depend on 2 things to work:

  • the page https://opensource.org/licenses/alphabetical stays where it is
  • the page is always up to date

It's a bit hacky, but much less hacky than trusting all opensource.org pages to be valid.
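A rough sketch of that idea in Python, assuming the index page lists each license as an ordinary `<a href>` link (the real page markup is not shown in this thread, and the sample HTML in the usage note is invented). `HrefCollector`, `listed_urls`, and `is_listed` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

INDEX_URL = "https://opensource.org/licenses/alphabetical"


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.add(value)


def listed_urls(index_html, base_url=INDEX_URL):
    """Return absolute URLs linked from the index page, without trailing slashes."""
    collector = HrefCollector()
    collector.feed(index_html)
    return {urljoin(base_url, href).rstrip("/") for href in collector.hrefs}


def is_listed(url, index_html):
    """Treat a cross-reference as valid if the index page links to it."""
    return url.rstrip("/") in listed_urls(index_html)
```

With an invented page such as `<a href="/licenses/MIT">MIT</a>`, `is_listed("https://opensource.org/licenses/MIT", page)` would be true. Only the index page is fetched, so the number of requests to the site drops to one.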

@smaffulli - let me know what you think.

goneall avatar Aug 27 '22 02:08 goneall

@goneall that approach may work for a while but... The https://opensource.org/licenses/alphabetical page will stay where it is but its structure will change soon (we're updating the website). The good news is that the licenses will be moved to a place with more structure. How is that code working now? which url is it checking? If it's api.opensource.org I may just put that out of Cloudflare and see if that's the issue.

smaffulli avatar Aug 28 '22 00:08 smaffulli

How is that code working now? which url is it checking?

@smaffulli The code is rather generic in that any external URL specified in the license XML is checked by requesting that specific web page URL. This code is in the LicenseListPublisher.

If it's api.opensource.org I may just put that out of Cloudflare and see if that's the issue.

Actually, I could use the api rather than the license page - a much more maintainable approach. I'll create a PR with the change. We can see if that change alone fixes it or if you need to move that page out from under cloudflare. I'll post an update once the changes are deployed and tested in production.
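To illustrate the API-based approach for anyone reading later, here is a hedged Python sketch. It is not the code from the PR. It assumes the API returns JSON in which license page links appear as string values under a `url` key somewhere in the payload; that shape is an assumption, and `collect_urls` and `is_known_osi_url` are hypothetical names.

```python
import json

# Endpoint mentioned later in this thread; the response shape is assumed.
API_URL = "https://api.opensource.org/licenses/"


def collect_urls(node):
    """Walk parsed JSON and gather every string stored under a 'url' key."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "url" and isinstance(value, str):
                found.add(value.rstrip("/"))
            else:
                found |= collect_urls(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_urls(item)
    return found


def is_known_osi_url(url, api_payload):
    """Treat an OSI cross-reference as valid if the API payload mentions it."""
    return url.rstrip("/") in collect_urls(json.loads(api_payload))
```

The payload would be downloaded once from `API_URL` and reused for every OSI cross-reference, so the whole license list needs a single HTTP request to the OSI side.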

goneall avatar Sep 08 '22 18:09 goneall

@smaffulli - Could you move api.opensource.org outside of Cloudflare? I just created a PR which now uses the API to validate OSI URLs. With the new PR it is not able to access the API, with the same symptoms as when trying to access the license pages. There is only a single HTTP request, so I'm not sure why Cloudflare is blocking it unless it just blocks all requests from the GitHub CI servers.

Once you've moved it, I'll try re-running the CI.

FYI - the PR is at https://github.com/spdx/LicenseListPublisher/pull/143

It is failing a unit test I put in specifically to check whether the API page can be accessed. The tests pass on my local machine; they fail only when running in the GitHub CI.

goneall avatar Oct 21 '22 23:10 goneall

@goneall done, the api service is responding now.

smaffulli avatar Oct 22 '22 14:10 smaffulli

Thanks @smaffulli - the unit tests passed :)

I'll close this issue once we've updated the LicenseListPublisher in the CI for this repo.

goneall avatar Oct 22 '22 16:10 goneall

@smaffulli https://api.opensource.org/licenses/ is no longer accessible from Github CI servers.

Can you check and see if there have been any changes to the networking on the OSI side?

goneall avatar Apr 18 '23 20:04 goneall

Hi @goneall! We recently migrated the API service to a new machine. The certificate has changed (api now has its own certificate, which renews independently from opensource.org) and the Cloudflare filter has been removed. Possibly the former caused the problem (a warning that the certificate has changed?). Please let us know if you are still encountering problems and we'll try to troubleshoot. Thanks!

nickvidal avatar Apr 19 '23 00:04 nickvidal

@nickvidal I just checked and it now seems to be working.

I'll close this issue as resolved.

Thanks!

goneall avatar Apr 19 '23 00:04 goneall