mobility-database-catalogs

bug: 403 errors when validating a URL

Open emmambd opened this issue 2 years ago • 4 comments

What problem is your feature request trying to solve?

One of the GitHub workflow checks evaluates whether the GTFS feed can be downloaded. This check often fails with requests.exceptions.HTTPError: 403 Client Error: Forbidden for url due to SSL certificate errors, even when the URL can be downloaded manually. This blocks adding new sources. Example here.

Other examples where this is affecting our ability to get feeds: mdb-534 http://www.centro.org/CentroGTFS/CentroGTFS.zip

Describe the solution you'd like

The best solution is currently unclear. Any thoughts would be very useful!

How will we know when this is done?

As a user, I can add a source whenever the source is downloadable manually.

emmambd avatar Sep 08 '22 19:09 emmambd

It seems like part of the problem is that the request header does not identify a User-Agent, as described here. Adding a User-Agent has solved the problem for many URLs locally, so it would be useful to add it to the workflow and code.
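As a minimal sketch of what sending a User-Agent looks like: the example below uses the standard-library urllib rather than the requests library that the workflow uses, only to stay dependency-free. The USER_AGENT value and the function names are hypothetical, not taken from the repository. With requests, the same header dictionary could be passed as requests.get(url, headers=...).

```python
import urllib.request

# Hypothetical User-Agent string; any descriptive value that identifies the client should do.
USER_AGENT = "mobility-database-catalogs-url-check/0.1"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url: str, timeout: float = 30.0) -> bytes:
    """Download the URL; urlopen raises urllib.error.HTTPError on a 403 response."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Some servers reject clients that send no User-Agent or a default library one, which is why this can turn a 403 into a successful download. It does not address certificate problems.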

maximearmstrong avatar Sep 08 '22 20:09 maximearmstrong

This PR seems to have decreased instances of the error, but it still has not removed all of them.

emmambd avatar Sep 20 '22 17:09 emmambd

The example referenced (http://datos.gob.cl/dataset/c77c9a50-6dd1-449d-b5ab-947ec0139b31/resource/a4edcf07-0657-456d-bbbc-54b2aec1de8d/download/coquimbo10feb16.zip) fails checks for complete certificate chain in a couple of popular SSL checkers:

[Two screenshots of the SSL checker results, captured 2022-10-11]

It looks like it's using an SSL root that's not widely distributed yet. Fixing this would be a matter of either updating the root certificates installed at the operating system level or instructing the command that checks whether the URLs can be downloaded to ignore SSL errors. These appear to be Amazon-issued certificates, so it's surprising that the GitHub runners don't come with them installed. Bumping the runner to ubuntu-22.04 may fix the issue, but the current runner is ubuntu-20.04, which is LTS, and it's surprising that it wouldn't have Amazon's CA backported into the default trusted root certs.

Edit: it looks like Python doesn't use system certificates by default, so this could be a matter of the Python version. This Stack Overflow answer explains how to tell Python to use the system certificates, which might be a good idea here: https://stackoverflow.com/a/42982144/964125
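One way to point a Python download at the system certificates can be sketched as follows. This is an illustration under assumptions, not necessarily what the linked answer proposes: the bundle path is the usual Debian/Ubuntu location, and the function names are hypothetical. It uses the standard-library ssl and urllib modules. With the requests library, a bundle path can instead be passed through the verify parameter or the REQUESTS_CA_BUNDLE environment variable.

```python
import os
import ssl
import urllib.request

# Assumed location of the system CA bundle on Debian/Ubuntu runners.
SYSTEM_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def system_ssl_context(cafile: str = SYSTEM_CA_BUNDLE) -> ssl.SSLContext:
    """Return a verifying SSL context that trusts the system CA bundle.

    Falls back to the interpreter's default trust store if the bundle is missing.
    """
    if os.path.exists(cafile):
        return ssl.create_default_context(cafile=cafile)
    return ssl.create_default_context()


def download(url: str, timeout: float = 30.0) -> bytes:
    """Download the URL, verifying the server certificate against the system bundle."""
    with urllib.request.urlopen(url, timeout=timeout, context=system_ssl_context()) as response:
        return response.read()
```

Certificate verification and hostname checking stay enabled here, so a feed whose chain is incomplete would still fail rather than being silently accepted.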

There's also a bigger question of how strict SSL checking should be when deciding whether a feed is valid. Using the system-installed root certs that come with ubuntu-latest, rather than depending on what ships with the particular Python version being installed, is probably a good baseline. Anything that fails SSL checks under that baseline should probably be marked as a failing feed.

themightychris avatar Oct 11 '22 16:10 themightychris

@themightychris Thank you for digging into this! Re: ubuntu, it looks like GitHub Actions hasn't yet made ubuntu-22.04 the default for ubuntu-latest, which is why the workflow is running an older version.

I added a draft PR that points to the system certificates to see whether that would affect the workflow test, but it does not seem to have made a difference (which could definitely be a problem on my end). Do you mind taking a look?

As a short-term solution, we've talked about ignoring the test when it fails and manually testing that the URL works and downloads a ZIP file. This is obviously not ideal, but it may help with adding feeds while we debug and evaluate the certs problem.
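The "downloads a ZIP file" part of that manual check could be scripted roughly as below. This is only a sketch: the function name is hypothetical, and it inspects bytes that have already been downloaded by whatever means worked (browser, curl, or a script).

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if the downloaded bytes have a valid ZIP structure."""
    return zipfile.is_zipfile(io.BytesIO(payload))
```

An HTML error page returned with a 200 status would fail this check, which guards against treating a "successful" download of the wrong content as a working feed URL.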

emmambd avatar Oct 20 '22 21:10 emmambd