Add logic to verify URLs using HTML meta tag

Open facutuesca opened this issue 1 year ago • 3 comments

Part of https://github.com/pypi/warehouse/issues/8635, this PR adds a function to verify arbitrary URLs by parsing their HTML and looking for a specific meta tag.

Concretely, a webpage with a meta tag in its <head> element like the following:

<meta content="package1 package2" namespace="pypi.org" rel="me" />

would pass validation for the package1 and package2 PyPI projects.
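
As a rough illustration of the check described above, here is a minimal sketch that collects project names from matching meta tags. It uses the standard library's `html.parser` rather than lxml (which is what this PR uses), and the names `_MetaTagCollector` and `page_verifies_project` are hypothetical, not the PR's actual API. For brevity it does not check that the tag sits inside the `<head>` element.

```python
from html.parser import HTMLParser


class _MetaTagCollector(HTMLParser):
    """Collect project names listed in <meta namespace="pypi.org" rel="me"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.projects: set[str] = set()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("namespace") == "pypi.org" and attributes.get("rel") == "me":
            # The content attribute is a whitespace-separated list of project names.
            self.projects.update((attributes.get("content") or "").split())


def page_verifies_project(html: str, project_name: str) -> bool:
    """Return True if the page lists project_name in a matching meta tag.

    Hypothetical helper: stdlib stand-in for the lxml-based parsing in the PR.
    """
    collector = _MetaTagCollector()
    collector.feed(html)
    return project_name in collector.projects
```

With the tag shown above, `page_verifies_project(html, "package1")` would be true, while a project not listed in `content` would fail.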

This PR only adds the function and its tests. The function is not used anywhere yet.

This implementation takes into account the discussion in the issue linked above, starting with this comment: https://github.com/pypi/warehouse/issues/8635#issuecomment-2292013010.

Concretely:

  • URLs must use https://
  • The hostname must be a regular name (i.e.: domain.tld), it cannot be an IP address (e.g: https://100.100.100.100)
  • If a port is present, it must be 443 (we could also remove this, and require that no port is present)
  • Before getting the HTML, we resolve the URL's hostname to an IP address, and check that it's a global IP and not a private or shared IP
  • We limit the amount of content we download to 1024 bytes (this number was an arbitrary choice, it's open to changes)
  • HTML is parsed using lxml, which recovers from partial HTML, meaning only reading the first N bytes should be fine as long as it contains the tag we are looking for
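
The URL and address rules in the list above could be sketched roughly as follows, using only the standard library. The function names `url_is_acceptable` and `address_is_global` are hypothetical and not taken from the PR; DNS resolution, the download size limit, and the HTML parsing are left out, so this only covers the checks that need no network access.

```python
import ipaddress
from urllib.parse import urlsplit


def url_is_acceptable(url: str) -> bool:
    """Apply the static URL rules: https only, named host, port 443 or none.

    Hypothetical helper, not the PR's actual function.
    """
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError for a malformed port
    except ValueError:
        return False
    if parts.scheme != "https":
        return False
    hostname = parts.hostname
    if not hostname:
        return False
    try:
        ipaddress.ip_address(hostname)
    except ValueError:
        pass  # not an IP literal, so it is treated as a regular name
    else:
        return False  # IP addresses are rejected as hostnames
    return port in (None, 443)


def address_is_global(address: str) -> bool:
    """Check a resolved address: True only for globally routable IPs.

    Private ranges and the shared address space (100.64.0.0/10) are not global.
    """
    return ipaddress.ip_address(address).is_global
```

In a full implementation, `address_is_global` would be applied to the address the hostname resolves to before any HTML is fetched.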

I'm opening this PR with only the verification logic since it's the part that requires the most review and discussion. Once it's done we can see how to integrate it with the current upload flow (probably as an asynchronous task).

cc @woodruffw @ewjoachim

facutuesca avatar Aug 29 '24 17:08 facutuesca

  • If a port is present, it must be 443 (we could also remove this, and require that no port is present)

I'm +1 on removing this outright -- I think the volume of legitimate users who actually need to explicitly list a port is probably vanishingly small 🙂

woodruffw avatar Aug 29 '24 17:08 woodruffw

(I just opened a random blog: I typed "some blog" into Google, got a list of the best blogs per category, and under "education" the first link was https://blog.ed.ted.com/. Its complete <head> tag is about 10k bytes. I think 100k bytes is probably much safer than 1024.)

ewjoachim avatar Aug 30 '24 12:08 ewjoachim

(I just opened a random blog: I typed "some blog" into Google, got a list of the best blogs per category, and under "education" the first link was https://blog.ed.ted.com/. Its complete <head> tag is about 10k bytes. I think 100k bytes is probably much safer than 1024.)

Changed to 100000 bytes

facutuesca avatar Sep 03 '24 13:09 facutuesca

what’s the state of this? it’s super unfortunate that everything else being verified effectively hides our readthedocs link a little.

flying-sheep avatar Aug 14 '25 08:08 flying-sheep

@flying-sheep I believe there are currently no plans for merging this. There is an issue for improving the visibility of unverified links here: https://github.com/pypi/warehouse/issues/18199

facutuesca avatar Aug 14 '25 10:08 facutuesca

One last thought: we should probably think about how this function should surface the reason why validation failed. I think it's not going to be very helpful (for users or admins) if validation just fails and we can't explain why. I'm not sure this needs to be shared with the user but at least having a way to run the function and get something back other than False is going to be important here.
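
One possible shape for this, as a sketch only: return a small result object that is truthy on success and carries a reason on failure. `VerificationOutcome` and `check_scheme` are hypothetical names used for illustration and do not exist in the PR.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class VerificationOutcome:
    """Hypothetical result type: a verdict plus a human-readable reason."""

    verified: bool
    reason: Optional[str] = None

    def __bool__(self) -> bool:
        # Lets callers keep writing `if outcome:` as they would with a bool.
        return self.verified


def check_scheme(url: str) -> VerificationOutcome:
    """Illustrative single check that explains why it failed."""
    scheme = urlsplit(url).scheme
    if scheme != "https":
        return VerificationOutcome(False, f"URL scheme is {scheme!r}, expected 'https'")
    return VerificationOutcome(True)
```

Callers that only need a yes/no answer can still use the result in a boolean context, while admins or logs can read `reason`.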

di avatar Aug 14 '25 15:08 di