
Change to https

Open MarcSkovMadsen opened this issue 1 year ago • 6 comments

Fixes https://github.com/holoviz/hvplot/issues/862

Notes

  • I found that changing to https inside SVGs makes them invalid, so I did not do it there.
  • I found some invalid urls and fixed them.
  • There is one invalid url left that I cannot fix; see https://github.com/holoviz/hvplot/issues/870
  • I updated the test_links.py file to test all files and urls, but some domains refuse to respond to automated requests, so I exempted those.
  • I manually opened and ran all the notebooks to verify they still work. I discovered some unrelated issues:
    • https://github.com/holoviz/holoviews/issues/5412
    • https://github.com/holoviz/hvplot/issues/871
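For context, the SVG caveat above is most likely about XML namespace URIs: the value of `xmlns` is an opaque identifier compared byte-for-byte, not a link, so rewriting `http://` to `https://` produces a different, unrecognized namespace. A minimal stdlib demonstration (not part of this PR):

```python
# An XML namespace URI is an identifier, not a fetchable link. Rewriting
# http:// to https:// in an SVG's xmlns yields a *different* namespace,
# so tools no longer recognize the element as SVG.
import xml.etree.ElementTree as ET

good = ET.fromstring('<svg xmlns="http://www.w3.org/2000/svg"/>')
bad = ET.fromstring('<svg xmlns="https://www.w3.org/2000/svg"/>')

print(good.tag)  # {http://www.w3.org/2000/svg}svg  -- the real SVG namespace
print(bad.tag)   # {https://www.w3.org/2000/svg}svg -- a different namespace
```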

MarcSkovMadsen avatar Aug 28 '22 06:08 MarcSkovMadsen

The tests fail due to the obsolete url mentioned in https://github.com/holoviz/hvplot/issues/870. I can't find a working replacement for it.

Let me know if there is more I should do. Thanks.

MarcSkovMadsen avatar Aug 28 '22 10:08 MarcSkovMadsen

I will just add the invalid URL to SKIP_URLS. It is "only" metadata, and I feel it is more correct to keep the site from which the data was originally downloaded, even though it does not exist anymore.

Alternatively, you could update the URL to the new site I mentioned in issue holoviz/hvplot#870, but I would at least verify that it is actually possible to get the data from the new site.

So there are two ways forward: one is easy, and one is hard. I will let it be up to you which one to choose.

hoxbro avatar Aug 29 '22 07:08 hoxbro

I would prefer to have a valid url from the new site.

I tried to find it without luck. As I don't have the original context, I would not be able to tell whether I had found the right url. I don't even know how Intake works or what the purpose of the metadata is, and for now I don't need to learn.

Can someone else help find the correct url?

MarcSkovMadsen avatar Aug 29 '22 10:08 MarcSkovMadsen

Think this is probably the best link: https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2019

philippjfr avatar Sep 05 '22 09:09 philippjfr

Alternatively there's also: https://crime-data-explorer.fr.cloud.gov/pages/explorer/crime/crime-trend

philippjfr avatar Sep 05 '22 09:09 philippjfr

I would change test_links.py to the following. Even though the fixtures are nice, I think we should check all URLs in one go. This both handles duplicate URLs appearing in different files and avoids the overhead of starting up a ThreadPoolExecutor per file. If a URL fails, it should be easy to find the file by searching for the URL.

I also changed it to a simpler regex and to using requests.head, which only fetches the headers of the site. Note, though, that anaconda.org gives status code 400 for a HEAD request but 200 for a GET.

"""Urls in docstrings etc. should be valid and secure, i.e.

- exist, i.e. provide a 200 response
- use https:// instead of http:// unless
    - https:// is not supported by the web site
    - https:// cannot be used. For example in SVGs.
"""

import pathlib
import re
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import requests

URL_REGEX = re.compile(r"(https?:\/\/?[\w/\-?=%.]+\.[\w/\-&?=%.]+)")
ROOT = pathlib.Path(__file__).parents[2]
POST_FIXES = ["*.py", "*.ipynb", "*.md", "*.yaml", "*.yml"]
SKIP_URLS = [
    "https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm",  # Not available anymore
    "http://weegee.vision.ucmerced.edu/datasets/landuse.html",  # Not available anymore
    "https://github.com/holoviz/hvplot",  # A lot of issue/PR links
    "http://octopart.com",  # Located in notebook output, not relevant
    "https://jsperf.com",  # Located in notebook output, not relevant
    "https://web.archive.org",  # Can be slow and does not work with the regex
    "https://anaconda.org/anaconda/hvplot",  # Gives status code 400 for HEAD requests
    "https://anaconda.org/conda-forge/hvplot",  # Gives status code 400 for HEAD requests
    "https://anaconda.org/pyviz/hvplot",  # Gives status code 400 for HEAD requests
    "https://bugs.chromium.org/p/chromium/issues",  # Gives status code 405 (needs login)
    "https://mybinder.org",  # Gives status code 405
]
SKIP_FILES = [
    ".ipynb_checkpoints",  # Can cause problems when running tests locally
]


def _get_files_to_check():
    for post_fix in POST_FIXES:
        for file in ROOT.rglob(post_fix):
            if any(skip in str(file) for skip in SKIP_FILES):
                continue
            yield file


def _find_urls(text):
    urls = set()
    for url in set(re.findall(URL_REGEX, text)):
        # The regex can capture a trailing dot, e.g. at the end of a sentence
        if url.endswith("."):
            url = url[:-1]
        # Strip the dot *before* the skip check, so skipped urls are
        # excluded even when they were matched with trailing punctuation
        if any(url.startswith(skip) for skip in SKIP_URLS):
            continue
        urls.add(url)

    return urls


def _verify_urls(urls):
    func = partial(requests.head, timeout=5)

    with ThreadPoolExecutor() as executor:
        # executor.map preserves input order, so zipping with urls below is safe
        futures = executor.map(func, urls)

    for url, response in zip(urls, futures):
        if not response.ok:
            raise ValueError(
                f"The url {url} responded with status {response.status_code}."
            )


def test_urls():
    urls = set()
    for file in _get_files_to_check():
        text = file.read_text()
        urls |= _find_urls(text)

    urls = sorted(urls)
    _verify_urls(urls)
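If the HEAD-vs-GET discrepancy (the anaconda.org 400s above) ever becomes a nuisance, one option is to retry rejected HEAD requests with a streaming GET. This is only a sketch, not part of the proposal above; `check_url` is a hypothetical helper, and `session` is any object with requests-style `.head()`/`.get()` methods (e.g. a `requests.Session()`):

```python
# Sketch: fall back to GET for servers that reject HEAD. `check_url` is
# a hypothetical helper; `session` would be a requests.Session() in
# practice, but is injected here so the logic can be tested offline.
def check_url(url, session, timeout=5):
    response = session.head(url, timeout=timeout, allow_redirects=True)
    if response.status_code in (400, 403, 405):
        # Some servers only answer GET; stream=True avoids downloading the body
        response = session.get(url, timeout=timeout, stream=True)
        response.close()
    return response.ok
```

The injected session also keeps the skip list smaller: URLs would only need an entry when both HEAD and GET fail.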

hoxbro avatar Sep 12 '22 09:09 hoxbro