JSON parsing error when installing a package using the `--find-links` option
Description
The related issue is described here: https://github.com/PaddlePaddle/Paddle/issues/44707
In brief, users can install the paddlepaddle package via the following command using pip 22.1.2, but the same command fails with 22.2.0.
pip install paddlepaddle-gpu==2.3.2.post111 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
So I think this is a regression.
The reported error is
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This happens because, since 22.2.0, pip tries to parse the returned HTML content as JSON. The related code is https://github.com/pypa/pip/blob/main/src/pip/_internal/index/collector.py#L327, which was introduced in #11158.
Although we can update the page to return valid JSON, keeping support for this HTML-format repository in pip would help us a lot during our upgrade period.
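For illustration, here is a minimal sketch of the failure (not pip's actual code): feeding an HTML body to json.loads raises exactly the JSONDecodeError shown above, which is what happens when pip treats this find-links page as a PEP 691 JSON index. The page content here is hypothetical and shortened.

```python
import json

# A shortened, hypothetical HTML find-links page body.
html_body = b'<html><body><a href="paddlepaddle_gpu-2.3.2.post111.whl">...</a></body></html>'

try:
    # pip 22.2.0's parse_links calls json.loads on the page content
    # when it decides the response is a PEP 691 JSON index.
    json.loads(html_body)
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```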
Expected behavior
pip can parse HTML-format repository pages.
pip version
22.2.0
Python version
3.9
OS
All OS
How to Reproduce
- works well:
  pip install --upgrade pip==22.1.2
  pip install paddlepaddle-gpu==2.3.2.post111 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
- failed:
  pip install --upgrade pip==22.2.0
  pip install paddlepaddle-gpu==2.3.2.post111 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
Output
File "/opt/homebrew/lib/python3.9/site-packages/pip/_internal/index/package_finder.py", line 794, in process_project_url
page_links = list(parse_links(index_response))
File "/opt/homebrew/lib/python3.9/site-packages/pip/_internal/index/collector.py", line 313, in wrapper_wrapper
return wrapper(CacheablePageContent(page))
File "/opt/homebrew/lib/python3.9/site-packages/pip/_internal/index/collector.py", line 308, in wrapper
return list(fn(cacheable_page.page))
File "/opt/homebrew/lib/python3.9/site-packages/pip/_internal/index/collector.py", line 327, in parse_links
data = json.loads(page.content)
File "/opt/homebrew/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/opt/homebrew/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/homebrew/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Code of Conduct
- [X] I agree to follow the PSF Code of Conduct.
This was fixed in a patch release of 22.2. Please upgrade pip to the latest version and confirm the issue is resolved.
I tested with pip 22.2.2, and it still occurs.
$ pip --version
pip 22.2.2 from /opt/homebrew/lib/python3.9/site-packages/pip (python 3.9)
It seems that the find-links page is ignoring the Accept: header and always returns HTML. This means the page technically violates the standard, so I'd say it's the server's responsibility to fix this.
Yes, an upgrade of the server's scripts is ongoing. I did a quick search in the Python community and have not found a similar case caused by this kind of non-strict server response. It's worth pip keeping this use case in mind when implementing PEP 691.
❯ http head https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html "Accept: application/vnd.pypi.simple.v1+json,application/vnd.pypi.simple.v1+html; q=0.1,text/html; q=0.01"
HTTP/1.1 200
Connection: keep-alive
Content-Disposition: inline;filename=f.txt
Content-Length: 54587
Content-Type: application/vnd.pypi.simple.v1+json;charset=UTF-8
Date: Tue, 30 Aug 2022 08:50:06 GMT
Server: BLB/22.06.1.2
Set-Cookie: PADDLEID=583d9ec38e7a33a3b86a4e395020e6c7; Max-Age=3600; Expires=Tue, 30-Aug-2022 09:50:06 GMT; HttpOnly
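For reference, here is a rough Python equivalent of the httpie check above, using requests. The URL and Accept header are taken from the report; this only inspects the declared Content-Type, not the body.

```python
import requests

# Ask for PEP 691 JSON first, with HTML fallbacks, and see what the
# server claims to return. (HEAD only; the body is not fetched.)
resp = requests.head(
    "https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html",
    headers={
        "Accept": "application/vnd.pypi.simple.v1+json,"
        "application/vnd.pypi.simple.v1+html; q=0.1,text/html; q=0.01"
    },
)
print(resp.status_code, resp.headers.get("Content-Type"))
# At the time of the report, this printed a JSON content type even
# though the body was HTML.
```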
The problem lies with the server, which is serving content with the wrong content-type header (and also messing up the HTTP status code line + whitespace rules). It should be responding with plain text/html, in which case pip will use the corresponding parser.
> It seems that the find-links page is ignoring the Accept: header and always returns HTML.
Hang on, --find-links isn't meant to use the simple API, and shouldn't be involving the Accept header at all. If you want to use the simple index API you should be using --extra-index-url. If we're trying to get a JSON response from a page pointed to by --find-links, that's a bug.
Ah, you're right @pfmoore.
I think we'll need to separate the two pieces of logic in our parser, since we shouldn't be trying to parse find-links as JSON.
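To make the suggested separation concrete, here is a hedged sketch (hypothetical function names, not pip's real parser): simple-API responses dispatch on the declared content type per PEP 691, while find-links pages are always treated as flat HTML link pages.

```python
import json
from html.parser import HTMLParser
from typing import Iterator, List


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _parse_html_links(content: bytes) -> Iterator[str]:
    parser = _HrefCollector()
    parser.feed(content.decode("utf-8", errors="replace"))
    yield from parser.hrefs


def parse_index_response(content: bytes, content_type: str) -> Iterator[str]:
    # Simple-API responses: dispatch on the declared content type (PEP 691).
    if content_type.split(";")[0].strip() == "application/vnd.pypi.simple.v1+json":
        for file in json.loads(content).get("files", []):
            yield file["url"]
    else:
        yield from _parse_html_links(content)


def parse_find_links_page(content: bytes) -> Iterator[str]:
    # find-links pages are not the simple API: never attempt JSON here,
    # regardless of what Content-Type the server claims.
    yield from _parse_html_links(content)
```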
We have updated page: https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
The install command with the -f option now works on pip 22.2.0 and later versions.
To reproduce this bug, you may now need a different setup that mimics this case.
I thought about this a bit more. While the application/vnd.pypi.simple.v1+json content type is not designed for find-links pages, the problem here is that the find-links server reports an HTML response as application/vnd.pypi.simple.v1+json. This is probably due to the server being naively implemented, simply copying the first entry in Accept: into the response without actually checking it.
IMO it is not unreasonable to expect the server to at least respond with Content-Type: text/html instead, so while this is a new, technically unnecessary requirement on find-links pages, I think it is arguably a reasonable one and can be retained.
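To illustrate the suspected naive behavior, here is a hypothetical sketch (not any real server's code) contrasting a server that echoes the first Accept entry with one that honestly reports the only format it can serve.

```python
def naive_content_type(accept_header: str) -> str:
    # Echo the first Accept entry back, without checking whether we can
    # actually produce that format - the suspected behavior here.
    return accept_header.split(",")[0].split(";")[0].strip()


def honest_content_type(accept_header: str) -> str:
    # This server only ever produces HTML, so it should say so,
    # regardless of what the client asked for.
    return "text/html"


accept = "application/vnd.pypi.simple.v1+json,text/html; q=0.01"
print(naive_content_type(accept))   # application/vnd.pypi.simple.v1+json (wrong)
print(honest_content_type(accept))  # text/html
```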
I agree that it's a server bug if it's returning a content type of application/vnd.pypi.simple.v1+json and not returning data that matches this content type.
I don't think we need to worry too much about the parsing side of things: --find-links isn't standardised, so we can parse whatever we want, as long as it's a valid response (i.e., the content type and the content match). If we care about what content types we get for --find-links, we should send an appropriate Accept: header, one that accepts only text/html, if that's what we want.
But I thought we parsed --find-links pages differently anyway - it's a flat page of links, not a per-project index structure. So we shouldn't be re-using the parsing logic anyway. (And given that it's not an index, we probably should have a different Accept: header).
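As a hedged sketch of that last point, a find-links fetch could send an Accept header that asks only for HTML. The URL is the one from the report, and this is not how pip's session layer actually issues requests.

```python
import requests

# Request the find-links page as HTML only; a conforming server should
# then either return text/html or a 406 Not Acceptable.
resp = requests.get(
    "https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html",
    headers={"Accept": "text/html"},
)
print(resp.status_code, resp.headers.get("Content-Type"))
```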