JSON parsing error when installing a package using the `--find-links` option
Description
The related issue is described here: https://github.com/PaddlePaddle/Paddle/issues/44707
In brief, users can install the paddlepaddle package via the following command using pip 22.1.2, but the same command fails with 22.2.0.
pip install paddlepaddle-gpu==2.3.2.post111 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
So I think this is a regression.
The reported error is
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This happens because, since 22.2.0, pip tries to parse the returned HTML content as JSON. The related code is https://github.com/pypa/pip/blob/main/src/pip/_internal/index/collector.py#L327, which was introduced in #11158.
Although we can update the page to return valid JSON, keeping support for this HTML-format repository in pip would help us a lot during our upgrade period.
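For illustration, here is a minimal sketch of the failure (not pip's actual code): feeding an HTML body to json.loads raises exactly the JSONDecodeError shown above, which is what happens when pip treats this find-links page as a PEP 691 JSON index. The page content here is hypothetical and shortened.

```python
import json

# A shortened, hypothetical HTML find-links page body.
html_body = b'<html><body><a href="paddlepaddle_gpu-2.3.2.post111.whl">...</a></body></html>'

try:
    # pip 22.2.0's parse_links calls json.loads on the page content
    # when it decides the response is a PEP 691 JSON index.
    json.loads(html_body)
except json.JSONDecodeError as exc:
    print(exc)  # Expecting value: line 1 column 1 (char 0)
```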
Expected behavior
pip can parse HTML-format repository pages.
pip version
22.2.0
Python version
3.9
OS
All OS
How to Reproduce
- works well:
  pip install --upgrade pip==22.1.2
  pip install paddlepaddle-gpu==2.3.2.post111 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
- failed:
  pip install --upgrade pip==22.2.0
  pip install paddlepaddle-gpu==2.3.2.post111 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
Output
File "/opt/homebrew/lib/python3.9/site-packages/pip/_internal/index/package_finder.py", line 794, in process_project_url
page_links = list(parse_links(index_response))
File "/opt/homebrew/lib/python3.9/site-packages/pip/_internal/index/collector.py", line 313, in wrapper_wrapper
return wrapper(CacheablePageContent(page))
File "/opt/homebrew/lib/python3.9/site-packages/pip/_internal/index/collector.py", line 308, in wrapper
return list(fn(cacheable_page.page))
File "/opt/homebrew/lib/python3.9/site-packages/pip/_internal/index/collector.py", line 327, in parse_links
data = json.loads(page.content)
File "/opt/homebrew/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/opt/homebrew/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/homebrew/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Code of Conduct
- [X] I agree to follow the PSF Code of Conduct.
This was fixed in a patch release of 22.2. Please upgrade pip to the latest version and confirm the issue is resolved.
I tested with pip 22.2.2, and it still occurs.
$ pip --version
pip 22.2.2 from /opt/homebrew/lib/python3.9/site-packages/pip (python 3.9)
It seems that the find-links page is ignoring the Accept: header and always returns HTML. This means the page technically violates the standard, so I'd say it's the server's responsibility to fix this.
Yes, an upgrade of the server's scripts is ongoing. I did a quick search in the Python community and have not found a similar case caused by this kind of non-strict server response. It's worth pip keeping this use case in mind when implementing PEP 691.
❯ http head https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html "Accept: application/vnd.pypi.simple.v1+json,application/vnd.pypi.simple.v1+html; q=0.1,text/html; q=0.01"
HTTP/1.1 200
Connection: keep-alive
Content-Disposition: inline;filename=f.txt
Content-Length: 54587
Content-Type: application/vnd.pypi.simple.v1+json;charset=UTF-8
Date: Tue, 30 Aug 2022 08:50:06 GMT
Server: BLB/22.06.1.2
Set-Cookie: PADDLEID=583d9ec38e7a33a3b86a4e395020e6c7; Max-Age=3600; Expires=Tue, 30-Aug-2022 09:50:06 GMT; HttpOnly
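For reference, here is a rough Python equivalent of the httpie check above, using requests. The URL and Accept header are taken from the report; this only inspects the declared Content-Type, not the body.

```python
import requests

# Ask for PEP 691 JSON first, with HTML fallbacks, and see what the
# server claims to return. (HEAD only; the body is not fetched.)
resp = requests.head(
    "https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html",
    headers={
        "Accept": "application/vnd.pypi.simple.v1+json,"
        "application/vnd.pypi.simple.v1+html; q=0.1,text/html; q=0.01"
    },
)
print(resp.status_code, resp.headers.get("Content-Type"))
# At the time of the report, this printed a JSON content type even
# though the body was HTML.
```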
The problem lies with the server, which is serving content with the wrong content-type header (and also messing up the HTTP status code line + whitespace rules). It should be responding with plain text/html, in which case pip will use the corresponding parser.
> It seems that the find-links page is ignoring the Accept: header and always returns HTML.
Hang on, --find-links isn't meant to use the simple API, and shouldn't be involving the Accept header at all. If you want to use the simple index API you should be using --extra-index-url. If we're trying to get a JSON response from a page pointed to by --find-links, that's a bug.
Ah, you're right @pfmoore.
I think we'll need to separate the two pieces of logic in our parser, since we shouldn't be trying to parse find-links as JSON.
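To make the suggested separation concrete, here is a hedged sketch (hypothetical function names, not pip's real parser): simple-API responses dispatch on the declared content type per PEP 691, while find-links pages are always treated as flat HTML link pages.

```python
import json
from html.parser import HTMLParser
from typing import Iterator, List


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _parse_html_links(content: bytes) -> Iterator[str]:
    parser = _HrefCollector()
    parser.feed(content.decode("utf-8", errors="replace"))
    yield from parser.hrefs


def parse_index_response(content: bytes, content_type: str) -> Iterator[str]:
    # Simple-API responses: dispatch on the declared content type (PEP 691).
    if content_type.split(";")[0].strip() == "application/vnd.pypi.simple.v1+json":
        for file in json.loads(content).get("files", []):
            yield file["url"]
    else:
        yield from _parse_html_links(content)


def parse_find_links_page(content: bytes) -> Iterator[str]:
    # find-links pages are not the simple API: never attempt JSON here,
    # regardless of what Content-Type the server claims.
    yield from _parse_html_links(content)
```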
We have updated page: https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
The install command with the -f option now works on pip 22.2.0 and later versions.
To reproduce this bug, you may now need a different setup that mimics this case.
I thought about this a bit more. While the application/vnd.pypi.simple.v1+json content type is not designed for find-links pages, the problem here is that the find-links server reports an HTML response as application/vnd.pypi.simple.v1+json. This is probably due to the server being naively implemented, simply copying the first entry in Accept: into the response without actually checking it.
IMO it is not unreasonable to expect the server to at least respond with Content-Type: text/html instead, so while this is a new, technically unnecessary requirement on find-links pages, I think it is arguably a reasonable one and can be retained.
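To illustrate the suspected naive behavior, here is a hypothetical sketch (not any real server's code) contrasting a server that echoes the first Accept entry with one that honestly reports the only format it can serve.

```python
def naive_content_type(accept_header: str) -> str:
    # Echo the first Accept entry back, without checking whether we can
    # actually produce that format - the suspected behavior here.
    return accept_header.split(",")[0].split(";")[0].strip()


def honest_content_type(accept_header: str) -> str:
    # This server only ever produces HTML, so it should say so,
    # regardless of what the client asked for.
    return "text/html"


accept = "application/vnd.pypi.simple.v1+json,text/html; q=0.01"
print(naive_content_type(accept))   # application/vnd.pypi.simple.v1+json (wrong)
print(honest_content_type(accept))  # text/html
```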
I agree that it's a server bug if it's returning a content type of application/vnd.pypi.simple.v1+json and not returning data that matches this content type.
I don't think we need to worry too much about the parsing side of things: --find-links isn't standardised, so we can parse whatever we want, as long as it's a valid response (i.e., the content type and the content match). If we care about what content types we get for --find-links, we should send an appropriate Accept: header, one that accepts only text/html, if that's what we want.
But I thought we parsed --find-links pages differently anyway - it's a flat page of links, not a per-project index structure. So we shouldn't be re-using the parsing logic anyway. (And given that it's not an index, we probably should have a different Accept: header).
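As a hedged sketch of that last point, a find-links fetch could send an Accept header that asks only for HTML. The URL is the one from the report, and this is not how pip's session layer actually issues requests.

```python
import requests

# Request the find-links page as HTML only; a conforming server should
# then either return text/html or a 406 Not Acceptable.
resp = requests.get(
    "https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html",
    headers={"Accept": "text/html"},
)
print(resp.status_code, resp.headers.get("Content-Type"))
```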