Implement PEP 691 JSON Simple Index Support
Add logic for bandersnatch to save both the HTML and JSON simple index files. This will allow people to serve both the HTML and JSON in their mirrors.
- Happy to introduce a config to select the formats to save
- We would expect most tools to eventually prefer the JSON
We should also update the docs and give an example of how to serve the right file based on request headers (conneg), as outlined in PEP 691.
My suggestions here:
Write out 3 files: index.html, index.v1_html, and index.v1_json. These will map to:
| Ext | Content Type |
|---|---|
| .html | text/html |
| .v1_html | application/vnd.pypi.simple.v1+html |
| .v1_json | application/vnd.pypi.simple.v1+json |
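As a rough illustration of the table above, here is a minimal Python sketch of writing the three files side by side. The function name `write_simple_index` and its signature are hypothetical (this is not bandersnatch's actual code), and it assumes that index.html and index.v1_html carry the same HTML body:

```python
from __future__ import annotations

from pathlib import Path

# Extension -> content type, taken from the table above.
CONTENT_TYPES = {
    ".html": "text/html",
    ".v1_html": "application/vnd.pypi.simple.v1+html",
    ".v1_json": "application/vnd.pypi.simple.v1+json",
}


def write_simple_index(directory: Path, html: str, json_text: str) -> list[Path]:
    """Write index.html, index.v1_html and index.v1_json into `directory`.

    Hypothetical helper: assumes both HTML variants share one body.
    """
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for ext, content_type in CONTENT_TYPES.items():
        # Only the +json content type gets the JSON body.
        body = json_text if content_type.endswith("+json") else html
        path = directory / f"index{ext}"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
```

Calling it once per project directory under /simple/ would leave the three files next to each other, which is what both server configs below expect.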
For Apache, if you have mod_negotiation enabled you can use a .htaccess that looks like this inside of the /simple/ directory:
Options -Indexes +Multiviews
DirectoryIndex index
AddType application/vnd.pypi.simple.v1+json v1_json
AddType application/vnd.pypi.simple.v1+html v1_html
This will:
- Disable autogenerated directory indexes.
- Turn on "MultiViews", which enables the conneg support that maps an Accept header to a file extension.
- Update the default directory index from index.html to just index, which will let MultiViews look up the correct file extension.
- Add content types for our two custom content types, and tell Apache what file extension they use.
You can use this in Docker with the httpd image, but it requires modifying the built-in config to enable mod_negotiation and to make it read .htaccess files. A Dockerfile that implements that would look like:
FROM httpd
RUN echo '\n\
LoadModule negotiation_module modules/mod_negotiation.so\n\
\n\
<Directory "/usr/local/apache2/htdocs">\n\
AllowOverride All\n\
</Directory>' >> /usr/local/apache2/conf/httpd.conf
This can be run using docker run --rm -dit -p 8080:80 -v PATHTOBANDERWEB:/usr/local/apache2/htdocs/ theimagebuiltabove, with the .htaccess added.
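To check what a running mirror actually negotiates, something like the following Python sketch could be used. The function names and the q-values are illustrative assumptions, not taken from pip or bandersnatch:

```python
import urllib.request

# Prefer JSON, but still accept both HTML variants.
# The q-values here are illustrative only.
ACCEPT = ", ".join(
    [
        "application/vnd.pypi.simple.v1+json",
        "application/vnd.pypi.simple.v1+html;q=0.2",
        "text/html;q=0.01",
    ]
)


def build_request(url: str, accept: str = ACCEPT) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the Accept header."""
    return urllib.request.Request(url, headers={"Accept": accept})


def negotiated_type(url: str, accept: str = ACCEPT) -> str:
    """Send the request and return the Content-Type the server picked."""
    with urllib.request.urlopen(build_request(url, accept)) as resp:
        return resp.headers.get_content_type()
```

With the container above listening on port 8080, `negotiated_type("http://localhost:8080/simple/")` should report which of the three content types the server chose for that Accept header.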
Alternatively, you can use nginx. The adapted banderx config looks something like this:
daemon off;
user nginx;
worker_processes auto;
error_log /dev/stderr info;
pid /run/nginx.pid;
events {
worker_connections 2048;
}
http {
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /dev/stdout main;
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 69;
types_hash_max_size 2048;
include /etc/nginx/mime.types;
default_type application/octet-stream;
map $http_accept $mirror_suffix {
default ".html";
"~*application/vnd\.pypi\.simple\.latest\+json" ".v1_json";
"~*application/vnd\.pypi\.simple\.latest\+html" ".v1_html";
"~*application/vnd\.pypi\.simple\.v1\+json" ".v1_json";
"~*application/vnd\.pypi\.simple\.v1\+html" ".v1_html";
"~*text/html" ".html";
}
map $arg_format $mirror_suffix_via_url {
"application/vnd.pypi.simple.latest+json" ".v1_json";
"application/vnd.pypi.simple.latest+html" ".v1_html";
"application/vnd.pypi.simple.v1+json" ".v1_json";
"application/vnd.pypi.simple.v1+html" ".v1_html";
"text/html" ".html";
}
server {
listen 80 default_server;
listen [::]:80 default_server;
server_name banderx;
root /data/pypi/web;
autoindex on;
charset utf-8;
location /simple/ {
# Uncomment to support hash_index = true bandersnatch mirrors
# rewrite ^/simple/([^/])([^/]*)/$ /simple/$1/$1$2/ last;
# rewrite ^/simple/([^/])([^/]*)/([^/]+)$ /simple/$1/$1$2/$3 last;
index index$mirror_suffix_via_url index$mirror_suffix;
types {
application/vnd.pypi.simple.v1+json v1_json;
application/vnd.pypi.simple.v1+html v1_html;
text/html html;
}
# Uncomment to support conneg for files other than
# index, so that /simple/foo will map to /simple/foo.html,
# /simple/foo.v1_html, or /simple/foo.v1_json based on the
# Accept header.
# try_files $uri$mirror_suffix $uri $uri/ =404;
}
# Let us set the correct mime type for all the JSON
location /json/ {
default_type application/json;
}
location /pypi/ {
default_type application/json;
}
error_page 404 /404.html;
location = /404.html {
}
error_page 500 502 503 504 /50x.html;
location = /50x.html {
}
}
}
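To make the selection logic in the two `map` blocks and the `index` directive easier to follow, here is a simplified Python emulation. It is only a sketch of my reading of the config: the function names are made up, `format_arg` stands for the raw value of the ?format= parameter, and nginx itself does the final "first existing file wins" step.

```python
import re

# Same order as `map $http_accept $mirror_suffix`: the first matching regex wins.
ACCEPT_RULES = [
    (r"application/vnd\.pypi\.simple\.latest\+json", ".v1_json"),
    (r"application/vnd\.pypi\.simple\.latest\+html", ".v1_html"),
    (r"application/vnd\.pypi\.simple\.v1\+json", ".v1_json"),
    (r"application/vnd\.pypi\.simple\.v1\+html", ".v1_html"),
    (r"text/html", ".html"),
]

# Same entries as `map $arg_format $mirror_suffix_via_url` (exact string matches).
FORMAT_RULES = {
    "application/vnd.pypi.simple.latest+json": ".v1_json",
    "application/vnd.pypi.simple.latest+html": ".v1_html",
    "application/vnd.pypi.simple.v1+json": ".v1_json",
    "application/vnd.pypi.simple.v1+html": ".v1_html",
    "text/html": ".html",
}


def mirror_suffix(accept: str) -> str:
    """Emulate $mirror_suffix: regex-test the Accept header, fall back to .html."""
    for pattern, suffix in ACCEPT_RULES:
        if re.search(pattern, accept, re.IGNORECASE):
            return suffix
    return ".html"


def index_candidates(accept: str, format_arg: str = "") -> list:
    """Emulate `index index$mirror_suffix_via_url index$mirror_suffix;`.

    The server tries the candidates in order and serves the first that exists.
    An unknown or missing ?format= yields the bare name "index", which is not
    one of the three files written, so the Accept-based candidate is used.
    """
    via_url = FORMAT_RULES.get(format_arg, "")
    return ["index" + via_url, "index" + mirror_suffix(accept)]
```

For example, `index_candidates("text/html", "application/vnd.pypi.simple.latest+json")` gives `["index.v1_json", "index.html"]`, so the ?format= value takes precedence over the Accept header.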
The big differences between Apache and Nginx here are:
- Apache actually implements conneg, so it will read and interpret the Accept header and select the correct content type based on that.
  - This means that clients can control which content type they prefer, while still listing all of the content types they support, using the ;q=N parameter to indicate relative preference.
- Nginx does not actually implement conneg; it's just faking support for it by populating the $mirror_suffix variable by doing regex testing against the Accept header, with a default fallback to .html.
  - This means that it isn't going to support the q=N parameter for clients to express which content types they prefer out of the ones they support. This is allowed under conneg: servers are not required to return the content type the client most prefers, but it's nice if they do, since the client presumably has a reason to prefer it.
  - There's one possible bug here: ;q=0 typically disables the content type, but since the nginx config doesn't actually parse/understand the Accept header, it will ignore that qvalue as well. Using q=0 is pretty rare, so I don't think it's a particularly big deal.
- Nginx supports the latest aliases for our custom content types. Apache does not, because Apache's conneg doesn't let us return a different content type than the one matched in the Accept header, while Nginx does.
  - Possibly you could do this with mod_rewrite or something, I'm not sure.
- When conneg fails, Apache defaults to whichever version is the smallest response, while Nginx defaults to whatever version is mentioned as the default in the map (in the above case, it's .html).
  - Apache's behavior could be weird, as different packages will default to html or json depending on which one happens to be smaller. This shouldn't be a big deal for pip, since old versions of pip asked for text/html and the PEP 691-ified pip asks for all 3.
  - It's possible there's some trick with mod_rewrite that would let you set a default that would be used when there isn't an Accept header, I'm not sure.
- The Nginx option supports the ?format= query parameter, which will override the Accept header if it's been specified.
  - This may be possible to replicate with mod_rewrite, I'm not sure.
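To illustrate the q-value difference described above, here is a minimal Python sketch of what honoring qvalues looks like. It is not how Apache implements it: wildcards such as */* are ignored, ties go to the first listed type, and falling back to a default instead of an error response is an assumption made for brevity.

```python
from __future__ import annotations

AVAILABLE = (
    "text/html",
    "application/vnd.pypi.simple.v1+html",
    "application/vnd.pypi.simple.v1+json",
)


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media_type, q) pairs; q defaults to 1.0."""
    entries = []
    for part in header.split(","):
        media_type, *params = (p.strip() for p in part.split(";"))
        if not media_type:
            continue
        q = 1.0
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed qvalue as "not acceptable"
        entries.append((media_type.lower(), q))
    return entries


def negotiate(header: str, available=AVAILABLE, default: str = "text/html") -> str:
    """Pick the available type with the highest q, skipping anything with q=0."""
    best, best_q = None, 0.0
    for media_type, q in parse_accept(header):
        if media_type in available and q > best_q:
            best, best_q = media_type, q
    return best if best is not None else default
```

Given `text/html, application/vnd.pypi.simple.v1+json;q=0`, this returns `text/html`, whereas the regex-based nginx map would pick the JSON file because the JSON pattern is tested first and the qvalue is never parsed.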
Personally, I would recommend sticking with nginx for banderx.
I don't think it will matter to anyone that Nginx's conneg support is basic regex matching rather than real conneg, unless they're purposely trying to do weird things. The ability to pick which version is the default is a really nice thing, as it lets a mirror operator decide what level of compatibility they want (my above config chooses max compatibility). The extra features supported by the nginx config (the latest aliases, the ?format= URL parameter) are nice to have as well.
On the other hand, I think that Apache's behavior of defaulting to whatever response is smallest is nice for saving bandwidth, but I think it's kind of weird that different URLs under /simple/ may end up with randomly different default options.
One additional thing:
The above assumes that bandersnatch is going to swap out from writing just index.html files, to writing the 3 files mentioned above alongside each other, which makes a lot of sense for people who want a single URL to support all of the content types available.
Some people may want to not rely on conneg, and have different URLs for different content types. I think bandersnatch could support this pretty easily using two options:
- If a configuration format is introduced to filter the content types that bandersnatch will emit, then obviously you could just run multiple copies of bandersnatch with different content types filtered.
- Support an option to store the different content types in different root directories, so instead of something like /data/pypi/web/simple/pkgname/, if this option was turned on you would do /data/pypi/web/simple/html/pkgname/, /data/pypi/web/simple/v1+json/pkgname/, etc.
  - Using this would then mean doing something like pip install -i https://example.com/simple/v1+json/.
  - This might be YAGNI; maybe nobody actually wants to do this. Just a random idea that popped into my head that is supported by PEP 691, that people might want to do.
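The second option could look roughly like the Python sketch below. Everything here is hypothetical: the `split_by_format` flag and the `index_dir` helper do not exist in bandersnatch, and the `v1+html` directory name is extrapolated from the `html` and `v1+json` examples above.

```python
from pathlib import PurePosixPath

# Content type -> per-format root directory under /simple/.
# "html" and "v1+json" come from the text; "v1+html" is an extrapolation.
FORMAT_DIRS = {
    "text/html": "html",
    "application/vnd.pypi.simple.v1+html": "v1+html",
    "application/vnd.pypi.simple.v1+json": "v1+json",
}


def index_dir(
    web_root: str,
    package: str,
    content_type: str,
    split_by_format: bool = False,
) -> PurePosixPath:
    """Return the directory that would hold a package's simple index."""
    base = PurePosixPath(web_root) / "simple"
    if split_by_format:
        # One root per content type, so each format gets its own index URL.
        base = base / FORMAT_DIRS[content_type]
    return base / package
```

With the flag off every content type shares /data/pypi/web/simple/pkgname/ and relies on conneg. With it on, each format lives under its own prefix and clients choose by URL instead.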
#1154 + #1161