devpi
Replicas with --requests-only
Hi :)
I have just noticed that there is duplicate documentation about the --requests-only feature:
First occurrence:
- makes it sound like the feature is well suited to satisfy “pip install” requests
- https://devpi.net/docs/devpi/devpi/latest/+doc/adminman/server.html#multi-process-high-performance-setups
Second occurrence:
- marks the feature as experimental
- warns that the feature is not suitable even for package downloads due to potential concurrent writes
- https://devpi.net/docs/devpi/devpi/latest/+doc/adminman/server.html#multiple-server-instances
From the way I understand the devpi transaction code, I would expect downloads to be possible even with sqlite. Is the second occurrence outdated?
The latter one still stands. It will work if no mirror index is involved. If you only use private indexes and don't base on root/pypi or a custom mirror, then there will only be reads. As soon as there is a mirror index involved, even a read of a simple index page may cause a write to update the data from the mirror after the cache has expired and the available releases on the mirror have changed.
I have some ideas which could potentially drastically improve concurrent write performance when using postgresql, but they require quite some refactoring. I'm planning to write a prototype to test the performance, but haven't got time for that yet.
Forgot to add: That same backend change would make multiple instances more performant as well.
What would happen if one used the feature even with root/pypi or other mirrors? Should we expect crashes or even corrupt data?
We have been using the following setup for a few days and have not noticed any issues so far:
- nginx as load balancer, which
  - forwards GET requests from pip/setuptools to eight worker instances running with --role replica --requests-only
  - forwards everything else to one ordinary replica instance running with --role replica
- all nine replicas have the same master process configured. This is an ordinary master, so it runs without the --requests-only flag.
I would expect that only the master triggers the write operations. Those are then made available to the ordinary replicas via the +changelog mechanism. The request-only replicas learn about the newly replicated data once they open their next read transaction, but they will never perform write requests themselves.
So overall, I am not expecting any write conflicts, as there will always be only one writer per database. Am I missing something? Thanks!
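The routing described above could be sketched in nginx roughly as follows. This is an illustrative sketch only, not the configuration actually in use: the upstream names, the $devpi_backend variable, the listen port and all backend ports are placeholders, and only two of the eight workers are shown. The upstream and map blocks belong in the http context.

```nginx
# Hypothetical sketch: all names and ports below are placeholders.
# Pool of --requests-only worker replicas (two shown; the setup above uses eight).
upstream devpi_workers {
    server 127.0.0.1:3154;
    server 127.0.0.1:3155;
}
# The single ordinary replica that handles everything else.
upstream devpi_replica {
    server 127.0.0.1:3142;
}
# Choose the backend from the request method plus the user agent.
map "$request_method:$http_user_agent" $devpi_backend {
    default                        devpi_replica;
    "~*^GET:.*(pip|setuptools)/"   devpi_workers;
}
server {
    listen 80;
    location / {
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        # A variable in proxy_pass is looked up among the upstream groups above.
        proxy_pass http://$devpi_backend;
    }
}
```

Using a map keeps the decision out of "if" blocks, and any request that does not match the installer pattern falls through to the ordinary replica.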
I'd say your thinking is correct. I hadn't thought of using replicas with --requests-only, good idea! Are you using devpi-web in that setup? If you do, then I'd suspect that the search and descriptions don't work as expected, as there is no worker thread to process the events.
Hello @StephanErb, I am wondering if this configuration is still working well for you? Also, how is --serverdir set up for your instances? Do the master and the 'ordinary' replica have their own directories (say 'masterDir' and 'replicaDir'), with the request-only replicas also pointing to replicaDir? Thank you!
@sdwilliams61 yeah, the config still works for us. We are using the following command lines:
Worker replica (we run about 8 of those per box):
/opt/devpi-4/replica/venv/bin/python3.6 /opt/devpi-4/replica/venv/bin/devpi-server --port 3154 --serverdir /opt/devpi-4/replica/data --role replica --logger-cfg /opt/devpi-4/replica/logger.yaml --keyfs-cache-size 32768 --mirror-cache-expiry 1800 --replica-max-retries 3 --master-url https://<SNIP>:8443 --theme /opt/devpi-4/theme --requests-only --threads 1
Regular replica:
/opt/devpi-4/replica/venv/bin/python3.6 /opt/devpi-4/replica/venv/bin/devpi-server --port 3142 --serverdir /opt/devpi-4/replica/data --role replica --logger-cfg /opt/devpi-4/replica/logger.yaml --keyfs-cache-size 32768 --mirror-cache-expiry 1800 --replica-max-retries 3 --master-url https://<SNIP>:8443 --theme /opt/devpi-4/theme
Master:
/opt/devpi-4/master/venv/bin/python3.6 /opt/devpi-4/master/venv/bin/devpi-server --port 3141 --serverdir /opt/devpi-4/master/data --role master --logger-cfg /opt/devpi-4/master/logger.yaml --keyfs-cache-size 32768 --mirror-cache-expiry 1800 --replica-max-retries 3
I hope this allows you to extract the information you are looking for. Right now we hide the regular replica and master from users and only expose the workers via nginx as those offer the best performance.
Excellent! Thank you very much.
We have been using this configuration successfully for some months now. One problem we have occasionally is the corruption of the .nodeinfo file in /replica/data. The (worker) replicas each read-then-write this single file. In cases of poor spacing between replica restarts (read: nearly simultaneous), this issue crops up. (Does the .nodeinfo data ever change?)
@sdwilliams61 we plan to get rid of most remaining data in .nodeinfo anyway. I'll look into the issue you describe, I have a suspicion on what is causing that.
The .nodeinfo corruption should be fixed by this commit https://github.com/devpi/devpi/commit/cef4ebeb5b33b24f281ded5d35534b8e99ef32dc in the upcoming devpi-server 6.0.0.
Looking into setting up multiple read replicas as well. Is --requests-only still experimental and not safe with mirrored indexes? What's the currently recommended setup if you need replicas due to load but still want the search UI to work?
I would try to avoid replicas for load balancing. They were added for globally distributed data centers.
For the pip install use case I would recommend serving release files by nginx, as is already done with the default nginx config created by devpi-genconfig. Additionally I would cache all requests to +simple pages with some form of re-validation, depending on how important picking up new releases in time is. Maybe add general caching depending on installer user agents to avoid load when the +simple part is missing in the index URL for the installer, see https://github.com/devpi/devpi/blob/d5212d1e21849f6d058f9514f8035a60d5cc1f12/server/devpi_server/views.py#L68-L69
With latest pip and devpi-server you can also add caching depending on the accept header for PEP-691 application/vnd.pypi.simple.v1+json on all pages to capture all possible variants of index URLs.
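As a rough illustration of caching +simple pages with re-validation, a location block along these lines might serve as a starting point. It is a sketch under assumptions, not a tested recommendation: the zone name devpi_simple, the cache path, the lifetimes and the backend address are placeholders, and adding $http_accept to the cache key is one assumed way to keep HTML and PEP-691 JSON responses apart.

```nginx
# Hypothetical sketch: zone name, cache path, lifetimes and backend address are placeholders.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_simple:10m inactive=30m;
server {
    listen 80;
    location ~ /\+simple/ {
        proxy_cache devpi_simple;
        # Assumption: separate cache entries per Accept header (HTML vs. PEP-691 JSON).
        proxy_cache_key "$scheme$proxy_host$request_uri$http_accept";
        # Short lifetime: a new release becomes visible after at most this long.
        proxy_cache_valid 200 5m;
        # Refresh expired entries with conditional requests if the backend supports them.
        proxy_cache_revalidate on;
        # Keep serving the old entry while a single request refreshes it.
        proxy_cache_use_stale updating;
        proxy_cache_lock on;
        proxy_set_header Host $http_host;
        proxy_pass http://localhost:3141;
    }
    location / {
        proxy_set_header Host $http_host;
        proxy_pass http://localhost:3141;
    }
}
```

The lifetime in proxy_cache_valid is the knob for how quickly new releases must be picked up: shorter values mean fresher pages but more requests reaching devpi-server.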
Thanks @fschulze, caching for installer user agents sounds promising, appreciate the suggestions!
So I think I have something working similarly to what you describe. Since devpi doesn't seem to set cache control headers on anything but files, I need to proxy nginx to itself to add them. WDYT?
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m
                 inactive=60m use_temp_path=off;
upstream devpi {
    server ${WEB_DEVPI_SERVER};
}
map $http_x_forwarded_proto $thescheme {
    default $scheme;
    https https;
}
server {
    server_name ${WEB_SERVER_NAME};
    listen 8000;
    gzip on;
    gzip_min_length 2000;
    gzip_proxied any;
    gzip_types application/json;
    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;
    root /devpi/;
    location ~ /\+f/ {
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }
        expires max;
        try_files /+files$uri @proxy_to_app;
    }
    location ~ /\+doc/ {
        try_files $uri @proxy_to_app;
    }
    location / {
        error_page 418 = @proxy_to_app;
        return 418;
    }
    set $no_cache 1;
    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }
    location @proxy_to_app {
        add_header X-Cache-Status $upstream_cache_status;
        proxy_cache_bypass $no_cache;
        proxy_cache devpi_cache;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $thescheme;
        proxy_set_header X-Outside-URL $thescheme://$http_host;
        proxy_pass http://127.0.0.1:8001;
    }
}
server {
    server_name ${WEB_SERVER_NAME};
    listen 8001;
    access_log off;
    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;
    location / {
        if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
        {
            expires 1h;
        }
        proxy_pass http://devpi;
    }
}
I think you can use proxy_ignore_headers instead of recursive proxy.
You might want to add an if for /\+simple/? locations to enable caching.
Have you tested application/vnd.pypi.simple.v1+json from newer pip? It should produce different cache keys. You might want to add an if for that as well to explicitly enable caching.
Would you mind if I add this as an example to devpi genconfig when it works?
> I think you can use proxy_ignore_headers instead of recursive proxy.
I tried that; I may have been doing something wrong. But I'm not sure ignoring headers will work since devpi isn't returning any cache control headers to be ignored. The issue seems to be that nginx proxy caching needs the upstream server to actually say something needs to be cached.
A future fix might be to have devpi optionally return cache control headers for responses other than files?
> You might want to add an if for /\+simple/? locations to enable caching.
> Have you tested application/vnd.pypi.simple.v1+json from newer pip? It should produce different cache keys. You might want to add an if for that as well to explicitly enable caching.
Both of those seem to work, though collectively they are somewhat redundant. Also, for my use case I'd rather not cache responses to browsers. I just need to cache responses for package installers to handle high load from them.
> Would you mind if I add this as an example to devpi genconfig when it works?
Totally, I'll post a new version with the additional options for if blocks.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m
                 inactive=60m use_temp_path=off;
upstream devpi {
    server ${WEB_DEVPI_SERVER};
}
map $http_x_forwarded_proto $thescheme {
    default $scheme;
    https https;
}
server {
    server_name ${WEB_SERVER_NAME};
    listen 8000;
    gzip on;
    gzip_min_length 2000;
    gzip_proxied any;
    gzip_types text/plain text/css text/xml
               application/json application/vnd.pypi.simple.v1+json
               application/javascript text/javascript
               application/xml application/xml+rss;
    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;
    root /devpi/;
    location ~ /\+f/ {
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }
        expires max;
        try_files /+files$uri @proxy_to_app;
    }
    location ~ /\+doc/ {
        try_files $uri @proxy_to_app;
    }
    location / {
        error_page 418 = @proxy_to_app;
        return 418;
    }
    set $no_cache 1;
    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }
    if ($request_uri ~ ".*/\+simple/")
    {
        set $no_cache 0;
    }
    if ($http_accept ~ ".*application/vnd\.pypi\.simple\.v1\+json")
    {
        set $no_cache 0;
    }
    location @proxy_to_app {
        add_header X-Cache-Status $upstream_cache_status;
        proxy_cache_bypass $no_cache;
        proxy_cache devpi_cache;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $thescheme;
        proxy_set_header X-Outside-URL $thescheme://$http_host;
        proxy_pass http://127.0.0.1:8001;
    }
}
server {
    server_name ${WEB_SERVER_NAME};
    listen 8001;
    access_log off;
    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;
    set $no_cache 1;
    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }
    if ($request_uri ~ ".*/\+simple/")
    {
        set $no_cache 0;
    }
    if ($http_accept ~ ".*application/vnd\.pypi\.simple\.v1\+json")
    {
        set $no_cache 0;
    }
    set $expires_value "1h";
    if ($no_cache) {
        set $expires_value "off";
    }
    location / {
        expires $expires_value;
        proxy_pass http://devpi;
    }
}
I think I plan to use a simpler version that just caches based on user agent. The reason we need a cache at all is the load from lots of CI/CD builds happening together. I don't see any need to cache from just people browsing around or curling, say, which just doesn't happen that much.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m
                 inactive=10m use_temp_path=off;
upstream devpi {
    server ${WEB_DEVPI_SERVER};
}
map $http_x_forwarded_proto $thescheme {
    default $scheme;
    https https;
}
server {
    server_name ${WEB_SERVER_NAME};
    listen 8000;
    gzip on;
    gzip_min_length 2000;
    gzip_proxied any;
    gzip_types text/plain text/css text/xml
               application/json application/vnd.pypi.simple.v1+json
               application/javascript text/javascript
               application/xml application/xml+rss;
    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;
    root /devpi/;
    location ~ /\+f/ {
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }
        expires max;
        try_files /+files$uri @proxy_to_app;
    }
    location ~ /\+doc/ {
        try_files $uri @proxy_to_app;
    }
    location / {
        error_page 418 = @proxy_to_app;
        return 418;
    }
    set $no_cache 1;
    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }
    location @proxy_to_app {
        add_header X-Cache-Status $upstream_cache_status;
        proxy_cache devpi_cache;
        proxy_cache_bypass $no_cache;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $thescheme;
        proxy_set_header X-Outside-URL $thescheme://$http_host;
        proxy_pass http://127.0.0.1:8001;
    }
}
server {
    server_name ${WEB_SERVER_NAME};
    listen 8001;
    access_log off;
    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;
    set $no_cache 1;
    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }
    set $expires_value "10m";
    if ($no_cache) {
        set $expires_value "off";
    }
    location / {
        expires $expires_value;
        proxy_pass http://devpi;
    }
}
Testing with a local index seems to work. I uploaded an internal package to a local index and then did a pip install of it. Then I uploaded a new version and did a pip install --upgrade of the same. For the duration of the cache lifetime, the upgrade did nothing, then afterwards it found the new version and installed it.
The key for caching without recursive proxy seems to be: proxy_cache_valid 200 10m;
The current output I have for devpi-genconfig:
# adjust to your system and liking
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m inactive=1800s use_temp_path=off;
map $http_x_forwarded_proto $x_scheme {
    default $scheme;
    http http;
    https https;
}
server {
    server_name localhost $hostname "";
    listen 80;
    gzip on;
    gzip_min_length 2000;
    gzip_proxied any;
    # add application/vnd.pypi.simple.v1+json to the gzip_types
    gzip_types text/plain text/css text/xml
               application/json application/vnd.pypi.simple.v1+json
               application/javascript text/javascript
               application/xml application/xml+rss;
    proxy_read_timeout 60s;
    client_max_body_size 64M;
    # set to where your devpi-server state is on the filesystem
    root /Users/fschulze/.devpi/server;
    # by default we bypass the cache
    set $no_cache 1;
    # if we detect a known installer, we enable caching
    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }
    # for https://peps.python.org/pep-0691/ we also enable caching
    if ($http_accept ~ ".*application/vnd\.pypi\.simple\.v1\+json")
    {
        set $no_cache 0;
    }
    # try serving static files directly
    location ~ /\+f/ {
        # workaround to pass non-GET/HEAD requests through to the named location below
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }
        expires max;
        try_files /+files$uri @proxy_to_app;
    }
    # try serving docs directly
    location ~ /\+doc/ {
        # if the --documentation-path option of devpi-web is used,
        # then the root must be set accordingly here
        root /Users/fschulze/.devpi/server;
        try_files $uri @proxy_to_app;
    }
    location / {
        # workaround to pass all requests to / through to the named location below
        error_page 418 = @proxy_to_app;
        return 418;
    }
    location @proxy_to_app {
        # use the keys_zone defined above
        proxy_cache devpi_cache;
        proxy_cache_bypass $no_cache;
        add_header X-Cached $upstream_cache_status;
        # adjust to your liking
        proxy_cache_valid 200 1800s;
        proxy_pass http://localhost:3141;
        proxy_set_header X-Forwarded-Proto $x_scheme;
        proxy_set_header X-outside-url $x_scheme://$http_host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
See https://github.com/fschulze/devpi/commit/a341a59f5e0533e15f1282de8c1528936fc5ec1b
Nice, proxy_cache_valid seems to be doing the trick!
After some experimentation, it occurs to me that files should never be cached by nginx, since they get cached on the filesystem. WDYT about adding this as the last if block?
# do not cache files
if ($request_uri ~ ".*/\+f/")
{
    set $no_cache 1;
}
It is better to set it in the location, see https://www.nginx.com/resources/wiki/start/topics/tutorials/config_pitfalls/ and https://www.nginx.com/resources/wiki/start/topics/depth/ifisevil/
I also renamed the variable and switched to using map.
You could move the ifs to the / location and remove the set $bypass_caching 1; from the other locations for a more efficient config. I kept them there with a comment for other use cases.
# adjust the path for your system,
# the size (in keys_zone) and the life time to your liking,
# by default the life time matches the mirror cache expiry setting
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m inactive=1800s use_temp_path=off;
map $http_user_agent $devpi_installer_agent {
    default 0;
    ~*distribute/ 1;
    ~*setuptools/ 1;
    ~*pip/ 1;
    ~*pex/ 1;
}
map $http_accept $devpi_installer_accept {
    default 0;
    ~*application/vnd\.pypi\.simple\.v1\+json 1;
}
map $http_x_forwarded_proto $x_scheme {
    default $scheme;
    http http;
    https https;
}
server {
    server_name localhost $hostname "";
    listen 80;
    gzip on;
    gzip_min_length 2000;
    gzip_proxied any;
    # add application/vnd.pypi.simple.v1+json to the gzip_types
    gzip_types text/plain text/css text/xml
               application/json application/vnd.pypi.simple.v1+json
               application/javascript text/javascript
               application/xml application/xml+rss;
    proxy_read_timeout 60s;
    client_max_body_size 64M;
    # set to where your devpi-server state is on the filesystem
    root /Users/fschulze/.devpi/server;
    # by default we bypass the cache
    set $bypass_caching 1;
    # if we detect a known installer, we enable caching
    if ($devpi_installer_agent)
    {
        set $bypass_caching 0;
    }
    # for https://peps.python.org/pep-0691/ we also enable caching
    if ($devpi_installer_accept)
    {
        set $bypass_caching 0;
    }
    # try serving static files directly
    location ~ /\+f/ {
        # workaround to pass non-GET/HEAD requests through to the named location below
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }
        expires max;
        # in case we let nginx cache responses, we disable it here.
        # if you use a backend which doesn't have files on the filesystem
        # or your nginx can't access them, set it to 0 instead:
        set $bypass_caching 1;
        try_files /+files$uri @proxy_to_app;
    }
    # try serving docs directly
    location ~ /\+doc/ {
        # if the --documentation-path option of devpi-web is used,
        # then the root must be set accordingly here
        root /Users/fschulze/.devpi/server;
        # in case we let nginx cache responses, we disable it here.
        # if you use a backend which doesn't have files on the filesystem
        # or your nginx can't access them, set it to 0 instead:
        set $bypass_caching 1;
        try_files $uri @proxy_to_app;
    }
    location / {
        # workaround to pass all requests to / through to the named location below
        error_page 418 = @proxy_to_app;
        return 418;
    }
    location @proxy_to_app {
        # use the keys_zone defined above
        proxy_cache devpi_cache;
        proxy_cache_bypass $bypass_caching;
        add_header X-Cached $upstream_cache_status;
        # adjust the life time to your liking, by default it matches
        # the mirror cache expiry setting
        proxy_cache_valid 200 1800s;
        proxy_pass http://localhost:3141;
        proxy_set_header X-Forwarded-Proto $x_scheme;
        proxy_set_header X-outside-url $x_scheme://$http_host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
I like it!!
Something I think I noticed in testing -- if devpi timed out accessing upstream pypi.org, then it returned a "stale" result per https://github.com/devpi/devpi/blob/2d80b7be410a18f01068c10d375456a20f405a76/server/CHANGELOG#L116.
Subsequent runs would serve the stale content until the nginx cache expired. I wonder if a future version of devpi could return a cache control header disabling caching when it is serving stale content?
Can you try with my temp branch?
> Can you try with my temp branch?
Working well so far, though I'm not sure I've observed an upstream timeout again. I'll reduce the devpi timeout and try to reproduce.
You could force one by editing /etc/hosts to point pypi.org to 127.0.0.1, or you could change the mirror_url to example.com.
> You could force one by editing /etc/hosts to point pypi.org to 127.0.0.1, or you could change the mirror_url to example.com.
I think the tricky bit is I need it to time out rather than return 404. I ended up using nc -l PORTNO and then updated the mirror url to connect to that port. Seems to be working as far as I can tell!
- I noticed that if a particular package wasn't yet mirrored from the index, then after a timeout you would get a 404. Makes sense, and it wouldn't be cached anyway. One thing though -- the timeout for this particular case didn't seem to obey the --request-timeout option, it was 30 seconds even though I had configured it to be 1 second.
- If devpi couldn't connect upstream, you would get a 500 right away rather than serving stale links. Not saying it shouldn't do that, just an observation.
- When I set up the netcat listener and used a package that was in the mirrored index, it would return the page with Pragma: no-cache, which I think is the new feature you added?
> When I set up the netcat listener and used a package that was in the mirrored index, it would return the page with Pragma: no-cache, which I think is the new feature you added?
Yes, and together with the new nginx caching config it should prevent the page from being cached.
Have to look into the rest at some point.