devpi Replicas with --requests-only

Hi :)

I have just noticed that there is duplicate documentation about the --requests-only feature:

First occurrence:

- makes it sound like the feature is well suited to satisfy “pip install” requests
- https://devpi.net/docs/devpi/devpi/latest/+doc/adminman/server.html#multi-process-high-performance-setups

Second occurrence:

- marks the feature as experimental
- warns that the feature is not suitable even for package downloads due to potential concurrent writes
- https://devpi.net/docs/devpi/devpi/latest/+doc/adminman/server.html#multiple-server-instances

From my understanding of the devpi transaction code, downloads should be possible even with sqlite. Is the second occurrence outdated?

Apr 23 '18 16:04 StephanErb

The latter one still stands. It will work if no mirror index is involved. If you only use private indexes and don't base on root/pypi or a custom mirror, then there will only be reads. As soon as there is a mirror index involved, even a read of a simple index page may cause a write to update the data from the mirror after the cache has expired and the available releases on the mirror have changed.
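
To illustrate why a plain read on a mirror index can turn into a write, here is a minimal Python sketch. It is a toy model and not devpi's implementation: the class `MirrorIndexSketch`, the `fetch_upstream` callback and the `writes` counter are hypothetical names chosen for illustration only.

```python
import time


class MirrorIndexSketch:
    """Toy model of a mirror index (hypothetical, not devpi code)."""

    def __init__(self, fetch_upstream, cache_expiry=1800, clock=time.monotonic):
        self.fetch_upstream = fetch_upstream  # callable: project -> list of releases
        self.cache_expiry = cache_expiry      # seconds, in the spirit of --mirror-cache-expiry
        self.clock = clock
        self.releases = {}      # project -> releases stored locally
        self.refreshed_at = {}  # project -> time of the last upstream check
        self.writes = 0         # how many writes this model needed

    def get_simple_page(self, project):
        now = self.clock()
        last = self.refreshed_at.get(project)
        if last is None or now - last >= self.cache_expiry:
            upstream = self.fetch_upstream(project)
            self.refreshed_at[project] = now
            if upstream != self.releases.get(project):
                # The "read" request now has to persist new data.
                self.releases[project] = upstream
                self.writes += 1
        return self.releases[project]
```

In this model a purely private index would never enter the refresh branch, which matches the statement that there are only reads when no mirror is involved.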

I have some ideas which could potentially drastically improve concurrent write performance when using postgresql, but they require quite some refactoring. I'm planning to write a prototype to test the performance, but haven't got time for that yet.

Apr 25 '18 11:04 fschulze

Forgot to add: That same backend change would make multiple instances more performant as well.

Apr 25 '18 11:04 fschulze

What would happen if one would use the feature even with root/pypi or other mirrors? Should we expect crashes or even corrupt data?

We have been using the following setup for a few days and have not noticed any issues so far:

nginx as load balancer:
- forwards GET requests from pip/setuptools to eight worker instances with --role replica --requests-only
- forwards everything else to one ordinary replica instance with --role replica

All nine replicas have the same master process configured. This is an ordinary master, so it runs without the --requests-only flag.

I would expect that only the master triggers the write operations. Those are then made available to the ordinary replicas via the +changelog mechanism. The request-only replicas learn about the newly replicated data once they open their next read transaction, but they will never perform write requests themselves.

So overall, I am not expecting any write conflicts, as there will only ever be one writer per database. Am I missing something? Thanks!
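
The routing rule described for the load balancer can be sketched in Python as follows. This only models the decision, not nginx itself; the function name, the upstream labels and the exact user agent pattern are assumptions made for illustration.

```python
import re

# Assumed pattern: installer user agents start with e.g. "pip/22.2.2" or "setuptools/65.3.0".
INSTALLER_AGENT_RE = re.compile(r"(setuptools|pip)/", re.IGNORECASE)


def choose_upstream(method, user_agent):
    """Return which pool a request would be forwarded to in the described setup."""
    if method == "GET" and INSTALLER_AGENT_RE.search(user_agent or ""):
        # Read-only installer traffic goes to the --requests-only workers.
        return "requests_only_workers"
    # Uploads, browser traffic and everything else go to the regular replica.
    return "regular_replica"
```

For example, `choose_upstream("GET", "pip/22.2.2")` selects the worker pool, while a `POST` from the same client is sent to the regular replica.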

Apr 25 '18 13:04 StephanErb

I'd say your thinking is correct. I hadn't thought of using replicas with --requests-only, good idea! Are you using devpi-web in that setup? If you do, then I'd suspect that the search and descriptions don't work as expected, as there is no worker thread to process the events.

Apr 25 '18 14:04 fschulze

Hello @StephanErb, is this configuration still working well for you? Also, how is --serverdir set up for your instances? Do the master and the 'ordinary' replica have their own directories (say 'masterDir' and 'replicaDir'), with the requests-only replicas also pointing to replicaDir? Thank you!

Aug 10 '18 19:08 sdwilliams61

@sdwilliams61 yeah the config still works for us. We are using the following commandlines:

Worker replica (we run about 8 of those per box):

/opt/devpi-4/replica/venv/bin/python3.6 /opt/devpi-4/replica/venv/bin/devpi-server --port 3154 --serverdir /opt/devpi-4/replica/data --role replica --logger-cfg /opt/devpi-4/replica/logger.yaml --keyfs-cache-size 32768 --mirror-cache-expiry 1800 --replica-max-retries 3 --master-url https://<SNIP>:8443 --theme /opt/devpi-4/theme --requests-only --threads 1

Regular replica:

/opt/devpi-4/replica/venv/bin/python3.6 /opt/devpi-4/replica/venv/bin/devpi-server --port 3142 --serverdir /opt/devpi-4/replica/data --role replica --logger-cfg /opt/devpi-4/replica/logger.yaml --keyfs-cache-size 32768 --mirror-cache-expiry 1800 --replica-max-retries 3 --master-url https://<SNIP>:8443 --theme /opt/devpi-4/theme

Master:

/opt/devpi-4/master/venv/bin/python3.6 /opt/devpi-4/master/venv/bin/devpi-server --port 3141 --serverdir /opt/devpi-4/master/data --role master --logger-cfg /opt/devpi-4/master/logger.yaml --keyfs-cache-size 32768 --mirror-cache-expiry 1800 --replica-max-retries 3

I hope this allows you to extract the information you are looking for. Right now we hide the regular replica and master from users and only expose the workers via nginx as those offer the best performance.

Aug 16 '18 15:08 StephanErb

Excellent! Thank you very much.

Aug 16 '18 19:08 sdwilliams61

We have been using this configuration successfully for some months now. One problem we have occasionally is the corruption of the .nodeinfo file in /replica/data. The (worker) replicas each read-then-write this single file. In cases of poor spacing between replica restarts (read: nearly simultaneous), this issue crops up. (Does the .nodeinfo data ever change?)
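
A general technique for avoiding a half-written file when several processes rewrite the same path is to write to a temporary file and rename it into place. The sketch below shows that technique only; it is not taken from devpi, the function name is hypothetical, and the JSON payload is an assumption. It prevents torn files but does not by itself prevent one process from overwriting another's update.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    """Write data as JSON so that readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory so the rename stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # atomic replacement of the target path
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```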

Mar 14 '19 14:03 sdwilliams61

@sdwilliams61 we plan to get rid of most remaining data in .nodeinfo anyway. I'll look into the issue you describe, I have a suspicion on what is causing that.

Mar 14 '19 16:03 fschulze

The .nodeinfo corruption should be fixed by this commit https://github.com/devpi/devpi/commit/cef4ebeb5b33b24f281ded5d35534b8e99ef32dc in upcoming devpi-server 6.0.0

Apr 25 '20 09:04 fschulze

Looking into setting up multiple read replicas as well. Is --requests-only still experimental and not safe with mirrored indexes? What's the currently recommended setup if you need replicas due to load but still want to be able to have the search UI work?

Aug 31 '22 22:08 jdavisp3

I would try to avoid replicas for load balancing. They were added for globally distributed data centers.

For the pip install use case I would recommend serving release files by nginx as is already done with the default nginx config created by devpi-genconfig. Additionally I would cache all requests to +simple pages with some form of re-validation depending on how important picking up new releases in time is. Maybe add general caching depending on installer user agents to avoid load when the +simple part is missing in the index URL for the installer, see https://github.com/devpi/devpi/blob/d5212d1e21849f6d058f9514f8035a60d5cc1f12/server/devpi_server/views.py#L68-L69

With latest pip and devpi-server you can also add caching depending on the accept header for PEP-691 application/vnd.pypi.simple.v1+json on all pages to capture all possible variants of index URLs.

Sep 01 '22 05:09 fschulze

Thanks @fschulze, caching for installer user agents sounds promising, appreciate the suggestions!

Sep 01 '22 13:09 jdavisp3

I think I have something working similar to what you describe. Since devpi doesn't seem to set cache control headers on anything but files, I need to proxy nginx to itself to add them. WDYT?

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m
                 inactive=60m use_temp_path=off;

upstream devpi {
    server ${WEB_DEVPI_SERVER};
}

map $http_x_forwarded_proto $thescheme {
    default $scheme;
    https https;
}

server {
    server_name ${WEB_SERVER_NAME};
    listen 8000;

    gzip on;
    gzip_min_length 2000;
    gzip_proxied any;
    gzip_types application/json;

    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;

    root /devpi/;

    location ~ /\+f/ {
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }
        expires max;
        try_files /+files$uri @proxy_to_app;
    }

    location ~ /\+doc/ {
        try_files $uri @proxy_to_app;
    }

    location / {
        error_page 418 = @proxy_to_app;
        return 418;
    }

    set $no_cache 1;

    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }

    location @proxy_to_app {
        add_header X-Cache-Status $upstream_cache_status;

        proxy_cache_bypass $no_cache;
        proxy_cache devpi_cache;

        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $thescheme;
        proxy_set_header X-Outside-URL $thescheme://$http_host;
        proxy_pass http://127.0.0.1:8001;
    }
}

server {
    server_name ${WEB_SERVER_NAME};
    listen 8001;

    access_log off;

    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;

    location / {
        if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
        {
            expires 1h;
        }
        proxy_pass http://devpi;
    }
}

Sep 03 '22 22:09 jdavisp3

I think you can use proxy_ignore_headers instead of recursive proxy.

You might want to add an if for /\+simple/? locations to enable caching.

Have you tested application/vnd.pypi.simple.v1+json from newer pip? It should produce different cache keys. You might want to add an if for that as well to explicitly enable caching.

Would you mind if I add this as an example to devpi genconfig when it works?

Sep 04 '22 08:09 fschulze

> I think you can use proxy_ignore_headers instead of recursive proxy.

I tried that; I may have been doing something wrong. But I'm not sure ignoring headers will work since devpi isn't returning any cache control headers to be ignored. The issue seems to be that nginx proxy caching needs the upstream server to actually say something needs to be cached.

A future fix might be to have devpi optionally return cache control headers for responses other than files?

> You might want to add an if for /\+simple/? locations to enable caching.
>
> Have you tested application/vnd.pypi.simple.v1+json from newer pip? It should produce different cache keys. You might want to add an if for that as well to explicitly enable caching.

Both of those seem to work, though collectively they are somewhat redundant. Also, for my use case I'd rather not cache responses to browsers. I just need to cache responses for package installers to handle high load from them.

> Would you mind if I add this as an example to devpi genconfig when it works?

Totally, I'll post a new version with the additional options for if blocks.

Sep 04 '22 16:09 jdavisp3

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m
                 inactive=60m use_temp_path=off;

upstream devpi {
    server ${WEB_DEVPI_SERVER};
}

map $http_x_forwarded_proto $thescheme {
    default $scheme;
    https https;
}

server {
    server_name ${WEB_SERVER_NAME};
    listen 8000;

    gzip on;
    gzip_min_length 2000;
    gzip_proxied any;
    gzip_types text/plain text/css text/xml
               application/json application/vnd.pypi.simple.v1+json
               application/javascript text/javascript
               application/xml application/xml+rss;

    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;

    root /devpi/;

    location ~ /\+f/ {
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }
        expires max;
        try_files /+files$uri @proxy_to_app;
    }

    location ~ /\+doc/ {
        try_files $uri @proxy_to_app;
    }

    location / {
        error_page 418 = @proxy_to_app;
        return 418;
    }

    set $no_cache 1;

    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }

    if ($request_uri ~ ".*/\+simple/")
    {
        set $no_cache 0;
    }

    if ($http_accept ~ ".*application/vnd\.pypi\.simple\.v1\+json")
    {
        set $no_cache 0;
    }

    location @proxy_to_app {
        add_header X-Cache-Status $upstream_cache_status;

        proxy_cache_bypass $no_cache;
        proxy_cache devpi_cache;

        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $thescheme;
        proxy_set_header X-Outside-URL $thescheme://$http_host;
        proxy_pass http://127.0.0.1:8001;
    }
}

server {
    server_name ${WEB_SERVER_NAME};
    listen 8001;

    access_log off;

    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;

    set $no_cache 1;

    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }

    if ($request_uri ~ ".*/\+simple/")
    {
        set $no_cache 0;
    }

    if ($http_accept ~ ".*application/vnd\.pypi\.simple\.v1\+json")
    {
        set $no_cache 0;
    }

    set $expires_value "1h";
    if ($no_cache) {
        set $expires_value "off";
    }

    location / {
        expires $expires_value;
        proxy_pass http://devpi;
    }
}

Sep 04 '22 16:09 jdavisp3

I plan to use a simpler version that caches based on user agent only. The reason we need a cache at all is the load from lots of CI/CD builds happening together. I don't see any need to cache responses for people browsing around or using curl, which doesn't happen that much.

proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m
                 inactive=10m use_temp_path=off;

upstream devpi {
    server ${WEB_DEVPI_SERVER};
}

map $http_x_forwarded_proto $thescheme {
    default $scheme;
    https https;
}

server {
    server_name ${WEB_SERVER_NAME};
    listen 8000;

    gzip on;
    gzip_min_length 2000;
    gzip_proxied any;
    gzip_types text/plain text/css text/xml
               application/json application/vnd.pypi.simple.v1+json
               application/javascript text/javascript
               application/xml application/xml+rss;

    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;

    root /devpi/;

    location ~ /\+f/ {
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }
        expires max;
        try_files /+files$uri @proxy_to_app;
    }

    location ~ /\+doc/ {
        try_files $uri @proxy_to_app;
    }

    location / {
        error_page 418 = @proxy_to_app;
        return 418;
    }

    set $no_cache 1;

    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }

    location @proxy_to_app {
        add_header X-Cache-Status $upstream_cache_status;

        proxy_cache devpi_cache;
        proxy_cache_bypass $no_cache;

        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $thescheme;
        proxy_set_header X-Outside-URL $thescheme://$http_host;

        proxy_pass http://127.0.0.1:8001;
    }
}

server {
    server_name ${WEB_SERVER_NAME};
    listen 8001;

    access_log off;

    client_max_body_size ${WEB_CLIENT_MAX_BODY_SIZE};
    proxy_read_timeout ${WEB_PROXY_TIMEOUT}s;

    set $no_cache 1;

    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }

    set $expires_value "10m";
    if ($no_cache) {
        set $expires_value "off";
    }

    location / {
        expires $expires_value;
        proxy_pass http://devpi;
    }
}

Sep 04 '22 20:09 jdavisp3

Testing with a local index seems to work. I uploaded an internal package to a local index and then did a pip install of it. Then I uploaded a new version and did a pip install --upgrade of the same. For the duration of the cache lifetime, the upgrade did nothing, then afterwards it found the new version and installed it.
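
The observed behaviour (the upgrade is invisible until the cache lifetime has passed) is what a simple time-based cache would produce. Here is a small Python model of that; it is an illustration only, the class and parameter names are made up, and unlike nginx it reports an expired entry simply as a miss.

```python
import time


class TtlResponseCache:
    """Toy time-based cache in front of a backend (hypothetical, not nginx)."""

    def __init__(self, fetch, ttl_seconds=600, clock=time.monotonic):
        self.fetch = fetch            # callable: url -> (status, body)
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}            # url -> (stored_at, body)

    def get(self, url):
        now = self.clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1], "HIT"
        status, body = self.fetch(url)
        if status == 200:
            # Only successful responses are stored.
            self._entries[url] = (now, body)
        else:
            self._entries.pop(url, None)
        return body, "MISS"
```

With a 10 minute lifetime, a release uploaded right after the first request stays hidden from installers for up to 10 minutes, which is the trade-off between load and freshness mentioned earlier in the thread.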

Sep 04 '22 20:09 jdavisp3

The key for caching without recursive proxy seems to be: proxy_cache_valid 200 10m;

The current output I have for devpi-genconfig:

# adjust to your system and liking
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m inactive=1800s use_temp_path=off;

map $http_x_forwarded_proto $x_scheme {
    default $scheme;
    http http;
    https https;
}

server {
    server_name localhost $hostname "";
    listen 80;
    gzip             on;
    gzip_min_length  2000;
    gzip_proxied     any;
    # add application/vnd.pypi.simple.v1+json to the gzip_types
    gzip_types  text/plain text/css text/xml
                application/json application/vnd.pypi.simple.v1+json
                application/javascript text/javascript
                application/xml application/xml+rss;

    proxy_read_timeout 60s;
    client_max_body_size 64M;

    # set to where your devpi-server state is on the filesystem
    root /Users/fschulze/.devpi/server;

    # by default we bypass the cache
    set $no_cache 1;

    # if we detect a known installer, we enable caching
    if ($http_user_agent ~* "([^ ]* )*(distribute|setuptools|pip|pex)/.*")
    {
        set $no_cache 0;
    }

    # for https://peps.python.org/pep-0691/ we also enable caching
    if ($http_accept ~ ".*application/vnd\.pypi\.simple\.v1\+json")
    {
        set $no_cache 0;
    }

    # try serving static files directly
    location ~ /\+f/ {
        # workaround to pass non-GET/HEAD requests through to the named location below
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }

        expires max;
        try_files /+files$uri @proxy_to_app;
    }

    # try serving docs directly
    location ~ /\+doc/ {
        # if the --documentation-path option of devpi-web is used,
        # then the root must be set accordingly here
        root /Users/fschulze/.devpi/server;
        try_files $uri @proxy_to_app;
    }

    location / {
        # workaround to pass all requests to / through to the named location below
        error_page 418 = @proxy_to_app;
        return 418;
    }

    location @proxy_to_app {
        # use the keys_zone defined above
        proxy_cache devpi_cache;
        proxy_cache_bypass $no_cache;
        add_header X-Cached $upstream_cache_status;
        # adjust to your liking
        proxy_cache_valid 200 1800s;
        proxy_pass http://localhost:3141;
        proxy_set_header X-Forwarded-Proto $x_scheme;
        proxy_set_header X-outside-url $x_scheme://$http_host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

See https://github.com/fschulze/devpi/commit/a341a59f5e0533e15f1282de8c1528936fc5ec1b

Sep 05 '22 09:09 fschulze

Nice, proxy_cache_valid seems to be doing the trick!

Sep 05 '22 15:09 jdavisp3

After some experimentation, it occurs to me that files should never be cached by nginx, since they get cached on the filesystem. WDYT about adding this as the last if block?

    # do not cache files
    if ($request_uri ~ ".*/\+f/")
    {
        set $no_cache 1;
    }

Sep 05 '22 22:09 jdavisp3

It is better to set it in the location, see https://www.nginx.com/resources/wiki/start/topics/tutorials/config_pitfalls/ and https://www.nginx.com/resources/wiki/start/topics/depth/ifisevil/

I also renamed the variable and switched to using map.

You could move the ifs to the / location and remove the set $bypass_caching 1; from the other locations for a more efficient config. I kept them there with a comment for other use cases.

# adjust the path for your system,
# the size (in keys_zone) and the life time to your liking,
# by default the life time matches the mirror cache expiry setting
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=devpi_cache:10m inactive=1800s use_temp_path=off;

map $http_user_agent $devpi_installer_agent {
        default         0;
        ~*distribute/   1;
        ~*setuptools/   1;
        ~*pip/          1;
        ~*pex/          1;
}

map $http_accept $devpi_installer_accept {
        default                                     0;
        ~*application/vnd\.pypi\.simple\.v1\+json   1;
}

map $http_x_forwarded_proto $x_scheme {
    default $scheme;
    http http;
    https https;
}

server {
    server_name localhost $hostname "";
    listen 80;
    gzip             on;
    gzip_min_length  2000;
    gzip_proxied     any;
    # add application/vnd.pypi.simple.v1+json to the gzip_types
    gzip_types  text/plain text/css text/xml
                application/json application/vnd.pypi.simple.v1+json
                application/javascript text/javascript
                application/xml application/xml+rss;

    proxy_read_timeout 60s;
    client_max_body_size 64M;

    # set to where your devpi-server state is on the filesystem
    root /Users/fschulze/.devpi/server;

    # by default we bypass the cache
    set $bypass_caching 1;

    # if we detect a known installer, we enable caching
    if ($devpi_installer_agent)
    {
        set $bypass_caching 0;
    }

    # for https://peps.python.org/pep-0691/ we also enable caching
    if ($devpi_installer_accept)
    {
        set $bypass_caching 0;
    }

    # try serving static files directly
    location ~ /\+f/ {
        # workaround to pass non-GET/HEAD requests through to the named location below
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }

        expires max;
        # in case we let nginx cache responses, we disable it here.
        # if you use a backend which doesn't have files on the filesystem
        # or your nginx can't access them, set it to 0 instead:
        set $bypass_caching 1;
        try_files /+files$uri @proxy_to_app;
    }

    # try serving docs directly
    location ~ /\+doc/ {
        # if the --documentation-path option of devpi-web is used,
        # then the root must be set accordingly here
        root /Users/fschulze/.devpi/server;
        # in case we let nginx cache responses, we disable it here.
        # if you use a backend which doesn't have files on the filesystem
        # or your nginx can't access them, set it to 0 instead:
        set $bypass_caching 1;
        try_files $uri @proxy_to_app;
    }

    location / {
        # workaround to pass all requests to / through to the named location below
        error_page 418 = @proxy_to_app;
        return 418;
    }

    location @proxy_to_app {
        # use the keys_zone defined above
        proxy_cache devpi_cache;
        proxy_cache_bypass $bypass_caching;
        add_header X-Cached $upstream_cache_status;
        # adjust the life time to your liking, by default it matches
        # the mirror cache expiry setting
        proxy_cache_valid 200 1800s;
        proxy_pass http://localhost:3141;
        proxy_set_header X-Forwarded-Proto $x_scheme;
        proxy_set_header X-outside-url $x_scheme://$http_host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
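
As a reading aid, the bypass decision of the config above can be modelled in Python. This reflects one reading of the config and has not been checked against nginx itself; the function and constant names are hypothetical.

```python
import re

# Mirrors the two maps: installer user agents and the PEP 691 JSON accept header.
INSTALLER_AGENT_RE = re.compile(r"(distribute|setuptools|pip|pex)/", re.IGNORECASE)
PEP691_ACCEPT_RE = re.compile(r"application/vnd\.pypi\.simple\.v1\+json", re.IGNORECASE)
# Mirrors the /+f/ and /+doc/ locations, which force bypassing again.
DIRECT_LOCATION_RE = re.compile(r"/\+(f|doc)/")


def bypass_caching(path, user_agent="", accept=""):
    """Return True if the request should bypass the proxy cache."""
    bypass = True  # by default the cache is bypassed
    if INSTALLER_AGENT_RE.search(user_agent):
        bypass = False
    if PEP691_ACCEPT_RE.search(accept):
        bypass = False
    if DIRECT_LOCATION_RE.search(path):
        bypass = True  # files and docs are served from the filesystem
    return bypass
```

For example, a pip request for a +simple page is not bypassed, while the same client fetching a /+f/ release file is.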

Sep 06 '22 13:09 fschulze

I like it!!

Sep 06 '22 15:09 jdavisp3

Something I think I noticed in testing -- if devpi timed out accessing upstream pypi.org, then it returned a "stale" result per https://github.com/devpi/devpi/blob/2d80b7be410a18f01068c10d375456a20f405a76/server/CHANGELOG#L116.

Subsequent runs would serve the stale content until the nginx cache expired. I wonder if a future version of devpi could return a cache control header disabling caching when it is serving stale content?

Sep 06 '22 16:09 jdavisp3

Can you try with my temp branch?

Sep 07 '22 10:09 fschulze

> Can you try with my temp branch?

Working well so far, though I'm not sure I've observed an upstream timeout again. I'll reduce the devpi timeout and try to reproduce.

Sep 07 '22 14:09 jdavisp3

You could force one by editing /etc/hosts to point pypi.org to 127.0.0.1, or you could change the mirror_url to example.com.

Sep 07 '22 16:09 fschulze

> You could force one by editing /etc/hosts to point pypi.org to 127.0.0.1, or you could change the mirror_url to example.com.

I think the tricky bit is I need it to time out rather than return 404. I ended up using nc -l PORTNO and then updated the mirror url to connect to that port. Seems to be working as far as I can tell!

* I noticed that if a particular package wasn't yet mirrored from the index, then after a timeout you would get a 404. Makes sense, and it wouldn't be cached anyway. One thing though -- the timeout for this particular case didn't seem to obey the --request-timeout option, it was 30 seconds even though I had configured it to be 1 second.
* If devpi couldn't connect upstream, you would get a 500 right away rather than serving stale links. Not saying it shouldn't do that, just an observation.
* When I set up the netcat listener and used a package that was in the mirrored index, it would return the page with Pragma: no-cache, which I think is the new feature you added?
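
The `nc -l PORTNO` trick can be approximated in Python for repeatable tests: a listener that accepts connections but never answers, so the client runs into its read timeout instead of getting a refused connection or a 404. This is a sketch under assumptions; the function name and the default port choice are illustrative.

```python
import socket
import threading


def start_silent_listener(host="127.0.0.1", port=0):
    """Accept TCP connections and never reply. Returns (port, stop_event)."""
    server = socket.create_server((host, port))
    server.settimeout(0.2)  # wake up regularly to check the stop flag
    bound_port = server.getsockname()[1]
    stop = threading.Event()
    held = []  # keep accepted connections open so clients are not reset

    def accept_loop():
        while not stop.is_set():
            try:
                conn, _ = server.accept()
            except socket.timeout:
                continue
            except OSError:
                break
            held.append(conn)
        for conn in held:
            conn.close()
        server.close()

    threading.Thread(target=accept_loop, daemon=True).start()
    return bound_port, stop
```

Pointing the mirror URL at `http://127.0.0.1:<port>/` should then make upstream requests hang until the configured timeout; call `stop.set()` to shut the listener down.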

Sep 07 '22 23:09 jdavisp3

> * When I set up the `netcat` listener and used a package that was in the mirrored index, it would return the page with `Pragma: no-cache`, which I think is the new feature you added?

Yes, and together with the new nginx caching config it should prevent the page from being cached.
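
When reproducing the stale case, a small helper like the following could be used to check whether a response carries a header that asks caches not to store it. It is a generic HTTP-level sketch with a hypothetical function name; which of these headers a particular proxy honors depends on that proxy and its configuration.

```python
def forbids_caching(headers):
    """Return True if the response headers ask caches not to store the response."""
    normalized = {key.lower(): value.lower() for key, value in headers.items()}
    if "no-cache" in normalized.get("pragma", ""):
        return True
    # Cache-Control is a comma separated list of directives, some with "=value".
    directives = {
        part.strip().split("=")[0]
        for part in normalized.get("cache-control", "").split(",")
    }
    return bool(directives & {"no-cache", "no-store", "private"})
```

The header dict could come from any HTTP client, for example the headers of a response fetched from the devpi instance while the upstream mirror is unreachable.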

Have to look into the rest at some point.

Sep 08 '22 05:09 fschulze
