bandersnatch
Issues serving via S3 static website
I've run into an issue when trying to pull packages from a bucket-backed static site, but I can't tell whether the cause is my config or a change in static site behaviour (and how pip deals with it):
WARNING: Skipping page http://<bucket name.region>.amazonaws.com/mirror/web/simple/pillow/ because the GET request got Content-Type: binary/octet-stream. The only supported Content-Types are application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, and text/html
ERROR: Could not find a version that satisfies the requirement pillow (from versions: none)
ERROR: No matching distribution found for pillow
I notice that curling /web/simple/<package>/ returns a 302, which leads me to think this is more of a static site / pip handling issue that would affect the bandersnatch implementation:
<html>
<head><title>302 Moved Temporarily</title></head>
<body>
<h1>302 Moved Temporarily</h1>
<ul>
<li>Code: Found</li>
<li>Message: Resource Found</li>
<li>RequestId:</li>
<li>HostId:</li>
</ul>
<hr/>
</body>
</html>
My current deployment of bandersnatch uses the template below as the base for its configuration:
[mirror]
directory = /{{ s3_bucket_name }}/{{ s3_file_prefix }}
storage-backend = s3
diff-file = /{{ s3_bucket_name}}/{{ s3_file_prefix }}/{{ s3_diff_file }}
json = false
master = https://pypi.org
timeout = 60
hash-index = false
workers = 6
stop-on-error = false
delete-packages = true
[s3]
region_name = {{ aws_region }}
aws_access_key_id = {{ s3_access_key }}
aws_secret_access_key = {{ s3_secret_key }}
endpoint_url = {{ s3_endpoint_url }}
signature_version = s3v4
[plugins]
enabled =
    exclude_platform
    allowlist_project
[blocklist]
platforms =
    macos
    freebsd
[allowlist]
packages =
{%+ for package in package_allowlist -%}{{ package }}
{% endfor %}
I'm wondering if this is misconfig on my part or maybe recent change on AWS side that just breaks this design.
This is definitely a serving configuration issue. You need to make the S3 HTTP response headers send Content-Type: text/html if you're serving an index.html, or application/vnd.pypi.simple.v1+json if you're serving the JSON file, to make pip happy.
My quick search (linked above) says there is no default, and you're somehow sending Content-Type: binary/octet-stream. Correcting that should fix the issue.
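To make the Content-Type requirement concrete, here is a minimal Python sketch of the mapping described above. The three accepted types are copied from pip's warning earlier in this thread. The file names index.v1_html and index.v1_json are an assumption about how the mirror names its versioned index files, and both helper functions are hypothetical, not part of bandersnatch or pip.

```python
from typing import Optional

# The three Content-Types pip accepts, copied from the warning in this thread.
PIP_SUPPORTED_TYPES = {
    "application/vnd.pypi.simple.v1+json",
    "application/vnd.pypi.simple.v1+html",
    "text/html",
}

# index.html comes from this thread; the two "v1" file names are an
# assumption about the mirror's layout, so adjust them to match your bucket.
CONTENT_TYPE_BY_FILENAME = {
    "index.html": "text/html",
    "index.v1_html": "application/vnd.pypi.simple.v1+html",
    "index.v1_json": "application/vnd.pypi.simple.v1+json",
}


def expected_content_type(key: str) -> Optional[str]:
    """Return the Content-Type an index object should carry, or None."""
    filename = key.rsplit("/", 1)[-1]
    return CONTENT_TYPE_BY_FILENAME.get(filename)


def pip_accepts(content_type_header: str) -> bool:
    """Return True if this Content-Type header is one pip lists as supported."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in PIP_SUPPORTED_TYPES
```

With this, pip_accepts("binary/octet-stream") is False, which matches the "Skipping page" warning, while an index.html object should map to text/html.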
I'm happy to take documentation updates to https://bandersnatch.readthedocs.io/en/latest/storage_options.html#amazon-s3 (source file) if you feel our docs are lacking. Sadly, I've never set up an S3-based mirror, so I can't help much more here.
I've taken a second look at things with a fresh pair of eyes. I think you pointed me in the right direction with the Content-Type.
From what I can tell, the bandersnatch s3 plugin isn't specifying a MIME type when doing a PutObject to S3, which results in AWS giving the object the default of binary/octet-stream:
aws s3api head-object --bucket <bucketname> --key web/simple/index.html
{
"AcceptRanges": "bytes",
"LastModified": "2023-11-27T12:24:18+00:00",
"ContentLength": 422,
"ETag": omitted,
"ContentType": "binary/octet-stream",
"ServerSideEncryption": "AES256",
"Metadata": {}
}
From some surface-level digging, it looks like S3Path is being used to get the files to S3, and there's a conversation about passing the Content-Type as a parameter in an existing issue:
https://github.com/liormizr/s3path/issues/83#issuecomment-869729917
I sadly lack the talent and knowledge on bandersnatch to know how to go about fixing things. (If what I mention sounds right)
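For illustration only, this is roughly what setting the type at write time looks like when calling the S3 API directly through a boto3-style client. It is a sketch under stated assumptions, not bandersnatch's or s3path's actual code: upload_with_content_type and guess_content_type are hypothetical helpers, and the client is passed in so that nothing here depends on how the storage plugin builds its connection.

```python
import mimetypes
from typing import Any


def guess_content_type(key: str) -> str:
    """Guess a Content-Type from the key's extension, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(key)
    return guessed or "application/octet-stream"


def upload_with_content_type(client: Any, bucket: str, key: str, body: bytes) -> str:
    """Upload body to bucket/key and set ContentType explicitly.

    client is assumed to be a boto3 S3 client (boto3.client("s3")) or any
    object exposing the same put_object keyword arguments.
    """
    content_type = guess_content_type(key)
    client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType=content_type,  # without this, the object ends up as binary/octet-stream
    )
    return content_type


# Hypothetical usage (requires boto3 and valid credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   upload_with_content_type(s3, "my-bucket", "web/simple/index.html", b"<html></html>")
```

Extension-based guessing covers index.html, but it would not know about the PyPI-specific JSON type, so a real fix would likely need an explicit mapping for those files.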
Ahh, it seems if it is set @ upload / write time, then this is indeed a bandersnatch bug. Nice find.
I'm asking on the issue whether there are plans for a friendlier API, and how we can edit the ContentType of existing files ...
You can put a CDN in front of the bucket to serve the mirror; it could be cheaper, and the Content-Type can be changed there as well.
Use https://github.com/pottava/aws-s3-proxy and nginx to set content-type if you're using this for internal use only.
I also encountered this bug in the s3 server... until it's fixed I had to do a recursive fix of the content-types of the index.html pages in my bucket:
aws s3 cp \
s3://MY_BUCKET/data/web/simple/ \
s3://MY_BUCKET/data/web/simple/ \
--exclude '*' \
--include '*.html' \
--no-guess-mime-type \
--content-type="text/html" \
--metadata-directive="REPLACE" \
--recursive
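If you would rather script the same workaround than use the CLI, here is a rough Python equivalent. It assumes a boto3 S3 client is passed in, and fix_html_content_types is a hypothetical helper, not something bandersnatch ships. Like the command above, it copies each object onto itself with MetadataDirective set to REPLACE, which also discards any user metadata on the object (the head-object output earlier in the thread shows none).

```python
from typing import Any, Iterator, List


def iter_html_keys(client: Any, bucket: str, prefix: str) -> Iterator[str]:
    """Yield every key under prefix that ends in .html."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".html"):
                yield obj["Key"]


def fix_html_content_types(
    client: Any, bucket: str, prefix: str, dry_run: bool = True
) -> List[str]:
    """Rewrite ContentType to text/html where it differs; return affected keys."""
    affected = []
    for key in iter_html_keys(client, bucket, prefix):
        head = client.head_object(Bucket=bucket, Key=key)
        if head.get("ContentType") == "text/html":
            continue  # already correct, no copy needed
        if not dry_run:
            # In-place copy: same source and destination, new metadata.
            client.copy_object(
                Bucket=bucket,
                Key=key,
                CopySource={"Bucket": bucket, "Key": key},
                ContentType="text/html",
                MetadataDirective="REPLACE",
            )
        affected.append(key)
    return affected


# Hypothetical usage (requires boto3 and valid credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   print(fix_html_content_types(s3, "MY_BUCKET", "data/web/simple/", dry_run=False))
```

The dry_run default is a deliberate choice so a first run only reports which keys would change. Until the upload path sets the type itself, the fix has to be repeated after each mirror run.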