bandersnatch
Problem with metadata fetching
I previously used Bandersnatch 4.4 to mirror specific packages, and it worked perfectly. Since I updated to version 5.0, it still downloads only the package releases in my allowlist, but it downloads the metadata of all packages without regard to my allowlist. I looked in the documentation, but I don't understand how I can prevent it from downloading metadata for packages that I don't want to mirror. Can you help me, please?
Here's my config:
[plugins]
enabled =
    blocklist_project
    blocklist_release
    whitelist_project
    allowlist_release
    exclude_platform
[blocklist]
plugins =
    exclude_platform
platforms =
    windows
    macos
    freebsd
[allowlist]
packages =
    altgraph>=0.17
    ansible>=2.9.12
    asn1crypto>=0.24.0
    bcrypt>=3.2.0
    Cerberus>=1.3.2
    certifi>=2018.8.24
    cffi>=1.14.3
    chardet>=3.0.4
    chrome-gnome-shell>=0.0.0
    colorclass>=2.2.0
    cryptography>=2.6.1
    cupshelpers>=1.0
    distro>=1.3.0
    distro-info>=0.21
    easygui>=0.98.1
    entrypoints>=0.3
    httplib2>=0.11.3
    idna>=2.6
    importlib-metadata>=1.7.0
    Jinja2>=2.11.2
    jmespath>=0.10.0
    jsonpickle>=1.4.1
    keyring>=17.1.1
    keyrings.alt>=3.1.1
    MarkupSafe>=1.1.1
    msoffcrypto-tool>=4.10.2
    olefile>=0.46
    oletools>=0.55.1
    paho-mqtt>=1.5.0
    paramiko>=2.7.2
    pcodedmp>=1.2.6
    pip>=18.1
    prometheus-client>=0.8.0
    psutil>=5.7.2
    pycairo>=1.16.2
    pycparser>=2.20
    pycrypto>=2.6.1
    pycups>=1.9.73
    pycurl>=7.43.0.2
    PyGObject>=3.30.4
    pyinotify>=0.9.6
    pyinstaller>=4.0
    pyinstaller-hooks-contrib>=2020.8
    pylru>=1.2.0
    PyNaCl>=1.4.0
    pyparsing>=2.4.7
    pysftp>=0.2.9
    PySimpleSOAP>=1.16.2
    pysmbc>=1.0.15.6
    pystemd>=0.7.0
    python-apt>=1.8.4.1
    python-debian>=0.1.35
    python-debianbts>=2.8.2
    python-magic>=0.4.18
    pyxdg>=0.25
    PyYAML>=5.3.1
    raptorq>=1.4.2
    reportbug>=7.5.3-deb10u1
    requests>=2.21.0
    SecretStorage>=2.3.1
    setuptools>=40.8.0
    six>=1.12.0
    tornado>=6.0.4
    typing-extensions>=3.6.4
    unattended-upgrades>=0.1
    Unidecode>=1.1.1
    uptime>=3.0.1
    urllib3>=1.24.1
    wheel>=0.32.3
    zipp>=3.1.0
    zstandard>=0.14.0
    paho-mqtt>=1.5.1
    toml>=0.9.0
    semantic-version>=2.6.0
    setuptools-rust>=0.11.4
And here is the output when I execute bandersnatch mirror:
2021-06-17 12:13:19,437 INFO: Selected storage backend: filesystem (configuration.py:126)
2021-06-17 12:13:19,437 INFO: Selected compare method: hash (configuration.py:172)
2021-06-17 12:13:19,574 INFO: Initialized project plugin blocklist_project, filtering [] (blocklist_name.py:27)
2021-06-17 12:13:19,669 INFO: Initialized release plugin allowlist_release, filtering [<Requirement('cryptography>=2.6.1')>, <Requirement('idna>=2.6')>, <Requirement('pcodedmp>=1.2.6')>, <Requirement('pygobject>=3.30.4')>, <Requirement('python-debianbts>=2.8.2')>, <Requirement('urllib3>=1.24.1')>, <Requirement('setuptools>=40.8.0')>, <Requirement('cupshelpers>=1.0')>, <Requirement('keyring>=17.1.1')>, <Requirement('psutil>=5.7.2')>, <Requirement('uptime>=3.0.1')>, <Requirement('easygui>=0.98.1')>, <Requirement('pysftp>=0.2.9')>, <Requirement('entrypoints>=0.3')>, <Requirement('wheel>=0.32.3')>, <Requirement('pycairo>=1.16.2')>, <Requirement('pysmbc>=1.0.15.6')>, <Requirement('pystemd>=0.7.0')>, <Requirement('distro>=1.3.0')>, <Requirement('pyinstaller>=4.0')>, <Requirement('toml>=0.9.0')>, <Requirement('prometheus-client>=0.8.0')>, <Requirement('colorclass>=2.2.0')>, <Requirement('keyrings-alt>=3.1.1')>, <Requirement('typing-extensions>=3.6.4')>, <Requirement('msoffcrypto-tool>=4.10.2')>, <Requirement('olefile>=0.46')>, <Requirement('pip>=18.1')>, <Requirement('python-magic>=0.4.18')>, <Requirement('six>=1.12.0')>, <Requirement('asn1crypto>=0.24.0')>, <Requirement('raptorq>=1.4.2')>, <Requirement('pycrypto>=2.6.1')>, <Requirement('pylru>=1.2.0')>, <Requirement('paho-mqtt>=1.5.1')>, <Requirement('jsonpickle>=1.4.1')>, <Requirement('pyinotify>=0.9.6')>, <Requirement('ansible>=2.9.12')>, <Requirement('requests>=2.21.0')>, <Requirement('tornado>=6.0.4')>, <Requirement('pycups>=1.9.73')>, <Requirement('distro-info>=0.21')>, <Requirement('pyparsing>=2.4.7')>, <Requirement('altgraph>=0.17')>, <Requirement('semantic-version>=2.6.0')>, <Requirement('paho-mqtt>=1.5.0')>, <Requirement('pyyaml>=5.3.1')>, <Requirement('markupsafe>=1.1.1')>, <Requirement('bcrypt>=3.2.0')>, <Requirement('jmespath>=0.10.0')>, <Requirement('setuptools-rust>=0.11.4')>, <Requirement('pynacl>=1.4.0')>, <Requirement('zstandard>=0.14.0')>, <Requirement('unidecode>=1.1.1')>, 
<Requirement('jinja2>=2.11.2')>, <Requirement('pycurl>=7.43.0.2')>, <Requirement('cerberus>=1.3.2')>, <Requirement('chrome-gnome-shell>=0.0.0')>, <Requirement('reportbug>=7.5.3-deb10u1')>, <Requirement('paramiko>=2.7.2')>, <Requirement('chardet>=3.0.4')>, <Requirement('zipp>=3.1.0')>, <Requirement('certifi>=2018.8.24')>, <Requirement('pycparser>=2.20')>, <Requirement('pyxdg>=0.25')>, <Requirement('python-debian>=0.1.35')>, <Requirement('httplib2>=0.11.3')>, <Requirement('oletools>=0.55.1')>, <Requirement('secretstorage>=2.3.1')>, <Requirement('cffi>=1.14.3')>, <Requirement('pysimplesoap>=1.16.2')>, <Requirement('pyinstaller-hooks-contrib>=2020.8')>, <Requirement('unattended-upgrades>=0.1')>, <Requirement('python-apt>=1.8.4.1')>, <Requirement('importlib-metadata>=1.7.0')>] (allowlist_name.py:170)
2021-06-17 12:13:19,672 INFO: Initialized release plugin blocklist_release, filtering [] (blocklist_name.py:110)
2021-06-17 12:13:19,687 INFO: Initialized exclude_platform plugin with ['.win32', '-win32', 'win_amd64', 'win-amd64', 'macosx_', 'macosx-', '.freebsd', '-freebsd'] (filename_name.py:85)
2021-06-17 12:13:19,937 INFO: Status file /home/admsrv/.bandersnatch/status missing. Starting over. (mirror.py:594)
2021-06-17 12:13:19,937 INFO: Syncing with https://pypi.org. (mirror.py:58)
2021-06-17 12:13:19,937 INFO: Current mirror serial: 0 (mirror.py:263)
2021-06-17 12:13:19,937 INFO: Resuming interrupted sync from local todo list. (mirror.py:270)
2021-06-17 12:13:21,566 INFO: Trying to reach serial: 10666172 (mirror.py:295)
2021-06-17 12:13:21,566 INFO: 277245 packages to sync. (mirror.py:297)
2021-06-17 12:13:21,566 INFO: No metadata filters are enabled. Skipping metadata filtering (mirror.py:77)
Hi,
Thanks for reporting.
Just to clarify, when you say "metadata" do you mean bandersnatch saves the JSON file to disk for every package on PyPI and not just what's in your Allow List?
Thanks.
Hi, yes, exactly.
I was investigating a different issue yesterday using the allowlist plugin to limit my downloads to only 3-4 packages. I was using the master branch, not the 5.0.0 tag, but in those tests, the tool only downloaded json files for the packages I expected.
It looks like you might be setting up the exclude_platform plugin incorrectly in your config file. You have exclude_platform listed under both the [plugins] section (as enabled =) and the [blocklist] section (as plugins =). Not sure if/how that could be related, but it's an observation. Check out https://bandersnatch.readthedocs.io/en/latest/filtering_configuration.html#platform-specific-binaries-filtering for the latest filter config syntax.
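For reference, here is a minimal sketch of the layout as I read it from that documentation page, wrapped in Python's standard configparser only to confirm that it parses. It assumes exclude_platform is enabled once under [plugins] and that [blocklist] carries only the platforms list; please check the linked page before relying on it.

```python
import configparser

# Assumed layout based on the linked filtering docs: the plugin is enabled
# once under [plugins]; [blocklist] only lists the platforms to skip.
CONFIG_TEXT = """
[plugins]
enabled =
    exclude_platform

[blocklist]
platforms =
    windows
    macos
    freebsd
"""


def multiline_values(parser, section, option):
    """Return the non-empty lines of a multi-line option as a list."""
    raw = parser.get(section, option, fallback="")
    return [line.strip() for line in raw.splitlines() if line.strip()]


parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

enabled = multiline_values(parser, "plugins", "enabled")
platforms = multiline_values(parser, "blocklist", "platforms")
print("enabled plugins:", enabled)
print("blocked platforms:", platforms)
```

The helper name multiline_values is mine, not part of bandersnatch.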
Other observations about your config file:
- I think you no longer need whitelist_project since you have converted to the allowlist naming.
- Your bandersnatch output seems to indicate the blocklist_project and blocklist_release filters are not doing anything (both report filtering []).
@Lulu300 I know this is the same question you already answered, but I want to be sure I understand your problem. When you say it "downloads the metadata of all packages", does that mean:
- metadata for every version of the packages in your allow list, or
- metadata for every package on PyPI?
In the first case you would have about 75 files in the json folder and in the second case you would have over 400,000 files.
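A quick way to tell the two cases apart is to count the entries in the json folder. This is only a sketch: the web/json location under the mirror directory is my assumption about the layout, and count_json_entries is a made-up helper name.

```python
from pathlib import Path


def count_json_entries(mirror_root):
    """Count regular files directly under <mirror_root>/web/json.

    The web/json location is an assumption about the mirror layout;
    adjust it to match the directory setting in your own config.
    """
    json_dir = Path(mirror_root) / "web" / "json"
    if not json_dir.is_dir():
        return 0
    return sum(1 for entry in json_dir.iterdir() if entry.is_file())
```

Calling count_json_entries with the path of your mirror directory should return a number close to the size of your allow list in the first case and a number in the hundreds of thousands in the second.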
I did some more testing over the weekend with different filter combinations to mirror a single version of a package: [allowlist] packages=jsonschema==3.2.0.
- When I enabled only the allowlist_project filter, the version number is ignored and every jsonschema release is downloaded.
- When I enabled both allowlist_project and allowlist_release, I got the intended behavior (download 1 version of 1 package).
- When I enabled only the allowlist_release filter (i.e. and not allowlist_project), the package name is ignored and bandersnatch begins creating index.html files for EVERY PACKAGE on PyPI. I did not want to wait for all the indexes to be created, so I changed my specified package to be 0rss==1.1.0 and the tool did properly filter the name/version so that it only downloaded the one specified. I am guessing that if I had let it run, it would have also created json files for every package.
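The three cases above are consistent with a two-stage picture: a project stage that decides which projects get processed at all, and a release stage that decides which versions of a processed project are kept. The toy model below illustrates that picture only. None of these functions are bandersnatch code, and the sample index is invented.

```python
# Toy model only: hypothetical stand-ins, not bandersnatch code.
ALLOWLIST = {"jsonschema": "3.2.0"}  # project name -> pinned version


def project_allowed(name, project_filter_on):
    """Project stage: decides whether a project is processed at all."""
    if not project_filter_on:
        return True  # no project filter: every project gets processed
    return name in ALLOWLIST


def release_allowed(name, version, release_filter_on):
    """Release stage: decides which versions of a processed project are kept."""
    if not release_filter_on:
        return True  # no release filter: every version is kept
    return ALLOWLIST.get(name) == version


def simulate(index, project_filter_on, release_filter_on):
    """Return (projects whose index/JSON is written, releases downloaded)."""
    processed, downloaded = [], []
    for name, versions in index.items():
        if not project_allowed(name, project_filter_on):
            continue
        processed.append(name)
        for version in versions:
            if release_allowed(name, version, release_filter_on):
                downloaded.append((name, version))
    return processed, downloaded


# Invented miniature "PyPI" for illustration.
INDEX = {"jsonschema": ["3.1.0", "3.2.0"], "requests": ["2.25.1"]}
for flags in [(True, False), (True, True), (False, True)]:
    print(flags, simulate(INDEX, *flags))
```

In the last combination (release filter only) the model processes every project but downloads only the pinned release, which matches what I saw.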
@Lulu300 According to the CHANGES.md, the blacklist/whitelist names will no longer work in the 5.0 configuration file. If you edit your config to replace whitelist_project with allowlist_project, that may solve your issue.
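If it helps, a small script like the one below can flag leftover legacy names in the [plugins] section. It is a sketch under the assumption that the rename is a plain prefix swap (whitelist_ to allowlist_, blacklist_ to blocklist_); suggest_renames is a hypothetical helper, not something bandersnatch provides.

```python
import configparser

# Old-to-new prefix mapping, assumed from the rename described above.
LEGACY_PREFIXES = {"whitelist_": "allowlist_", "blacklist_": "blocklist_"}


def suggest_renames(config_text):
    """Map each legacy plugin name under [plugins] enabled to a suggested new name."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    raw = parser.get("plugins", "enabled", fallback="")
    suggestions = {}
    for name in (line.strip() for line in raw.splitlines()):
        for old, new in LEGACY_PREFIXES.items():
            if name.startswith(old):
                suggestions[name] = new + name[len(old):]
    return suggestions


SAMPLE = """
[plugins]
enabled =
    blocklist_project
    whitelist_project
    allowlist_release
"""
print(suggest_renames(SAMPLE))
```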
FWIW, a PR fixing the filtering so that it also applies to saving JSON metadata files is welcome.
I am still trying to find time to dig into how the filtering works, but it seems to me the allowlist_release filter should either require or imply the allowlist_project filter. My first thought was that the AllowListRelease class should inherit from AllowListProject, not FilterReleasePlugin, but if there was a way to implicitly enable the project filter, that would work too.
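To make the "imply" option concrete, here is a rough sketch of what I have in mind, operating on the list of enabled plugin names. It is purely illustrative: with_implied_filters is a hypothetical helper and says nothing about where such a step would live in the real code.

```python
def with_implied_filters(enabled):
    """Return a copy of the plugin list with allowlist_project added
    when allowlist_release is enabled without it (hypothetical helper)."""
    result = list(enabled)
    if "allowlist_release" in result and "allowlist_project" not in result:
        # Place the implied project filter just before the release filter.
        result.insert(result.index("allowlist_release"), "allowlist_project")
    return result


print(with_implied_filters(["blocklist_project", "allowlist_release"]))
```

Lists that already enable allowlist_project, or that do not use allowlist_release at all, come back unchanged.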
I hope to find time to address this as well as the issue I found with verify, but my free time mostly comes after 10PM when ambition and focus are low.