bandersnatch
bandersnatch copied to clipboard
Enhancement: Filter packages with very long versions
There is currently a package "uselesscaptialquiz" that has a very long version name. So long that sync'ing the package fails on Ubuntu 20.04 because the result package file name is too long for the OS, and the mirror fails.
Request: add filtering option to ignore packages which have version names so long that the OS gives an error when sync'ing.
What do you exactly want here? I guess you want automatic catching of the OSError (I guess it is) exception when this occurs and just print an error saying we're ignoring package X cause it's naming/versioning is to long for the storage?
For now to avoid the error. you could just deny list it: https://bandersnatch.readthedocs.io/en/latest/filtering_configuration.html#allowlist-blocklist-filtering-settings
Sure, automatically catching and ignoring that error would work.
The underlying problem is that PyPi doesn’t stop people from creating packages with ridiculously long versions.
Thanks!
Craig
-- Craig Anderson @.***
From: Cooper Lees @.> Date: Friday, September 30, 2022 at 1:58 PM To: pypa/bandersnatch @.> Cc: Anderson, Craig @.>, Author @.> Subject: Re: [pypa/bandersnatch] Enhancement: Filter packages with very long versions (Issue #1228) [External Email]
What do you exactly want here? I guess you want automatic catching of the OSError (I guess it is) exception when this occurs and just print an error saying we're ignoring package X cause it's naming/versioning is to long for the storage?
For now to avoid the error. you could just deny list it: https://bandersnatch.readthedocs.io/en/latest/filtering_configuration.html#allowlist-blocklist-filtering-settingshttps://urldefense.proofpoint.com/v2/url?u=https-3A__bandersnatch.readthedocs.io_en_latest_filtering-5Fconfiguration.html-23allowlist-2Dblocklist-2Dfiltering-2Dsettings&d=DwMCaQ&c=Qwsh1H-X9ypOoLLEcAIltRyC0Dw0FG3Mmyd56ahml5w&r=IeOe_1i-fE4lTGcDyL6SeNFm2I5X2M8gL2A4KBIvJCE&m=tx3oo5yluaJ5Bt9-OXoA43fkLAOhFPeA691enGAmkJ75hQ4SEpCJI_Ky3SzgK9Le&s=0tstGtXhQ8rbBWaOYrUQSigYMlfIJdl_Phr-SdJh4Bc&e=
— Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_pypa_bandersnatch_issues_1228-23issuecomment-2D1264018324&d=DwMCaQ&c=Qwsh1H-X9ypOoLLEcAIltRyC0Dw0FG3Mmyd56ahml5w&r=IeOe_1i-fE4lTGcDyL6SeNFm2I5X2M8gL2A4KBIvJCE&m=tx3oo5yluaJ5Bt9-OXoA43fkLAOhFPeA691enGAmkJ75hQ4SEpCJI_Ky3SzgK9Le&s=kNaCrJwqbi5pX2-KXrSEFlw37_UxkE37SEIxpeb9GMQ&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AVER4OJA3744OGFKRFNUFFTWA5H55ANCNFSM6AAAAAAQ2DKTMQ&d=DwMCaQ&c=Qwsh1H-X9ypOoLLEcAIltRyC0Dw0FG3Mmyd56ahml5w&r=IeOe_1i-fE4lTGcDyL6SeNFm2I5X2M8gL2A4KBIvJCE&m=tx3oo5yluaJ5Bt9-OXoA43fkLAOhFPeA691enGAmkJ75hQ4SEpCJI_Ky3SzgK9Le&s=NBh-gd2qaz_Hmaa686heztaLgLwivrEiOsRTMybWY6M&e=. You are receiving this because you authored the thread.Message ID: @.***>
- CONFIDENTIAL-
This email and any files transmitted with it are confidential, and may also be legally privileged. If you are not the intended recipient, you may not review, use, copy, or distribute this message. If you receive this email in error, please notify the sender immediately by reply email and then delete this email.
I would accept a PR doing this with appropriate unit testing showing the behavior.
Agree PyPI should be more strict there. Have you search / opened an issue there? i.e. https://github.com/pypi/warehouse/issues
I’ll work on that PR.
And I’ll submit a PyPi issue if it doesn’t already exist.
Regards, Craig
-- Craig Anderson @.***
From: Cooper Lees @.> Date: Friday, September 30, 2022 at 2:04 PM To: pypa/bandersnatch @.> Cc: Anderson, Craig @.>, Author @.> Subject: Re: [pypa/bandersnatch] Enhancement: Filter packages with very long versions (Issue #1228) [External Email]
I would accept a PR doing this with appropriate unit testing showing the behavior.
Agree PyPI should be more strict there. Have you search / opened an issue there? i.e. https://github.com/pypi/warehouse/issueshttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_pypi_warehouse_issues&d=DwMCaQ&c=Qwsh1H-X9ypOoLLEcAIltRyC0Dw0FG3Mmyd56ahml5w&r=IeOe_1i-fE4lTGcDyL6SeNFm2I5X2M8gL2A4KBIvJCE&m=0lVGSFDgkmGsfJc2JPRqCStXLfL_lCOzqlxRJfuNBnb8ayPbFDoIeF2pyM7MDJ6V&s=aYksMS-0BJgoKO66xJJmmpVhyGQESjzjbbvoMZ-bDE0&e=
— Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_pypa_bandersnatch_issues_1228-23issuecomment-2D1264022810&d=DwMCaQ&c=Qwsh1H-X9ypOoLLEcAIltRyC0Dw0FG3Mmyd56ahml5w&r=IeOe_1i-fE4lTGcDyL6SeNFm2I5X2M8gL2A4KBIvJCE&m=0lVGSFDgkmGsfJc2JPRqCStXLfL_lCOzqlxRJfuNBnb8ayPbFDoIeF2pyM7MDJ6V&s=kuc9ZlCa1YTs6aI4m9U013EZVarC1uiaGfolw5aCM2s&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AVER4OIF7RSCDOXTEV6YBHLWA5IWDANCNFSM6AAAAAAQ2DKTMQ&d=DwMCaQ&c=Qwsh1H-X9ypOoLLEcAIltRyC0Dw0FG3Mmyd56ahml5w&r=IeOe_1i-fE4lTGcDyL6SeNFm2I5X2M8gL2A4KBIvJCE&m=0lVGSFDgkmGsfJc2JPRqCStXLfL_lCOzqlxRJfuNBnb8ayPbFDoIeF2pyM7MDJ6V&s=wtLaok13TsA92otKfNkq1GXYbltZUIYeMOiwZzInC6c&e=. You are receiving this because you authored the thread.Message ID: @.***>
- CONFIDENTIAL-
This email and any files transmitted with it are confidential, and may also be legally privileged. If you are not the intended recipient, you may not review, use, copy, or distribute this message. If you receive this email in error, please notify the sender immediately by reply email and then delete this email.
I've just run into this problem too. One big problem with it from a banderstatch point of view is that a new mirror will never get past serial 0 and all subsequent executions of bandersnatch mirror
will be looping over the same set of todo
packages to try (and fail) to get to serial X that was current at the time of the first execution.
Of course, that's the correct behaviour because bandersnatch has failed to get one of the todo packages due to an unhandled exception and I don't think any of us expected this to happen:
2022-10-26 08:41:43,150 INFO: Downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:875)
2022-10-26 08:41:43,653 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:686)
Traceback (most recent call last):
File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 662, in sync_release_files
downloaded_file = await self.download_file(
File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 892, in download_file
with self.storage_backend.rewrite(path, "wb") as f:
File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/opt/bandersnatch/src/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
with tempfile.NamedTemporaryFile(
File "/usr/lib/python3.8/tempfile.py", line 679, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/usr/lib/python3.8/tempfile.py", line 389, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mnt/mirrors/pypi/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.7lqjgi8o'
2022-10-26 08:41:43,658 ERROR: Error syncing package: uselesscapitalquiz@14521754 (mirror.py:377)
Traceback (most recent call last):
File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 130, in package_syncer
await self.process_package(package)
File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 337, in process_package
await self.sync_release_files(package)
File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 693, in sync_release_files
raise deferred_exception # raise the exception after trying all files
File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 662, in sync_release_files
downloaded_file = await self.download_file(
File "/opt/bandersnatch/src/bandersnatch/mirror.py", line 892, in download_file
with self.storage_backend.rewrite(path, "wb") as f:
File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/opt/bandersnatch/src/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
with tempfile.NamedTemporaryFile(
File "/usr/lib/python3.8/tempfile.py", line 679, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/usr/lib/python3.8/tempfile.py", line 389, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mnt/mirrors/pypi/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.7lqjgi8o'
I'm not sure how best to deal with this:
- bandersnatch could jump through hoops to accurately reflect the troll package. It could do something like store it in a shortened name, and update the metadata. I don't like this idea because it's making us do exceptional work for conrived troll packages.
- alternatively, bandersnatch could just not support filenames longer than X and drop the packages (treating the exception as OK, and not recording the issue as an error; perhaps just a warning on output). This isn't great as we'd probably be breaking various PEPs.
- we could leave it as it is, and it's up to the user to troubleshoot the issue and add offending packages to the disallowlist. This isn't very user friendly however; I've probably wasted about an hour on this already. 😃
- we could put packages that fail for unexpected reasons like this into a
problem_packages
file (liketodo
) and not record it as an error. This would allow the process to complete the loop and not view the outcome as a failure: the serial number will get updated,todo
will be empty,problem_packages
will reflect the fact that certain packages had issues and can be retried withbandersnatch sync
. I like this the best because it correctly records the minority of packages that didn't fully sync, allows easy remedial action, and allows bandersnatch to continue operating normally without being stuck on serial 0.
Thanks for sharing your experience.
Here are my thoughts on each of your suggestions:
- I do not wish to implement logic like this as it will hide problems and create new ones potentially
- As stated above, it would be nice to have PyPI not allow this naming/versioning at all
- We could just see the filename to long exception and then just state due to it, we're skipping he package in logs
- Long as we're not silent about it, I'm open to behavior like this ...
- I am sorry it ate time, but the error in the logs is pretty clear ...
- This just sounds like a separate denylist? Maybe we could just add a
--no-filtering
switch tosync
so people can still manually try denylisted packages via the sync sub command?
I'm really not annoyed about occasionally having to occasionally waste some time fixing problems with this. The amount of time I've saved with this project is immense, and I really like the project.
My suggestion of a problem_packages
file was not for the user to create an exception list (no, that would just be like the blocklist); rather I was suggesting that the tool could record the fact that the package had issues.
I don't know how other people use bandersnatch, but I find tracking problems quite difficult. The logs are very verbose! If I run bandersnatch mirror
in a screen terminal, then any errors that may have been encountered are long lost above the maximum scrollback. If I have an issue I've got to rerun it and pipe the logs somewhere and grep -v INFO
to get rid of all the noise, and just hope that the issue appears.