How can I ignore "*-nightly" packages and releases when using bandersnatch?

Open r00t1900 opened this issue 3 years ago • 4 comments

brief

I would like to ignore all nightly-build packages and releases when mirroring PyPI with bandersnatch.

description

I don't need the nightly builds of any package, and https://pypi.org/stats shows that about 1 TB of data belongs to "-nightly-" projects. I've manually added them to the blocklist, but I would like a more accurate approach that matches all nightly builds and skips downloading them.

appeal

I see that there is a regex plugin, but I don't know how to write the pattern, since my goal is to ignore packages rather than fetch them, and I don't know whether there is an easier way. So I decided to ask for community and official suggestions, and I hope someone who knows can help me with this.

Looking forward to the answer.

r00t1900 avatar Mar 19 '22 11:03 r00t1900

Howdy, This is a good idea.

But to do this more accurately than a bunch of regexes we need metadata stored somewhere accessible @ pypi.org. Then bandersnatch can use that. Today (and I hope to be wrong) I know of no such metadata.

I quickly checked the JSON API for tf-nightly (the largest nightly package @ ~400 GB) and there is nothing that indicates it is a nightly package. Adding such metadata would need a warehouse issue to be raised.

Potential Metadata Options

  • JSON API extension
    • Maybe add a bool to the "info" field for nightly, or a package type
    • This would require users to specify it
  • Add a classifier for Nightly or release type
    • https://pypi.org/classifiers/
    • This would also rely on users to add the classifier
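
As a rough sketch of the classifier option, a mirror-side check could look something like the following. The classifier string is hypothetical (PyPI defines no such classifier today), and `is_marked_nightly` is an illustrative helper, not part of bandersnatch. The only thing assumed about the real JSON API is that project metadata exposes a list under `info.classifiers`.

```python
# HYPOTHETICAL classifier: this string does not exist on pypi.org/classifiers.
HYPOTHETICAL_NIGHTLY_CLASSIFIER = "Release Type :: Nightly"


def is_marked_nightly(project_json: dict) -> bool:
    """Return True if the project's metadata carries the hypothetical classifier.

    `project_json` is the parsed body of a project's JSON API response.
    """
    # Missing or null fields are treated as "no classifiers".
    info = project_json.get("info") or {}
    classifiers = info.get("classifiers") or []
    return HYPOTHETICAL_NIGHTLY_CLASSIFIER in classifiers
```

As noted in the list, this only helps if project maintainers actually set the classifier.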

Any ideas from other people reading?

cooperlees avatar Mar 20 '22 18:03 cooperlees

That would be very nice. But adding meta info to all packages may be a really huge project. Will PyPI accept this?

r00t1900 avatar Apr 03 '22 15:04 r00t1900

[plugins]
enabled =
    regex_project
    blocklist_project
    prerelease_release

[filter_regex]
packages =
    .+-nightly(-|$)

[blocklist]
packages =
    uselesscapitalquiz

[filter_prerelease]
packages =
    duckdb
    graphscope-client
    lalsuite
    gs-engine
    gs-include
    bigdl-dllib
    bigdl-dllib-spark2
    bigdl-dllib-spark3

Some metadata would be nice. I'd suggest that PyPA enforce a naming convention or metadata label for projects with very frequent releases, especially relatively large ones. For other sound use cases, a request can be filed in warehouse, like those for size limits.

In the meantime, I'll be using the config above, which excludes all *-nightly-* and *-nightly projects, plus the pre-releases of some hand-picked awful projects that spam their pre-releases with nightly or even per-commit builds. uselesscapitalquiz is blocked because it causes a file name length overflow.
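
To sanity-check which project names the `.+-nightly(-|$)` pattern from the config would catch, here is a small standalone sketch using Python's `re` module. It assumes the plugin applies the pattern from the start of the project name; because the pattern begins with `.+`, matching from the start and searching anywhere give the same result. `is_filtered` and the sample names other than tf-nightly and duckdb are illustrative only.

```python
import re

# Same pattern as in the [filter_regex] section above.
NIGHTLY = re.compile(r".+-nightly(-|$)")


def is_filtered(name: str) -> bool:
    """Return True if the project name would be caught by the pattern."""
    return NIGHTLY.match(name) is not None


if __name__ == "__main__":
    for name in ("tf-nightly", "tf-nightly-gpu", "nightly", "tf-nightlyish", "duckdb"):
        print(f"{name}: {'filtered' if is_filtered(name) else 'kept'}")
```

Note that a project literally named `nightly`, or one where `-nightly` is followed by something other than a hyphen, is not matched, so the pattern stays fairly conservative.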

TechCiel avatar Jan 09 '23 20:01 TechCiel

We have size limits in place (per project and per file), but no traffic limit... So constantly refreshing a relatively large project incurs huge traffic for mirrors: 10 builds of 500 MiB are much more horrific than one 2 GiB build.

By this I mean, for example, duckdb, which publishes a release for literally every commit.
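
The arithmetic behind that comparison, as a quick sketch (the build counts and sizes are just the illustrative figures from above, not measurements of any real project):

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Ten frequent builds of 500 MiB each versus a single 2 GiB build.
frequent_bytes = 10 * 500 * MIB
single_bytes = 2 * GIB

# How many times more data a mirror has to pull for the frequent builds.
ratio = frequent_bytes / single_bytes

if __name__ == "__main__":
    print(f"frequent builds: {frequent_bytes / GIB:.2f} GiB")
    print(f"single build:    {single_bytes / GIB:.2f} GiB")
    print(f"ratio:           {ratio:.2f}x")
```

So the ten smaller builds cost a mirror roughly two and a half times the traffic of the single larger one, even though each stays well under a per-file size limit.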

TechCiel avatar Jan 09 '23 20:01 TechCiel