dream2nix
relaunch nix-pypi-fetcher
The maximum size quota of the repo is reached: https://github.com/DavHau/nix-pypi-fetcher/actions/runs/3556707373/jobs/5974260270#step:4:3180
Solve this by:
- creating a new repo: nix-pypi-fetcher-2
- creating a commit on the old repo which throws an error when the default.nix is imported
I'd like to point out that at the current rate (one instance is about 200 MB, twice a day, so roughly 400 MB/day), and with git essentially having no 'diff'-based sharing between these revisions, we're going to be here again in less than 250 days.
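For reference, the arithmetic behind that estimate (the ~100 GB quota used here is an assumption; the CI log above does not state the exact limit):

# Back-of-the-envelope estimate; the quota value is an assumption, not taken from the CI log.
snapshot_mb = 200                                   # one update adds roughly 200 MB of history
updates_per_day = 2
growth_mb_per_day = snapshot_mb * updates_per_day   # ~400 MB/day
assumed_quota_mb = 100 * 1024                       # hypothetical ~100 GB limit
print(assumed_quota_mb / growth_mb_per_day)         # ~256 days for a fresh repo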
Why is there no diff-based sharing? I thought git only stores diffs, and is therefore more efficient than that.
BTW, today I put up https://github.com/DavHau/nix-pypi-fetcher-2
Re: does git store diff/patches
Let me refer to that mightiest of reference works, which has poisoned the search engines so magnificently, StackOverflow:
https://stackoverflow.com/questions/40617288/a-commit-in-git-is-it-a-snapshot-state-image-or-is-it-a-change-diff-patch-delta/65375440#65375440
Do we have to be able to read the data with nix, or 'just' with python for mach-nix?
@TyberiusPrime In dream2nix we are starting to use it as well, and we read it with nix.
@DavHau @TyberiusPrime Instead of rewriting all files, we could experiment with formats where existing dumps are unchanged and info for new versions ends up in new files.
Expressing diffs within the repo as files will impact evaluation performance, as nix then has to merge attribute sets. We could have one file per package name. That would minimize the git commit size significantly, I assume, because only a fraction of the packages on pypi are actively maintained; all the rest would rarely or never change.
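As a minimal sketch of what that read-time merge would mean (written in python; the nix side would be the analogous recursive attrset merge), assuming a hypothetical layout of one base dump per bucket plus per-day diff files:

import json
from pathlib import Path

# Hypothetical layout: <bucket>/base.json plus any number of diff-*.json files
# that each add or override individual package entries.
def load_bucket(bucket_dir: Path) -> dict:
    data = json.loads((bucket_dir / "base.json").read_text())
    for diff_file in sorted(bucket_dir.glob("diff-*.json")):
        # This merge is the extra work nix would have to do at evaluation time.
        for pkg, releases in json.loads(diff_file.read_text()).items():
            data.setdefault(pkg, {}).update(releases)
    return data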
Though there is also a cost to having many small files. Looking at all-cabal-json for example: there we have 3 files per package release, which is more than 500k files in total. Despite its tarball being 30% smaller than nix-pypi-fetcher's, it takes more than twice the time to unpack (I'm running this on a modern SSD):
grmpf@grmpf-nix /t/test> ls -lah
total 485M
...
-rw-r--r-- 1 grmpf users 194M Dec 8 23:19 all-cabal-json.tar.gz
-rw-r--r-- 1 grmpf users 291M Dec 8 23:17 nix-pypi-fetcher.tar.gz
grmpf@grmpf-nix /t/test> time tar xf all-cabal-json.tar.gz
________________________________________________________
Executed in 27.61 secs fish external
usr time 28.55 secs 0.00 millis 28.55 secs
sys time 9.78 secs 1.86 millis 9.77 secs
grmpf@grmpf-nix /t/test> time tar xf nix-pypi-fetcher.tar.gz
________________________________________________________
Executed in 12.45 secs fish external
usr time 12.19 secs 1.03 millis 12.19 secs
sys time 2.67 secs 1.01 millis 2.67 secs
grmpf@grmpf-nix /t/test> find all-cabal-json-hackage/ | wc -l
537384
grmpf@grmpf-nix /t/test> find nix-pypi-fetcher-master/ | wc -l
276
Pypi currently has 433,794 packages; therefore, having one file per project would result in unpack performance similar to all-cabal-json's.
Maybe a solution that combines the best of both worlds would be optimal, for example:
- reduce the update interval to once a day instead of twice a day.
- keep the current structure with 256 files
- for the daily updates, add individual files to the repo containing only changes
- once per month, aggregate the small files and add the information to the 256 large files (a rough sketch of this step is below)
With this we would still have regular updates, but the large files would only be touched every once in a while, which should reduce the total size of the repo history.
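A minimal sketch of that monthly aggregation, assuming the daily updates are committed as small JSON files keyed by bucket (all paths and the update-file format here are made up):

import json
from pathlib import Path

def aggregate(repo: Path) -> None:
    """Fold the small daily update files into the 256 bucket files, then drop them."""
    for update_file in sorted(repo.glob("updates/*.json")):
        updates = json.loads(update_file.read_text())   # e.g. {"a7": {pkg: releases, ...}, ...}
        for bucket_name, packages in updates.items():
            bucket_path = repo / "pypi" / f"{bucket_name}.json"
            bucket = json.loads(bucket_path.read_text())
            for pkg, releases in packages.items():
                bucket.setdefault(pkg, {}).update(releases)
            bucket_path.write_text(json.dumps(bucket, sort_keys=True, separators=(",", ":")))
        update_file.unlink()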
Sounds like a plan.
For lots of files, I'd suggest a prefix tree directory structure - my nix store gets troublesome when it exceeds a few 100k derivations.
@DavHau @TyberiusPrime How about:
- one file per package and release containing hashes for all its files - this should never change again
- use prefix tree directory structure as suggested by @TyberiusPrime
- don't unpack the full tarball, but only extract the exact file needed?
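The reading side of such a per-release layout could look roughly like this; the two-character prefix and the exact path scheme are placeholders, not something decided in this thread:

import json
from pathlib import Path

def release_file(root: Path, name: str, version: str) -> Path:
    """One immutable file per release, sharded by a short name prefix,
    e.g. <root>/re/requests/2.28.1.json (hypothetical scheme; a deeper
    prefix tree works the same way)."""
    name = name.lower()
    return root / name[:2] / name / f"{version}.json"

def load_release(root: Path, name: str, version: str) -> dict:
    return json.loads(release_file(root, name, version).read_text())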
What queries do we have against the nix-pypi-fetcher db?
Usually we know the list of packages that we want to query upfront, so this could work in principle.
The problem with that is that nix itself does not have an operation for extracting a single file out of a tarball. We would have to use an IFD (import from derivation) for that, which would impact performance and have some other unwanted side effects.
Another way to implement this could be to create a new FOD fetcher which takes the nix-pypi-fetcher.tar.gz and the metadata for the requested package as inputs, extracts only that information inside the FOD, and then fetches it. That avoids any use of IFD for this operation and should still be efficient. We might want to use a .zip file instead of the tarball, since tarballs don't have an index and random access is more expensive than with .zip.
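Just to illustrate the random-access argument (the FOD itself would of course be a nix derivation; this only shows why a zip index helps, and the archive and member names are placeholders):

import json
import zipfile

# A zip archive has a central directory, so a single member can be read without
# decompressing everything; a .tar.gz has to be streamed until the member is found.
def read_member(archive: str, member: str) -> dict:
    with zipfile.ZipFile(archive) as zf:        # e.g. "nix-pypi-fetcher.zip" (placeholder)
        with zf.open(member) as f:              # e.g. "pypi/a7.json" (placeholder)
            return json.load(f)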
Oops, that won't work, because we need to know the hash upfront, before constructing the FOD :/
Ok, so the naive implementation of one file per package in a prefix tree, named 'package.json', more than doubles the size of the zip file, and the unpack time goes from 5 seconds to 45 (on tmpfs / ramdisk).
Using 'ckage' as the file name makes no difference to the zip size.
A tar with those files is only 97M. I think tar is able to compress across files (gzip runs over the whole archive stream), while zip only compresses within each file (= not at all in this case). The tarball unpacks in 25 seconds.
But just for storage space in the git repo, we might gzip the individual 256 files we have right now? That'd be transparent for python; not sure about nixlang. We'd gain a factor of 2.
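On the python side that really is transparent; a sketch (the bucket path is a placeholder):

import gzip
import json

# gzip.open in text mode behaves like a plain file object, so the consumer
# only changes the open() call.
with gzip.open("pypi/a7.json.gz", "rt") as f:   # placeholder path
    bucket = json.load(f)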
The only way I see to avoid fetching the complete database is to turn it into a multi-level thing... the first level (the hash the user enters) provides all the FOD hashes for the second level, which holds the actual data.
Sharding would take some creativity though.
Just the 1st+2nd letter causes very differently sized blocks - min 1K, max 26M (way too many packages starting with 'py' ;)).
Edit: Sharding by python version: Probably not worth it.
But replacing '{"27": {...}, "36": "27", "37": "27", "38": "27", "39": "27", "310": "27"}' with '{"any": {...contents of "27"...}}' might be? It seems to apply in about 97% of the relevant cases. (I looked for 'all but one value is a string', not for equality to "27".)
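A sketch of that dedup, assuming entries look like the example above (a dict keyed by python version where most versions just reference another version's entry by name):

def dedup_python_versions(entry: dict) -> dict:
    """Collapse {'27': {...}, '36': '27', '37': '27', ...} into {'any': {...}}
    when every alias points at the single real entry."""
    real = {ver: val for ver, val in entry.items() if not isinstance(val, str)}
    refs = {val for val in entry.values() if isinstance(val, str)}
    if len(real) == 1 and refs <= set(real):
        return {"any": next(iter(real.values()))}
    return entry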
How about we don't use file content at all, but encode everything in the file names?
I like this idea!
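Purely as an illustration of what 'encode everything in the file names' could mean (the separator, field order, and prefix scheme here are invented): empty files whose names carry the release data, so a consumer only needs a directory listing.

from pathlib import Path

# Hypothetical encoding: <root>/<prefix>/<package>/<version>!<sha256>!<filename>
# The files stay empty; all information lives in the path itself.
def list_releases(root: Path, package: str) -> list[dict]:
    releases = []
    for entry in (root / package[:2] / package).iterdir():
        version, sha256, filename = entry.name.split("!", 2)
        releases.append({"version": version, "sha256": sha256, "filename": filename})
    return releases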
An alternative: have one repo (e.g. a flake) that keeps track of the state, and then some number n of content repos containing the data of interest. Every week or so a new content repo is created. The state repo always knows which content repo and which revision it needs to refer to.
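One way that could look (the repo names, the file name state.json, and its fields are all hypothetical): the state repo carries one small, frequently updated file that pins each content repo to a revision and hash.

import json

# Hypothetical shape of the state file in the tracking repo.
example_state = {
    "content_repos": [
        {"repo": "nix-pypi-fetcher-data-2022-49", "rev": "<git rev>", "sha256": "<hash>"},
        {"repo": "nix-pypi-fetcher-data-2022-50", "rev": "<git rev>", "sha256": "<hash>"},
    ]
}

def repos_to_fetch(state_path: str) -> list[dict]:
    """Return the content repos (with pinned revisions) that the state repo points at."""
    with open(state_path) as f:
        return json.load(f)["content_repos"]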