
relaunch nix-pypi-fetcher

Open DavHau opened this issue 2 years ago • 13 comments

The maximum size quota of the repo is reached: https://github.com/DavHau/nix-pypi-fetcher/actions/runs/3556707373/jobs/5974260270#step:4:3180

Solve this by:

  • creating a new repo: nix-pypi-fetcher-2
  • creating a commit on the old repo which throws an error when the default.nix is imported

DavHau avatar Nov 27 '22 08:11 DavHau

I'd like to point out that at the current rate (one instance is about 200 MB, twice a day, so over 400 MB/day), and with git having essentially no 'diff'-based sharing between these revisions, we're going to be here again in less than 250 days.

TyberiusPrime avatar Dec 08 '22 12:12 TyberiusPrime

Why is there no diff-based sharing? I thought git only stores diffs and should therefore be more efficient than that.

DavHau avatar Dec 08 '22 13:12 DavHau

BTW, today I put up https://github.com/DavHau/nix-pypi-fetcher-2

DavHau avatar Dec 08 '22 13:12 DavHau

Re: does git store diff/patches

Let me refer to that mightiest of reference works, which has poisoned the search engines so magnificently, StackOverflow:

https://stackoverflow.com/questions/40617288/a-commit-in-git-is-it-a-snapshot-state-image-or-is-it-a-change-diff-patch-delta/65375440#65375440

Do we have to be able to read the data with nix, or 'just' with python for mach-nix?

TyberiusPrime avatar Dec 08 '22 13:12 TyberiusPrime

@TyberiusPrime In dream2nix we are starting to use it as well, and we read the data with nix.

@DavHau @TyberiusPrime Instead of rewriting all files, we could experiment with formats where existing dumps are unchanged and info for new versions ends up in new files.
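
For example (just a sketch with made-up file names, nothing that exists today): dated delta files that each only add new versions, merged by the reader in order, so older files never have to be rewritten:

import json
from pathlib import Path

def load_merged_index(delta_dir: str = "pypi-deltas") -> dict:
    # Merge dated delta files (oldest first) into one package index.
    # Each delta file is assumed to map package name -> version -> sha256;
    # later files only add new versions, existing files are never touched again.
    index: dict = {}
    for path in sorted(Path(delta_dir).glob("*.json")):
        for name, versions in json.loads(path.read_text()).items():
            index.setdefault(name, {}).update(versions)
    return index

# e.g. load_merged_index()["requests"]["2.28.1"]  (hypothetical content)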

chaoflow avatar Dec 08 '22 15:12 chaoflow

Expressing diffs within the repo as files will impact evaluation performance, as nix then has to merge attribute sets. We could instead have one file per package name. I assume that would reduce the git commit size significantly, because only a fraction of the packages on PyPI are actively maintained; all the rest would rarely or never change.

Though there is also a cost to having many small files. Looking at all-cabal-json for example: there we have 3 files per package release, more than 500k files in total. Despite its tarball being 30% smaller than nix-pypi-fetcher's, it takes more than twice as long to unpack (I'm running this on a modern SSD):

grmpf@grmpf-nix /t/test> ls -lah
total 485M
...
-rw-r--r--  1 grmpf users 194M Dec  8 23:19 all-cabal-json.tar.gz
-rw-r--r--  1 grmpf users 291M Dec  8 23:17 nix-pypi-fetcher.tar.gz

grmpf@grmpf-nix /t/test> time tar xf all-cabal-json.tar.gz
________________________________________________________
Executed in   27.61 secs    fish           external
   usr time   28.55 secs    0.00 millis   28.55 secs
   sys time    9.78 secs    1.86 millis    9.77 secs

grmpf@grmpf-nix /t/test> time tar xf nix-pypi-fetcher.tar.gz
________________________________________________________
Executed in   12.45 secs    fish           external
   usr time   12.19 secs    1.03 millis   12.19 secs
   sys time    2.67 secs    1.01 millis    2.67 secs

grmpf@grmpf-nix /t/test> find all-cabal-json-hackage/ | wc -l
537384

grmpf@grmpf-nix /t/test> find nix-pypi-fetcher-master/ | wc -l
276

PyPI currently has 433,794 packages, so having one file per project would result in unpack performance similar to all-cabal-json's.

Maybe a solution that combines the best of both worlds would be optimal, for example:

  • Reduce the update interval to once a day instead of twice a day.
  • Keep the current structure with 256 files.
  • For the daily updates, add individual files to the repo containing only the changes.
  • Once per month, aggregate the small files and fold their information into the 256 large files.

With this we would still have regular updates, but the large files would only be touched every once in a while, which should reduce the total size of the repo history.
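
Roughly, the monthly aggregation step could look something like this (hypothetical paths and a made-up shard key; the real repo layout may differ):

import hashlib
import json
from pathlib import Path

UPDATES_DIR = Path("updates")  # hypothetical: small daily delta files
SHARDS_DIR = Path("pypi")      # hypothetical: the 256 large shard files

def shard_of(name: str) -> str:
    # Made-up bucketing: first two hex chars of the name's sha256.
    return hashlib.sha256(name.encode()).hexdigest()[:2]

def aggregate_monthly() -> None:
    # Group all daily deltas by target shard first, assuming each delta
    # file maps package name -> version -> hash ...
    per_shard: dict = {}
    delta_files = sorted(UPDATES_DIR.glob("*.json"))
    for delta_file in delta_files:
        for name, versions in json.loads(delta_file.read_text()).items():
            per_shard.setdefault(shard_of(name), {}).setdefault(name, {}).update(versions)
    # ... so each of the 256 large files is rewritten at most once per month.
    for shard_key, packages in per_shard.items():
        shard_path = SHARDS_DIR / f"{shard_key}.json"
        shard = json.loads(shard_path.read_text()) if shard_path.exists() else {}
        for name, versions in packages.items():
            shard.setdefault(name, {}).update(versions)
        shard_path.write_text(json.dumps(shard, sort_keys=True))
    # The small files have done their job and can be dropped from the tree.
    for delta_file in delta_files:
        delta_file.unlink()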

DavHau avatar Dec 08 '22 16:12 DavHau

Sounds like a plan.

For lots of files, I'd suggest a prefix-tree directory structure - my nix store gets troublesome once it exceeds a few hundred thousand derivations.
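
For illustration, such a prefix tree could be as simple as this (hypothetical layout, two levels of two characters each):

from pathlib import Path

def prefix_path(name: str, root: str = "pkgs") -> Path:
    # Two levels of two characters each, padded for very short names,
    # e.g. 'requests' -> pkgs/re/qu/requests.json, 'py' -> pkgs/py/__/py.json
    name = name.lower()
    return Path(root) / name[:2].ljust(2, "_") / name[2:4].ljust(2, "_") / f"{name}.json"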

TyberiusPrime avatar Dec 08 '22 17:12 TyberiusPrime

@DavHau @TyberiusPrime How about:

  • one file per package and release containing hashes for all its files - this should never change again
  • use a prefix-tree directory structure, as suggested by @TyberiusPrime
  • don't unpack the full tarball, but only extract the exact file needed?

What queries do we have against the nix-pypi-fetcher db?

chaoflow avatar Dec 08 '22 18:12 chaoflow

What queries do we have against the nix-pypi-fetcher db?

Usually we know the list of packages that we want to query upfront, so this could work in principle.

The problem with that is that nix itself has no operation for extracting a single file from a tarball. We would have to use an IFD for that, which would impact performance and has some other unwanted side effects.

Another way to implement this could be to create a new FOD fetcher which takes the nix-pypi-fetcher.tar.gz and the metadata for the requested package as inputs, extracts only that information within the FOD, and fetches it. That avoids any use of IFD for the operation and should still be efficient. We might want to use a .zip file instead of the tarball, as tarballs have no index, so random access is more expensive than with .zip.
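
Just to illustrate the random-access point on the python side (this is not the FOD fetcher itself, and the member path is made up):

import json
import zipfile

def read_single_entry(archive: str, member: str) -> dict:
    # The zip central directory lets us locate and decompress exactly one
    # member; a .tar.gz would have to be streamed from the beginning.
    with zipfile.ZipFile(archive) as zf:
        with zf.open(member) as f:
            return json.load(f)

# e.g. read_single_entry("nix-pypi-fetcher.zip", "pypi/ab.json")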

DavHau avatar Dec 09 '22 04:12 DavHau

Another way to implement this could be to create a new FOD fetcher which takes the nix-pypi-fetcher.tar.gz and the metadata for the requested package as inputs, extracts only that information within the FOD, and fetches it. That avoids any use of IFD for the operation and should still be efficient. We might want to use a .zip file instead of the tarball, as tarballs have no index, so random access is more expensive than with .zip.

Oops, that won't work, because we need to know the hash upfront, before constructing the FOD :/

DavHau avatar Dec 09 '22 04:12 DavHau

Ok, so the naive implementation of one file per package in a prefix tree, named 'package.json', more than doubles the size of the zip file, and the unpack time goes from 5 seconds to 45 (on tmpfs / ramdisk).

Using 'ckage' as the file name makes no difference to the zip size.

A tar with those files is only 97M. I think a compressed tar can compress across files (the whole stream is gzipped), while zip only compresses each file individually (= not at all in this case). The tarball unpacks in 25 seconds.

But just for storage space in the git repo, we might gzip the 256 individual files we have right now? That would be transparent for python; I'm not sure about nixlang. We'd gain a factor of 2.
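
For the python side it really is just swapping the open call, e.g. (whether plain nix evaluation could read the gzipped files without IFD is exactly the open question):

import gzip
import json

def load_shard(path: str) -> dict:
    # Reads a shard file whether it is gzipped or plain JSON.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# load_shard("pypi/ab.json.gz") and load_shard("pypi/ab.json") behave the same.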

The only way I see around 'not fetching the complete database' is to turn it into a multi-level thing... the first level (the hash the user enters) provides all the FOD hashes for the second level, which holds the actual data.

Sharding would take some creativity though.

Sharding by just the 1st+2nd letters produces very differently sized blocks - min 1K, max 26M (way too many packages starting with 'py' ;)).
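
That imbalance is easy to measure from the current 256 files (hypothetical directory name; this just sums the serialized size of each package's entries per two-letter prefix):

import json
from collections import defaultdict
from pathlib import Path

def prefix_sizes(shards_dir: str = "pypi") -> dict:
    sizes = defaultdict(int)
    for shard in Path(shards_dir).glob("*.json"):
        for name, versions in json.loads(shard.read_text()).items():
            # Approximate the block size a 2-letter shard would end up with.
            sizes[name[:2].lower()] += len(json.dumps(versions))
    return dict(sizes)

# sorted(prefix_sizes().items(), key=lambda kv: kv[1])[-5:]  -> 'py' and friends dominate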

Edit: Sharding by python version: Probably not worth it.

But replacing {"27": {...}, "36": "27", "37": "27", "38": "27", "39": "27", "310": "27"} with {"any": {...contents of "27"...}} might be? It seems to apply in about 97% of the relevant cases. (I looked for 'all but one entry is a string', not for equality with "27".)
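
A compaction pass along those lines could look like this (assuming the per-release structure really is a dict keyed by python version where all but one entry are string references - that assumption is exactly what the 97% check above looked for):

def compact_release(release: dict) -> dict:
    # Collapse {"27": {...}, "36": "27", "37": "27", ...} into {"any": {...}}
    # when there is exactly one real entry and every other entry is a string
    # reference pointing at it.
    real = {k: v for k, v in release.items() if not isinstance(v, str)}
    refs = [v for v in release.values() if isinstance(v, str)]
    if len(real) == 1 and all(r in real for r in refs):
        return {"any": next(iter(real.values()))}
    return release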

TyberiusPrime avatar Dec 09 '22 08:12 TyberiusPrime

How about we don't use file content at all, but encode everything in the file names?
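
Purely to illustrate the idea (entirely made-up naming scheme): name, version and sha256 all live in the path, the file content stays empty, so updates only ever add new file names:

from pathlib import Path

def entry_path(name: str, version: str, sha256: str, root: str = "index") -> Path:
    # All information lives in the file name; the file itself stays empty.
    return Path(root) / name[:2] / name / f"{version}--{sha256}"

def add_entry(name: str, version: str, sha256: str) -> None:
    path = entry_path(name, version, sha256)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()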

chaoflow avatar Dec 09 '22 17:12 chaoflow

How about we don't use file content at all, but encode everything in the file names?

I like this idea!

An alternative: have one repo (e.g. a flake) that keeps track of the state, and then some number n of content repos containing the data of interest. Every week or so a new content repo is created. The state repo always knows which content repo it needs to refer to, and at which revision.
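
One possible shape for that, with made-up names and format: the state repo carries a small pointer file that consumers resolve to the right content repo revision:

import json

# Hypothetical state.json in the state repo:
# {"current": {"repo": "DavHau/nix-pypi-fetcher-data-2023-05",
#              "rev": "abc123...", "sha256": "..."}}

def resolve_data_tarball(state_path: str = "state.json") -> str:
    # Returns the GitHub tarball URL of the content repo revision the
    # state repo currently points at.
    with open(state_path) as f:
        current = json.load(f)["current"]
    return f"https://github.com/{current['repo']}/archive/{current['rev']}.tar.gz"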

FRidh avatar Jan 31 '23 09:01 FRidh