Full package list as a big JSON file
What's this?
This is an implementation of my suggestion in an earlier email:
I'm wondering if you'd be willing to provide an export of all packages on Packagist, together with basic information (description, downloads, etc., basically everything but versions), as a static file updated every 24h for example?
Background: I have in mind to create a website that lists the top PHP packages, fastest growing packages, etc.
The only way I found using the current API is to get the list of packages every day: https://packagist.org/packages/list.json
Then get info about each package individually: https://packagist.org/packages/monolog/monolog.json
This translates into roughly 300,000 requests per day, and it's an awful lot of stress on your server, especially considering I'm probably not the only one to do this.
Any chance you'd consider that? I could probably help with a PR, although it'd probably take me some time to dig into the source code.
What does it do?
The packagist:dump-full command generates a new JSON file in the web root: packages-full.json. It is designed to be run once a day.
The output is a big JSON array, where each element is a JSON object similar to the output of /packages/{name}.json, with the following differences:
- pretty-printed
- no versions key
- the abandoned key is always set
Once deployed on packagist.org, I expect the output file to be ~220 MB. The PackageDumper is designed to use little memory, so creating the file should not be a problem. Client applications that want to read it, though, will probably need a JSON streaming library such as salsify/json-streaming-parser.
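To illustrate the low-memory approach, here is a sketch of the idea only, not the actual PackageDumper code; the $packages generator is a stand-in for whatever iterates packages from the database. Each package is encoded and written one at a time, so the full document is never held in memory:

<?php

// Sketch only, not the actual PackageDumper implementation: encode one package
// at a time so the ~220 MB document never has to be held in memory at once.
function dumpFull(iterable $packages, string $path): void
{
    $out = fopen($path, 'w');
    fwrite($out, '[');
    $first = true;
    foreach ($packages as $package) {
        if (!$first) {
            fwrite($out, ',');
        }
        $first = false;
        // Same shape as /packages/{name}.json, minus the "versions" key,
        // with "abandoned" always present.
        fwrite($out, json_encode($package, JSON_PRETTY_PRINT));
    }
    fwrite($out, ']');
    fclose($out);
}

// Stand-in data for illustration; the real source would be the database.
$packages = (function () {
    yield [
        'name' => 'guzzlehttp/psr7',
        'description' => 'PSR-7 message implementation that also provides common utility methods',
        'abandoned' => false,
    ];
})();

dumpFull($packages, 'packages-full.json');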
Sample output
[{
"name": "guzzlehttp\/psr7",
"description": "PSR-7 message implementation that also provides common utility methods",
"time": "2015-03-05T23:21:09+01:00",
...
},{
...
}]
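As a rough illustration of why streaming matters on the client side: a real consumer would more likely use salsify/json-streaming-parser as mentioned above, but this hand-rolled sketch shows the file can be read package by package, tracking brace depth to pull out one top-level object at a time:

<?php

// Naive illustration only; a proper streaming parser is the better choice.
// Reads packages-full.json chunk by chunk and yields one package at a time.
function streamPackages(string $path): \Generator
{
    $in = fopen($path, 'r');
    $depth = 0;          // current nesting depth of {...}
    $inString = false;   // are we inside a JSON string?
    $escaped = false;    // was the previous char a backslash inside a string?
    $buffer = '';

    while (($chunk = fread($in, 8192)) !== false && $chunk !== '') {
        $len = strlen($chunk);
        for ($i = 0; $i < $len; $i++) {
            $c = $chunk[$i];
            if ($depth > 0) {
                $buffer .= $c;
            }
            if ($inString) {
                if ($escaped) {
                    $escaped = false;
                } elseif ($c === '\\') {
                    $escaped = true;
                } elseif ($c === '"') {
                    $inString = false;
                }
                continue;
            }
            if ($c === '"') {
                $inString = true;
            } elseif ($c === '{') {
                if ($depth === 0) {
                    $buffer = '{'; // start of a top-level package object
                }
                $depth++;
            } elseif ($c === '}') {
                $depth--;
                if ($depth === 0) {
                    yield json_decode($buffer, true);
                    $buffer = '';
                }
            }
        }
    }
    fclose($in);
}

foreach (streamPackages('packages-full.json') as $package) {
    echo $package['name'], "\n";
}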
Other ideas I've considered:
- creating a zip file of all the web/(p|p2)/{vendor}/{package}.json files; this would work, but it's bloated with versions, and somehow I like a single file for all packages
- creating a big XML file instead of a big JSON file; XML is easily streamable, but since there is also a JSON streaming library available for PHP, I don't think it brings much value to use a different output format when Packagist only uses JSON for now
Questions
- the dump is currently performed as part of the packagist:dump command; is this command run every 24 hours? If not, this new dump would have to be extracted to its own command to be run exactly once a day (Update: now extracted to its own command.)
- should we gzip the file and only offer a gz version on the web? Or both a .json and a .json.gz?
Looking forward to your comments!
- the dump is currently performed as part of the packagist:dump command; is this command run every 24 hours? If not, this new dump would have to be extracted to its own command to be run exactly once a day
AFAIK, the packagist:dump command runs in a cron every 5 minutes or so.
AFAIK, the packagist:dump command runs in a cron every 5 minutes or so.
Thanks. It would be nice to add this information to the docblock of each command class.
I guess I need to create a new command then, to run the new PackageDumper. But I'm afraid of poor naming, as packagist:dump would have been the obvious choice. What about packagist:dump-full?
I'm wondering whether it actually makes sense to create such a huge JSON file (which cannot be processed in a non-streaming way) compared to dumping only a list of package names, and then relying on the composer repository files (the /p2/* endpoints rather than the /packages/* endpoint). The composer repository files are served as static files (meaning that they put less load on the server) and are much more cacheable (a package which is not updated won't invalidate the cache of that file).
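For what it's worth, a rough sketch of that flow could look like the following; the URL patterns (repo.packagist.org and the p2 path) and response keys are assumptions based on the endpoints mentioned in this thread and should be checked against the Packagist API docs:

<?php

// Sketch of the "package name list + static composer repository files" approach.
// URLs and response keys below are assumptions to verify against the API docs.

// 1. One request for the full list of package names.
$list = json_decode(file_get_contents('https://packagist.org/packages/list.json'), true);

// 2. Per-package metadata from the static, heavily cacheable p2 files.
foreach (array_slice($list['packageNames'], 0, 3) as $name) { // only a few, for illustration
    $meta = json_decode(file_get_contents("https://repo.packagist.org/p2/{$name}.json"), true);

    // Version lists in p2 files are "minified" (each entry only lists fields that
    // changed from the previous version); composer/metadata-minifier can expand them.
    $versions = $meta['packages'][$name] ?? [];
    echo $name, ': ', count($versions), " version entries\n";
}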
I even have a crazy idea: maybe your tool could be maintaining a local mirror of the packagist metadata with the official mirroring tool and then read what it needs from it (listing packages would then be a matter of enumerating the filesystem). That might be more efficient than reading the whole metadata from packagist over and over again, as the mirroring tool will sync only updated packages.
Side note: how much of the size of the file is due to pretty-printing?
I'm wondering whether it actually makes sense to create such a huge JSON file (which cannot be processed in a non-streaming way) compared to dumping only a list of package names, and then relying on the composer repository files (the /p2/* endpoints rather than the /packages/* endpoint). The composer repository files are served as static files (meaning that they put less load on the server) and are much more cacheable (a package which is not updated won't invalidate the cache of that file)
The issue is still the same: issuing 300,000 HTTP requests per day (unless I misunderstood you?). Even if these requests are lightweight, each service needing a daily sync adds roughly 3.5 requests per second on average to the server, and probably more in practice, as the requests may not be spread over the whole day.
I even have a crazy idea: maybe your tool could be maintaining a local mirror of the packagist metadata with the official mirroring tool and then read what it needs from it (listing packages would then be a matter of enumerating the filesystem). That might be more efficient than reading the whole metadata from packagist over and over again, as the mirroring tool will sync only updated packages.
I'm not familiar with the mirror system, but will have a look, thanks.
Side note: how much of the size of the file is due to pretty-printing?
Roughly +35%: uglified, it goes from ~220 MB down to ~160 MB. And once compressed, the relative difference would be smaller.
This won't change the need for streaming anyway, and I personally prefer to be able to understand the JSON just by looking at it. A lot of successful JSON REST APIs always pretty-print, and it's good for DX. Right now, with the Composer API, I have to use a web JSON prettifier tool to get a glimpse of it. That said, if you really don't want pretty-printing, I'll be fine with that as well.
I'm not entirely sure what to do with this tbh.
It seems like adding a huge nightly job for not so much benefit, as I doubt we'll get everyone to use this vs the API.
Maybe I need to try it out and see how long it takes to run..
Anyway, if you are interested in more or less real-time updates of package metadata, I have now documented the changes API here: https://packagist.org/apidoc#track-package-updates - it lets you refresh only when something changes, which means far fewer than 300K HTTP requests per day, probably in the range of 10K updates per day, and you get fresher data, so it seems like a win compared to a nightly complete sync. Have you looked at that already?
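For reference, a polling sketch against that endpoint; the parameter and field names ("since", "actions", "timestamp", "type", "package") are recalled from the apidoc linked above and should be double-checked there:

<?php

// Rough sketch of polling https://packagist.org/metadata/changes.json as described
// in the apidoc linked above. Parameter and field names are assumptions to verify.

$since = 0; // obtain the initial timestamp as described in the apidoc, then persist it

$response = json_decode(
    file_get_contents("https://packagist.org/metadata/changes.json?since={$since}"),
    true
);

foreach ($response['actions'] ?? [] as $action) {
    // Only packages that actually changed need to be re-fetched.
    printf("%s %s\n", $action['type'], $action['package']);
}

// Store the returned timestamp and use it as "since" for the next poll.
$since = $response['timestamp'] ?? $since;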
Reading back your original post, I see you want trending packages, so you may be interested in download counts; for those you would need to use the API and not the metadata, which would still require a periodic refresh of all packages.
Closing here for now; I can still reopen if needed, as per my last comments above.