Dependency Graph
If you upload all the metadata, or just the dependency information, in some easy-to-use format like XML, JSON, or even a full MySQL dump, I can implement a dependency graph and thus answer your blog post questions. I can implement an adaptation of PageRank or a similar algorithm to find the impact factor of packages.
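To make that concrete, here is a minimal sketch of the kind of PageRank adaptation I have in mind, assuming the data ends up as a plain dict mapping each package to its dependencies (all names and constants here are just placeholders):

```python
# Minimal PageRank-style sketch over a package dependency graph.
# Simplified on purpose: dangling packages leak rank mass, and the
# damping factor / iteration count are just common defaults.
def impact_rank(deps, damping=0.85, iterations=50):
    packages = set(deps) | {d for ds in deps.values() for d in ds}
    n = len(packages)
    rank = {p: 1.0 / n for p in packages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in packages}
        for pkg, ds in deps.items():
            if not ds:
                continue
            # a package "votes" for everything it depends on, so
            # heavily depended-on packages accumulate impact
            share = damping * rank[pkg] / len(ds)
            for d in ds:
                new_rank[d] += share
        rank = new_rank
    return rank

# toy example: numpy should come out on top
deps = {"pandas": ["numpy"], "scipy": ["numpy"], "numpy": []}
print(sorted(impact_rank(deps).items(), key=lambda x: -x[1]))
```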
@Fazel94 Thank you for offering your help. Could you please tell me which blog article you are referring to and which data I should upload?
Sorry, here is the post I'm talking about: http://martin-thoma.com/analyzing-pypi-metadata/
I would be glad to mine the PyPI data, but it would be nice if I could get around scraping PyPI myself. If it is not a burden for you, a formatted database or a serialized version of the metadata, especially the dependency list for each package, would let me build a dependency graph and maybe do a little frequent itemset counting to extract which packages people use together.
Thank you for your attention.
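P.S. By frequent itemset counting I mean, at its simplest, something like the following rough sketch, again assuming a package-to-dependencies mapping (a cheap stand-in for a real algorithm like Apriori):

```python
# Count how often two packages are required together by the same
# project; the most common pairs show which packages co-occur.
from collections import Counter
from itertools import combinations

def co_usage(deps):
    pair_counts = Counter()
    for ds in deps.values():
        # every unordered pair of one project's dependencies
        for pair in combinations(sorted(set(ds)), 2):
            pair_counts[pair] += 1
    return pair_counts

deps = {
    "app1": ["requests", "six"],
    "app2": ["requests", "six", "numpy"],
    "app3": ["numpy", "scipy"],
}
print(co_usage(deps).most_common(3))
```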
> especially the dependency list for each package
There is no such thing as a dependency list for each package in the PyPI metadata. You could only download all the packages (completely), look for a `requirements.txt`, and read that.
I can upload the data. However, it is quite a lot. I'm currently running the script again. The scripts beginning with "c" are currently running, and even a 7z-compressed CSV version of the `packages` table is about 3 MB.
Would that still be of use to you? If you really want to build the dependency graph, you have to download quite a massive amount of data. Estimating with the query
`SELECT sum(size)/1000000000 FROM urls`
it is currently about 3.3 GB. I can give you a better approximation tomorrow.
Where should I upload it?
Currently it is at pyromancer and 16.35 GB.
I've added a script to check for imports in a package.
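The core idea is roughly the following; this is a simplified sketch, not the exact code from the repository:

```python
# Simplified sketch: walk a package's Python files and collect the
# top-level module name of every import statement.
import ast
import os

def get_imports(package_dir):
    imports = set()
    for root, _, files in os.walk(package_dir):
        for name in files:
            if not name.endswith(".py"):
                continue
            with open(os.path.join(root, name), "rb") as f:
                try:
                    tree = ast.parse(f.read())
                except SyntaxError:
                    continue  # e.g. file for the other Python major version
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    imports.update(a.name.split(".")[0] for a in node.names)
                elif isinstance(node, ast.ImportFrom) and node.module:
                    imports.add(node.module.split(".")[0])
    return imports
```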
TODOs are:
- apply that script to the latest versions of all packages on PyPI
- analyze the `setup.py`
Done:
- download the Python package
- extract it
- get the Python files
- insert the gathered data into the database
- (add a new table to the database for dependencies)
Ok, I've just put some more work into it:
- Download most of the metadata here: https://www.dropbox.com/s/dzqk3rrqzpgmp58/export.7z?dl=0 (2015-12-06 - about 54 MB in 7z compressed format)
- All packages combined are about 24.5 GB (probably compressed)
If you really want to make the dependency graph, you still have to:
- implement `get_setup_packages` in `package_analysis.py` (see the sketch after this list)
- run `./package_analysis` for all latest releases
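A rough idea of how `get_setup_packages` could look, parsing `install_requires` out of a `setup.py` with `ast` instead of executing it (only a sketch; real `setup.py` files are much messier):

```python
# Sketch: statically pull install_requires out of a setup.py without
# executing it. Only handles the simple case of a literal list.
import ast

def get_setup_packages(setup_py_path):
    with open(setup_py_path) as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if kw.arg == "install_requires":
                    try:
                        return ast.literal_eval(kw.value)
                    except ValueError:
                        return []  # not a plain literal, e.g. a variable
    return []
```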
This will fill your database with all possible dependencies. Even if you don't implement `get_setup_packages`, it will probably add all dependencies. However, even with a VERY good internet connection, I expect that this will take several days to run. One could parallelize the download of the packages, but that would still need many hours.
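If one wanted to parallelize it, a minimal sketch could look like this; `process_package` is a hypothetical placeholder for the download + extract + analyze step:

```python
# Sketch: run the per-package work in a small thread pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_package(url):
    ...  # hypothetical: download the release, extract it, run the analysis

def run_all(urls, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_package, url): url for url in urls}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                print("failed: %s (%s)" % (futures[future], exc))
```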
@Fazel94 I've just made the script run over the complete PyPI database. That will take quite a while. And it currently ignores setuptools, which is a major issue (but it was too complicated to make a secure / fast implementation within just a couple of hours - you could add that, if you want).
How would you like to visualize the graph? It has 67582 nodes and a lot more than 4600 edges (I'm just downloading / building the graph... it takes a while). You cannot use Graphviz for that.
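One option I would consider is exporting the graph for Gephi instead of rendering it directly; a sketch with networkx, assuming an edge list of (package, dependency) pairs:

```python
# Sketch: build the graph with networkx and export it as GEXF,
# which Gephi can lay out interactively; Graphviz cannot handle
# a graph of this size.
import networkx as nx

def export_graph(edges, path="pypi-dependencies.gexf"):
    graph = nx.DiGraph()
    graph.add_edges_from(edges)  # edges: iterable of (package, dependency)
    nx.write_gexf(graph, path)

export_graph([("pandas", "numpy"), ("scipy", "numpy")])
```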
(By the way, do we know each other? Are you a student from KIT, too?)
By now, the most imported module is `os`, followed (not even close) by `sys`, `logging`, `re`... and `org`. I guess that is an error? I have no idea where that comes from.