
Dependency Graph

Open Fazel94 opened this issue 8 years ago • 6 comments

If you upload all the metadata, or just the dependencies, in some easy-to-use format like XML, JSON, or even a full MySQL DB dump, I can implement a dependency graph and thus answer the questions from your blog post. I could implement an adaptation of PageRank or a similar algorithm to find the impact factor of packages.
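
For illustration, a minimal sketch of the PageRank idea (assuming networkx; the edge list is made up):

```python
import networkx as nx

# Made-up edge list: an edge (a, b) means package a depends on package b,
# so heavily-depended-upon packages collect rank.
edges = [("flask", "werkzeug"), ("flask", "jinja2"), ("requests", "urllib3")]
graph = nx.DiGraph(edges)

# PageRank as a rough "impact factor" for each package.
impact = nx.pagerank(graph, alpha=0.85)

for package, score in sorted(impact.items(), key=lambda kv: -kv[1]):
    print(f"{package}: {score:.3f}")
```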

Fazel94 avatar Dec 06 '15 14:12 Fazel94

@Fazel94 Thank you for offering your help. Could you please tell me which blog article you refer to and which data I should upload?

MartinThoma avatar Dec 06 '15 14:12 MartinThoma

Sorry, here is the post I'm talking about: http://martin-thoma.com/analyzing-pypi-metadata/

I would be glad to mine the PyPI data, but it would be a great help if I didn't have to scrape PyPI myself. I mean a formatted database or a serialized version of the metadata (as long as that is not a burden for you), especially the dependency list for each package, so I can build a dependency graph from it and maybe do a little frequent-itemset counting to find out which packages people use together.
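
The frequent-itemset part could start as simply as counting dependency pairs (a sketch with made-up data):

```python
from collections import Counter
from itertools import combinations

# Made-up input: for each package, the set of packages it depends on.
dependency_sets = [
    {"numpy", "scipy", "matplotlib"},
    {"numpy", "pandas"},
    {"numpy", "scipy"},
]

# Count how often each pair of dependencies occurs together - the
# simplest form of frequent-itemset counting.
pair_counts = Counter()
for deps in dependency_sets:
    pair_counts.update(combinations(sorted(deps), 2))

for pair, count in pair_counts.most_common(5):
    print(pair, count)
```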

Thank you for your attention.

Fazel94 avatar Dec 06 '15 15:12 Fazel94

> especially the dependency list for each package

There is no such thing as a dependency list for each package in the PyPI metadata. You could only download all the packages (completely), look for a requirements.txt, and read that.

I can upload the data. However, it is quite a lot. I'm currently running the script again. The scripts beginning with "c" are currently running, and even a 7z-compressed CSV version of the packages table is about 3 MB.

Would that still be of use to you? If you really want to build the dependency graph, you have to download quite a massive amount of data. Estimating with the query

SELECT sum(size)/1000000000 FROM `urls`

it is currently about 3.3 GB (size is stored in bytes, so dividing by 10^9 gives GB). I can give you a better approximation tomorrow.

Where should I upload it?

MartinThoma avatar Dec 06 '15 22:12 MartinThoma

Currently the script is at the package pyromancer, and the downloaded data is at 16.35 GB.

I've added a script to check for imports in a package (roughly sketched below, after the lists).

TODOs are:

  • apply that script to the latest versions of all packages in PyPI
  • analyze the setup.py

Done:

  • download the Python package
  • extract it
  • get the python files
  • insert the gathered data into the database
  • (add a new table to the database for dependencies)
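
Roughly, the import check works like this (a simplified sketch, not the exact script):

```python
import ast
import os

def find_imports(package_dir):
    """Collect the top-level module names imported by the .py files
    in an extracted package directory (simplified sketch)."""
    imports = set()
    for root, _dirs, files in os.walk(package_dir):
        for name in files:
            if not name.endswith(".py"):
                continue
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8", errors="ignore") as f:
                try:
                    tree = ast.parse(f.read(), filename=path)
                except SyntaxError:
                    continue  # skip files that do not parse
            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    imports.update(alias.name.split(".")[0] for alias in node.names)
                elif isinstance(node, ast.ImportFrom) and node.module:
                    imports.add(node.module.split(".")[0])
    return imports
```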

MartinThoma avatar Dec 06 '15 23:12 MartinThoma

OK, I've just put some more work into it:

  • Download most of the metadata here: https://www.dropbox.com/s/dzqk3rrqzpgmp58/export.7z?dl=0 (2015-12-06 - about 54 MB in 7z compressed format)
  • All packages combined are about 24.5 GB (probably compressed)

If you really want to build the dependency graph, you still have to:

  • implement get_setup_packages in package_analysis.py (a possible approach is sketched after this list)
  • run ./package_analysis for all latest releases
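
If you want a starting point, get_setup_packages could look roughly like this (just a sketch on my side: it statically pulls install_requires out of a setup.py without executing it; the real implementation may need more):

```python
import ast

def get_setup_packages(setup_py_path):
    """Statically extract install_requires from a setup.py (sketch only)."""
    with open(setup_py_path, "r", encoding="utf-8", errors="ignore") as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        # Look for a call like setup(..., install_requires=[...])
        # or setuptools.setup(...).
        if isinstance(node, ast.Call):
            func_name = getattr(node.func, "id", None) or getattr(node.func, "attr", None)
            if func_name == "setup":
                for kw in node.keywords:
                    if kw.arg == "install_requires":
                        try:
                            return ast.literal_eval(kw.value)
                        except ValueError:
                            return []  # not a literal list, would need execution
    return []
```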

This will fill your database with all possible dependencies. Even if you don't implement get_setup_packages, it will probably still add almost all dependencies. However, even with a VERY good internet connection, I expect this will take several days to run. One could parallelize the download of the packages (see the sketch below), but that would still take many hours.
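
Parallelizing the download could be as simple as this sketch (the helper and URL list are hypothetical; bandwidth is the real limit, so it turns days into hours, not minutes):

```python
import concurrent.futures
import os
import urllib.request

def download(url, target_dir="packages"):
    # Hypothetical helper: fetch one release archive into target_dir.
    os.makedirs(target_dir, exist_ok=True)
    filename = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    urllib.request.urlretrieve(url, filename)
    return filename

def download_all(urls, workers=16):
    # Threads are fine here since the work is I/O-bound.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(download, urls))
```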

MartinThoma avatar Dec 07 '15 08:12 MartinThoma

@Fazel94 I've just set the script running over the complete PyPI database. That will take quite a while. And it currently ignores setuptools, which is a major issue (but it was too complicated to make a secure / fast implementation within just a couple of hours - you could add that, if you want).

How would you like to visualize the graph? It has 67,582 nodes and a lot more than 4,600 edges (I'm still downloading / building the graph... it takes a while). You cannot use graphviz for that; one option is sketched below.
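
One option (just a suggestion, assuming the graph is built with networkx): export it to GEXF and open that in Gephi, which copes with graphs of this size far better:

```python
import networkx as nx

# Stand-in for the real dependency graph built from the database.
graph = nx.DiGraph([("flask", "werkzeug"), ("requests", "urllib3")])

# GEXF files can be opened in Gephi for interactive layout of large graphs.
nx.write_gexf(graph, "pypi-dependencies.gexf")
```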

(By the way, do we know each other? Are you a student from KIT, too?)

By now, the most imported module is os, followed (not even closely) by sys, logging, re ... and org. I guess that last one is an error? I have no idea where it comes from.

MartinThoma avatar Dec 07 '15 13:12 MartinThoma