depfinder
Discussion: what to do with the "import name -> package name" mapping from conda-forge
Hi Team,
depfinder has some code in reports.py that does a pretty good job of mapping from "importable module" to "most likely package that has that module". It turns out that this code relies on a part of the bot that has been disabled for a little over a year: the part that generates the files in libcfgraph/import_maps. The import map generation was disabled because it was producing JSON files that were over 100MB in size. So that brings us to my question: what should we do about this?
- on the one hand, having depfinder spit out "these are the packages that you need to install given your imports" is really great.
- on the other hand, the information that depfinder is using to do this is woefully out of date (~13 months old). Is old information better than no information?
Bringing this functionality back into the bot is not something I'm particularly keen to solve right now. It seems like a problem that's very well suited to "use a database for this", but since CF doesn't have access to databases, we're left having to do this with files and git.
If we don't have anyone interested in bringing this functionality back into the bot then my vote would be to disable this feature. We can reconsider bringing it back once the conda-forge bot is providing updated information.
What do you think @beckermr @CJ-Wright @mariusvniekerk?
What are the downsides to disabling this in depfinder?
@ocefpaf too!
Good questions. The bot itself also has code that produces a "ranked hub authorities file" that is related to this. I don't understand that relationship. It might be good to flesh that out a bit maybe?
| idx | file | bot | depfinder |
|---|---|---|---|
| 1 | import_maps_meta.json | file that contains the upper limit of characters in the import_maps/*.json file names | uses to determine which file to go download when looking for the import name -> package artifact relationship |
| 2 | import_maps/*.json | files that used to be produced by the bot. contain mapping of import name to packages that provide the import | depfinder grabs these files to produce the mapping of "all possible packages that could provide this import" |
| 3 | .file_listing.json | bot produces this file as a sorted list of all of the artifacts that are currently on conda-forge | depfinder uses this file to make a mapping of full_package_string : package_name, e.g., 21cmfast-3.0.2-py36h1af98f8_1 : 21cmfast |
| 4 | ranked_hubs_authorities.json | file produced by the bot that attempts to score packages based on the number of other packages that depend on them, among other things | used by depfinder to determine the "most likely package name" given an import |
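To make the full_package_string : package_name relationship in row 3 concrete, here is a minimal sketch. It is not depfinder's actual code. It assumes the usual conda artifact naming of name-version-build, where the version and build string contain no hyphens; the helper name `package_name_from_artifact` is made up for illustration.

```python
def package_name_from_artifact(full_package_string: str) -> str:
    """Return the package name from a 'name-version-build' artifact string.

    Assumption: version and build never contain '-', but the name may,
    so we split from the right at most twice.
    """
    name, _version, _build = full_package_string.rsplit("-", 2)
    return name


# Example from the table above.
print(package_name_from_artifact("21cmfast-3.0.2-py36h1af98f8_1"))  # 21cmfast
```

Splitting from the right matters because package names themselves can contain hyphens.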
ok so how does depfinder use these files?
A. Given import_name, figure out which import_maps/* file needs to be downloaded, then download that file. Open up that file and grab all of the artifacts that provide import_name. This step uses rows 1 & 2 above.
B. Download file_listing.json (row 3 above) and make a mapping of full_package_string to package_name. For each of the artifacts pulled out in step A, look up its package_name in that mapping.
C. Given one or more package_names from step B, grab the first one that appears in the ranked list in ranked_hubs_authorities.json (row 4 in the table above).
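Steps A through C can be sketched roughly as below. This is an illustration with in-memory toy data, not depfinder's implementation: the function names, the sharding rule (lowercased prefix of the import name), the shapes of the dictionaries, and the packages `pkg-a` and `pkg-b` are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


def shard_key(import_name: str, max_chars: int) -> str:
    # Assumption: import map files are named by a lowercased prefix of the
    # import name, capped at the limit stored in import_maps_meta.json.
    return import_name[:max_chars].lower()


def most_likely_package(
    import_name: str,
    max_chars: int,
    import_maps: Dict[str, Dict[str, Set[str]]],
    file_listing: List[str],
    ranked_packages: List[str],
) -> Optional[str]:
    # Step A: pick the shard, then collect artifacts providing the import.
    shard = import_maps.get(shard_key(import_name, max_chars), {})
    artifacts = shard.get(import_name, set())

    # Step B: map each full artifact string to its package name.
    artifact_to_name = {a: a.rsplit("-", 2)[0] for a in file_listing}
    candidates = {artifact_to_name[a] for a in artifacts if a in artifact_to_name}

    # Step C: return the first candidate that appears in the ranked list.
    for name in ranked_packages:
        if name in candidates:
            return name
    return None


# Toy data standing in for the downloaded files (hypothetical packages).
import_maps = {"sh": {"shared": {"pkg-a-1.0-py_0", "pkg-b-2.0-py_0"}}}
file_listing = ["pkg-a-1.0-py_0", "pkg-b-2.0-py_0", "numpy-1.21.0-py39_0"]
ranked = ["numpy", "pkg-b", "pkg-a"]

print(most_likely_package("shared", 2, import_maps, file_listing, ranked))  # pkg-b
```

Both toy packages provide `shared`, and `pkg-b` wins only because it appears earlier in the ranked list.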
Does this help @beckermr ?
Helps a bit but files in libcfgraph are not made by the bot. So I think depfinder uses two services.
oh. weird. ok. i guess cf-scripts only writes to cf-graph-countyfair?
in that case, seems like the pypi_name_mapping github action produces import_name_priority_mapping.json that I could use instead
the above file is a data structure that looks like this:
[
{"import_name": "ATE", "ranked_conda_names": ["semi-ate"]},
{"import_name": "AWSIoTPythonSDK", "ranked_conda_names": ["awsiotpythonsdk"]},
...
]
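A rough sketch of how a list shaped like this could be turned into a lookup. The helper names are hypothetical, and it assumes the first entry of ranked_conda_names is the preferred package. An import that is absent from the file yields None, since the mapping may not cover every import.

```python
import json
from typing import Dict, List, Optional

# Two entries copied from the excerpt above; the real file is much longer.
raw = """[
  {"import_name": "ATE", "ranked_conda_names": ["semi-ate"]},
  {"import_name": "AWSIoTPythonSDK", "ranked_conda_names": ["awsiotpythonsdk"]}
]"""


def build_lookup(entries: List[dict]) -> Dict[str, List[str]]:
    # Index the list by import name for constant-time lookups.
    return {e["import_name"]: e["ranked_conda_names"] for e in entries}


def best_package(import_name: str, lookup: Dict[str, List[str]]) -> Optional[str]:
    # Only the top-level module is used as the key (assumption).
    top_level = import_name.split(".")[0]
    ranked = lookup.get(top_level)
    return ranked[0] if ranked else None


lookup = build_lookup(json.loads(raw))
print(best_package("ATE", lookup))  # semi-ate
```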
so what writes to libcfgraph then? oh there's a circleci action that updates libcfgraph i guess? what does libcfgraph do?
Right, there is a circleci action that writes to libcfgraph. libcfgraph collects info about every package into a single repo. It is used by a bunch of conda-forge stuff, including the mamba solver for run exports and our scanning service that tries to detect harmful files in packages.
IDK if the import name priority mapping is complete or only covers nodes that are ambiguous. Also note that grayskull uses some of this data too. :/
As usual, the answer is to fix libcfgraph and just keep the status quo. We don't have the resources to pay down debt, but we can service it.
@ericdill New import to pkg maps are appearing here: https://github.com/regro/libcfgraph/tree/master/import_to_pkg_maps
These only have the package name and not the full artifact. They should be a lot smaller.
@ericdill I'm a bit late to this discussion, but my opinion is the same as before we added this to depfinder. It is a nice feature to have, but I'd rather have it as a plugin, an optional extra, or a separate module than inside depfinder itself, in order to reduce the maintenance burden here.
Agreed. We should ship a package of simple apis for pulling this metadata.
This has the nice side effect that if the data is moved to another device we can easily move everything over.
oh that's a nice idea. would we make that new package part of the regro org?
> is the same as before
thanks @ocefpaf . i had forgotten the previous discussion. glad you recall!
> oh that's a nice idea. would we make that new package part of the regro org?
Sure. That's the best spot. Something like conda-forge-tick-data would be fine.
So grayskull doesn't pull from the bot data for these maps anymore. It maintains its own list of differences. They may have come from the bot at one time, but now it is separated.
The data used by depfinder is now wrapped into this package: https://github.com/regro/conda-forge-metadata
Here is how to use it
from conda_forge_metadata.autotick_bot import map_import_to_package

def test_map_import_to_package():
    assert map_import_to_package("numpy") == "numpy"
    assert map_import_to_package("numpy.linalg") == "numpy"
    # something bespoke
    assert map_import_to_package("eastlake") == "des-eastlake"
    assert map_import_to_package("scipy") == "scipy"
The reduced-size mapping is now erroring out too: https://github.com/regro/libcfgraph/issues/14
:D