depfinder Discussion: what to do with the "import name -> package name" mapping from conda-forge

Hi Team,

depfinder has some code in reports.py that does a pretty good job of mapping from "importable module" to "most likely package that provides that module". It turns out that this code relies on a part of the bot that has been disabled for a little over a year: the part that generates the files in libcfgraph/import_maps. The import map generation was disabled because it was producing JSON files that were over 100MB in size. So that brings me to my question: what should we do about this?

On the one hand, having depfinder spit out "these are the packages that you need to install given your imports" is really great.
On the other hand, the information that depfinder uses to do this is woefully out of date (~13 months old). Is old information better than no information?

Bringing this functionality back into the bot is not something I'm particularly keen to solve right now. It seems like a problem that's very well suited to "use a database for this", but since CF doesn't have access to databases, we're left having to do this with files and git.

If we don't have anyone interested in bringing this functionality back into the bot then my vote would be to disable this feature. We can reconsider bringing it back once the conda-forge bot is providing updated information.

What do you think @beckermr @CJ-Wright @mariusvniekerk?

What are the downsides to disabling this in depfinder?

Mar 05 '23 14:03 ericdill

@ocefpaf too!

Mar 05 '23 14:03 ericdill

Good questions. The bot itself also has code that produces a "ranked hub authorities file" that is related to this. I don't understand that relationship. It might be good to flesh that out a bit maybe?

Mar 05 '23 16:03 beckermr

| idx | file | bot | depfinder |
|---|---|---|---|
| 1 | `import_maps_meta.json` | file that contains the upper limit of characters in the import_maps/*.json file names | uses to determine which file to go download when looking for the import name -> package artifact relationship |
| 2 | `import_maps/*.json` | files that used to be produced by the bot. contain mapping of import name to packages that provide the import | depfinder grabs these files to produce the mapping of "all possible packages that could provide this import" |
| 3 | `.file_listing.json` | bot produces this file as a sorted list of all of the artifacts that are currently on conda-forge | depfinder uses this file to make a mapping of `full_package_string : package_name`, e.g., `21cmfast-3.0.2-py36h1af98f8_1 : 21cmfast` |
| 4 | `ranked_hubs_authorities.json` | file produced by the bot that attempts to score packages based on the number of other packages that depend on them, among other things | used by depfinder to determine the "most likely package name" given an import |

ok so how does depfinder use these files?

A. Given import_name, figure out which import_maps/* file needs to be downloaded, then download that file. Open up that file, grab all of the artifacts that provide import_name. This step uses rows 1 & 2 above.

B. Download file_listing.json (row 3 above) and make a mapping of full_package_string to package_name. For each of the artifacts pulled out in step A, figure out their package_name from the mapping that we make in this step (step B).

C. Given 1 or more package_name's from step B, grab the first one that appears in the ranked list in ranked_hubs_authorities.json (row 4 in the table above).
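
Steps A through C can be sketched roughly as follows. This is a minimal illustration and not depfinder's actual code: it works on in-memory stand-ins instead of downloading anything, and the function names, the shape of the shard dictionary, and the sample data are all assumptions made up for the example. Splitting the artifact string on its last two hyphens relies on the usual conda `name-version-build` layout.

```python
from typing import Dict, List, Optional


def artifact_to_package(artifact: str) -> str:
    """Strip version and build from a conda-style 'name-version-build' string."""
    return artifact.rsplit("-", 2)[0]


def most_likely_package(
    import_name: str,
    max_chars: int,                          # stand-in for the limit in import_maps_meta.json
    shards: Dict[str, Dict[str, List[str]]], # stand-in for the import_maps/*.json files
    file_listing: List[str],                 # stand-in for the artifact listing
    ranked: List[str],                       # stand-in for ranked_hubs_authorities.json
) -> Optional[str]:
    # Step A: pick the shard by import-name prefix (assumed scheme), then
    # collect every artifact that claims to provide the import.
    shard = shards.get(import_name[:max_chars], {})
    artifacts = shard.get(import_name, [])

    # Step B: map full_package_string -> package_name and resolve the artifacts.
    full_to_name = {a: artifact_to_package(a) for a in file_listing}
    candidates = {full_to_name[a] for a in artifacts if a in full_to_name}

    # Step C: the first candidate that appears in the ranked list wins.
    for name in ranked:
        if name in candidates:
            return name
    return None


# Illustrative data only; "example-fork" is a made-up package.
shards = {"py2": {"py21cmfast": ["21cmfast-3.0.2-py36h1af98f8_1", "example-fork-0.1-0"]}}
file_listing = ["21cmfast-3.0.2-py36h1af98f8_1", "example-fork-0.1-0"]
ranked = ["python", "21cmfast", "example-fork"]

print(most_likely_package("py21cmfast", 3, shards, file_listing, ranked))
```

The ranking in step C is what breaks ties when several packages ship the same importable module.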

Does this help @beckermr ?

Mar 05 '23 17:03 ericdill

Helps a bit but files in libcfgraph are not made by the bot. So I think depfinder uses two services.

Mar 05 '23 18:03 beckermr

oh. weird. ok. i guess cf-scripts only writes to cf-graph-countyfair?

in that case, seems like the pypi_name_mapping github action produces import_name_priority_mapping.json that I could use instead

the above file is a data structure that looks like this:

[
  {"import_name": "ATE", "ranked_conda_names": ["semi-ate"]}, 
  {"import_name": "AWSIoTPythonSDK", "ranked_conda_names": ["awsiotpythonsdk"]}, 
  ...
]
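
For illustration, here is a rough sketch of how a structure like that could be turned into a lookup. The helper names are hypothetical, the sample is just the two entries shown above, and it assumes that `ranked_conda_names` is ordered best-first and that only the top-level module of a dotted import matters. Neither assumption is confirmed in this thread.

```python
import json
from typing import Dict, List, Optional

# The two entries quoted above, used as sample input.
SAMPLE = """[
  {"import_name": "ATE", "ranked_conda_names": ["semi-ate"]},
  {"import_name": "AWSIoTPythonSDK", "ranked_conda_names": ["awsiotpythonsdk"]}
]"""


def build_index(raw_json: str) -> Dict[str, List[str]]:
    """Turn the list of records into a dict keyed by import name."""
    return {
        entry["import_name"]: entry["ranked_conda_names"]
        for entry in json.loads(raw_json)
    }


def best_conda_name(index: Dict[str, List[str]], import_name: str) -> Optional[str]:
    """Return the first ranked conda name for an import, or None if unknown."""
    top_level = import_name.split(".")[0]  # assumption: only the top-level module matters
    names = index.get(top_level, [])
    return names[0] if names else None


index = build_index(SAMPLE)
print(best_conda_name(index, "ATE"))
```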

so what writes to libcfgraph then? oh there's a circleci action that updates libcfgraph i guess? what does libcfgraph do?

Mar 05 '23 18:03 ericdill

Right there is a circleci action that writes to libcfgraph. libcfgraph collects info about every package into a single repo. It is used by a bunch of conda-forge stuff including the mamba solver for run exports and our scanning service to try and detect harmful files in packages.

Mar 05 '23 19:03 beckermr

IDK if the import name priority mapping is complete or only covers nodes that are ambiguous. Also note that grayskull uses some of this data too. :/

Mar 05 '23 19:03 beckermr

As usual, the answer is to fix libcfgraph and just keep the status quo. We don't have the resources to pay down debt, but we can service it.

Mar 05 '23 20:03 beckermr

@ericdill New import to pkg maps are appearing here: https://github.com/regro/libcfgraph/tree/master/import_to_pkg_maps

These only have the package name and not the full artifact. They should be a lot smaller.

Mar 06 '23 13:03 beckermr

@ericdill I'm a bit late to this discussion, but my opinion is the same as before we added this to depfinder. It is a nice feature to have, but I'd rather have it as a plugin/optional/separate module, etc. than inside depfinder itself, in order to reduce the maintenance burden here.

Mar 06 '23 13:03 ocefpaf

Agreed. We should ship a package of simple apis for pulling this metadata.

Mar 06 '23 13:03 beckermr

This has the nice side effect that if the data is moved to another device we can easily move everything over.

Mar 06 '23 13:03 beckermr

oh that's a nice idea. would we make that new package part of the regro org?

Mar 06 '23 19:03 ericdill

> is the same as before

thanks @ocefpaf . i had forgotten the previous discussion. glad you recall!

Mar 06 '23 19:03 ericdill

> oh that's a nice idea. would we make that new package part of the regro org?

Sure. That's the best spot. Something like conda-forge-tick-data would be fine.

Mar 06 '23 19:03 beckermr

So grayskull doesn't pull from the bot data for these maps anymore. It maintains its own list of differences. They may have come from the bot at one time, but now it is separated.

Mar 12 '23 11:03 beckermr

The data used by depfinder is now wrapped into this package: https://github.com/regro/conda-forge-metadata

Here is how to use it:

from conda_forge_metadata.autotick_bot import map_import_to_package


def test_map_import_to_package():
    assert map_import_to_package("numpy") == "numpy"
    assert map_import_to_package("numpy.linalg") == "numpy"

    # something bespoke
    assert map_import_to_package("eastlake") == "des-eastlake"

    assert map_import_to_package("scipy") == "scipy"
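
As a rough sketch of how a tool like depfinder might consume this, the helper below collapses a list of imports into the set of packages to install. `packages_for_imports` and `fake_mapper` are hypothetical names invented for this example. The mapper is passed in as a parameter so the sketch runs without network access, and the stand-in table only repeats the mappings asserted in the test above.

```python
from typing import Callable, Iterable, List


def packages_for_imports(
    imports: Iterable[str], mapper: Callable[[str], str]
) -> List[str]:
    """Return the sorted, de-duplicated package names for a collection of imports."""
    return sorted({mapper(name) for name in imports})


# With conda-forge-metadata installed, the real mapper could be passed in:
#   from conda_forge_metadata.autotick_bot import map_import_to_package
#   packages_for_imports(["numpy.linalg", "scipy"], map_import_to_package)


def fake_mapper(import_name: str) -> str:
    """Offline stand-in for the real mapper (illustrative only)."""
    table = {"numpy": "numpy", "eastlake": "des-eastlake"}
    top_level = import_name.split(".")[0]
    # Falling back to the top-level name is a choice made for this stand-in,
    # not a statement about how the real function behaves.
    return table.get(top_level, top_level)


print(packages_for_imports(["numpy", "numpy.linalg", "eastlake"], fake_mapper))
```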

Mar 12 '23 11:03 beckermr

The reduced-size mapping is now erroring out too: https://github.com/regro/libcfgraph/issues/14

:D

Jun 20 '23 15:06 jaimergp