fix: show correct package name to install biopython

Open BioGeek opened this issue 1 year ago • 0 comments

When you try to load a dataset whose loading script imports biopython, for example with load_dataset("InstaDeepAI/multi_species_genomes"), you get the following error:

>>> from datasets import load_dataset
>>> dataset = load_dataset("InstaDeepAI/multi_species_genomes")
/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py:1454: FutureWarning: The repository for InstaDeepAI/multi_species_genomes contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/InstaDeepAI/multi_species_genomes
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.51k/7.51k [00:00<00:00, 7.67MB/s]
Downloading readme: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17.2k/17.2k [00:00<00:00, 11.0MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 2548, in load_dataset
    builder_instance = load_dataset_builder(
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 2220, in load_dataset_builder
    dataset_module = dataset_module_factory(
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1871, in dataset_module_factory
    raise e1 from None
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1844, in dataset_module_factory
    ).get_module()
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1466, in get_module
    local_imports = _download_additional_modules(
  File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 346, in _download_additional_modules
    raise ImportError(
ImportError: To be able to use InstaDeepAI/multi_species_genomes, you need to install the following dependency: Bio.
Please install it using 'pip install Bio' for instance.
>>> 

The Bio module is provided by the biopython package, which is installed with pip install biopython, not with pip install Bio as the error message suggests.

This PR adds special logic to show the correct package name in the error message of _download_additional_modules, similar to what is already done for sklearn / scikit-learn.

There are more packages where the importable module name differs from the PyPI package name, so this could be made more generic, for example:

# Mapping of importable module names to their PyPI package names
package_map = {
    "sklearn": "scikit-learn",
    "Bio": "biopython",
    "PIL": "Pillow",
    "bs4": "beautifulsoup4",
}

# Replace the module name with the pip-installable name where they differ
for module_name, pypi_name in package_map.items():
    if module_name in needs_to_be_installed:
        needs_to_be_installed[module_name] = pypi_name
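
To make the mapping idea above easier to try out in isolation, here is a minimal, self-contained sketch. It is not the code in datasets/load.py: the names MODULE_TO_PYPI and build_install_hint are hypothetical, and the message wording is only modelled on the ImportError shown in the traceback. Modules without a special mapping fall back to their own name.

```python
# Hypothetical sketch, not the actual implementation in datasets/load.py.
# Maps importable module names to the PyPI package that provides them.
MODULE_TO_PYPI = {
    "sklearn": "scikit-learn",
    "Bio": "biopython",
    "PIL": "Pillow",
    "bs4": "beautifulsoup4",
}


def build_install_hint(dataset_name, missing_modules):
    """Return an error message that names pip-installable packages."""
    # Fall back to the module name itself when no special mapping is known;
    # the set removes duplicates and sorting keeps the output deterministic.
    pypi_names = sorted({MODULE_TO_PYPI.get(name, name) for name in missing_modules})
    plural = len(pypi_names) > 1
    noun = "dependencies" if plural else "dependency"
    pronoun = "them" if plural else "it"
    return (
        f"To be able to use {dataset_name}, you need to install the following "
        f"{noun}: {', '.join(pypi_names)}.\n"
        f"Please install {pronoun} using 'pip install {' '.join(pypi_names)}' for instance."
    )


if __name__ == "__main__":
    print(build_install_hint("InstaDeepAI/multi_species_genomes", ["Bio"]))
```

Keeping the mapping in a single dictionary means supporting another mismatched name is a one-line change, and unknown modules keep the current behaviour of suggesting the module name itself.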

BioGeek avatar Feb 13 '24 14:02 BioGeek