datasets
datasets copied to clipboard
fix: show correct package name to install biopython
When you try to download a dataset that uses biopython, like load_dataset("InstaDeepAI/multi_species_genomes")
, you get the error:
>>> from datasets import load_dataset
>>> dataset = load_dataset("InstaDeepAI/multi_species_genomes")
/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py:1454: FutureWarning: The repository for InstaDeepAI/multi_species_genomes contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/InstaDeepAI/multi_species_genomes
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
Downloading builder script: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.51k/7.51k [00:00<00:00, 7.67MB/s]
Downloading readme: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 17.2k/17.2k [00:00<00:00, 11.0MB/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 2548, in load_dataset
builder_instance = load_dataset_builder(
File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 2220, in load_dataset_builder
dataset_module = dataset_module_factory(
File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1871, in dataset_module_factory
raise e1 from None
File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1844, in dataset_module_factory
).get_module()
File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 1466, in get_module
local_imports = _download_additional_modules(
File "/home/j.vangoey/.pyenv/versions/multi_species_genomes/lib/python3.10/site-packages/datasets/load.py", line 346, in _download_additional_modules
raise ImportError(
ImportError: To be able to use InstaDeepAI/multi_species_genomes, you need to install the following dependency: Bio.
Please install it using 'pip install Bio' for instance.
>>>
Bio
comes from the biopython
package that can be installed with pip install biopython
, not with pip install Bio
as suggested.
This PR adds special logic to show the correct package name in the error message of _download_additional_modules
, similarly as is done for sklearn
/ scikit-learn
already.
There are more packages where importable module name differs from the PyPI package name, so this could be made more generic, like:
# Mapping of importable module names to their PyPI package names
package_map = {
"sklearn": "scikit-learn",
"Bio": "biopython",
"PIL": "Pillow",
"bs4": "beautifulsoup4"
}
for module_name, pypi_name in package_map.items():
if module_name in needs_to_be_installed.keys():
needs_to_be_installed[module_name] = pypi_name