
Aliases for filenames

Open michaelaye opened this issue 7 months ago • 7 comments

Hi! Thanks for your awesome tool!

I wonder if you would consider an aliasing feature, if we can add it without breaking anyone's existing code, maybe with an additional substructure passed to the Pooch creator? The use case should be obvious, I hope: sometimes the stored data file name is awful for simple access, and I wouldn't want to use that file name anywhere apart from in a registry. So one needs a shorter alias for a given file.

Here's a real life example that I use:

file_aliases = {
    "fans": "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip",
    "blotches": "P4_catalog_v1.1_L1C_cut_0.5_blotch.csv.zip",
    "metadata": "P4_catalog_v1.1_metadata.csv.zip",
    "tile_coords": "P4_catalog_v1.1_tile_coords_final.csv.zip",
    "raw_data": "P4_catalog_v1.0_raw_classifications.hdf.zip",
    "intermediate": "P4_catalog_v1.0_pipeline_products.zip",
    "region_names": "region_names.zip",
    "tile_urls": "tile_urls.csv.zip",
}

# Create registry with pooch
v1 = pooch.create(
    path=pooch.os_cache("p4tools"),  # Local storage location
    base_url="https://zenodo.org/record/8102805/files/",  # Remote data location
    registry={  # File registry: Mapping between file names and their hashes
        "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip": "md5:71ff51ff79d6e975f704f19b1996d8ea",
        "P4_catalog_v1.1_L1C_cut_0.5_blotch.csv.zip": "md5:f4d0c101f65abbaf34e092620133d56e",
        "P4_catalog_v1.1_metadata.csv.zip": "md5:c0dc46e0fc3d259c30afaec412074eae",
        "P4_catalog_v1.1_tile_coords_final.csv.zip": "md5:6b9a917a6997f1aa01cfef4322cabd81",
        "P4_catalog_v1.0_raw_classifications.hdf.zip": "md5:39a8909590fe9f816454db93f0027d2c",
        "P4_catalog_v1.0_pipeline_products.zip": "md5:6544bf0c7851eedd4783859c0adc42d7",
        "region_names.zip": "md5:9101c7a0f8e248c9ffe9c07869da5635",
        "tile_urls.csv.zip": "md5:5717c8379d453cf4b11a5f5775f5fb6e",
    }
)

Now, instead of this, one could imagine another option for pooch.create that adds aliases for the file names:

# Create registry with pooch
v1 = pooch.create(
    path=pooch.os_cache("p4tools"),  # Local storage location
    base_url="https://zenodo.org/record/8102805/files/",  # Remote data location
    registry={  # File registry: Mapping between file names and their hashes
        "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip": "md5:71ff51ff79d6e975f704f19b1996d8ea",
        "P4_catalog_v1.1_L1C_cut_0.5_blotch.csv.zip": "md5:f4d0c101f65abbaf34e092620133d56e",
        "P4_catalog_v1.1_metadata.csv.zip": "md5:c0dc46e0fc3d259c30afaec412074eae",
        "P4_catalog_v1.1_tile_coords_final.csv.zip": "md5:6b9a917a6997f1aa01cfef4322cabd81",
        "P4_catalog_v1.0_raw_classifications.hdf.zip": "md5:39a8909590fe9f816454db93f0027d2c",
        "P4_catalog_v1.0_pipeline_products.zip": "md5:6544bf0c7851eedd4783859c0adc42d7",
        "region_names.zip": "md5:9101c7a0f8e248c9ffe9c07869da5635",
        "tile_urls.csv.zip": "md5:5717c8379d453cf4b11a5f5775f5fb6e",
    },
    aliases=file_aliases,
)

This would enable users to fetch a file by its alias instead of the cumbersome long file name:

fans = v1.fetch('fans')


**Are you willing to help implement and maintain this feature?** 
Yep, that's fine!

michaelaye avatar May 15 '25 19:05 michaelaye

Would using the urls parameter work as a solution here? Like in this demo. This has worked for me when I was looking for something similar in the past. My example below will store your local files without extensions, but for me that was acceptable.

If this solution works for you, note you can also include the URLs as a third column in registry files (example).

# Create registry with pooch
v1 = pooch.create(
    path=pooch.os_cache("p4tools"),  # Local storage location
    base_url="https://zenodo.org/record/8102805/files/",  # Remote data location
    registry={  # File registry: Mapping between file names and their hashes
        "fans": "md5:71ff51ff79d6e975f704f19b1996d8ea",
        "blotches": "md5:f4d0c101f65abbaf34e092620133d56e",
        "metadata": "md5:c0dc46e0fc3d259c30afaec412074eae",
        "tile_coords": "md5:6b9a917a6997f1aa01cfef4322cabd81",
        "raw_data": "md5:39a8909590fe9f816454db93f0027d2c",
        "intermediate": "md5:6544bf0c7851eedd4783859c0adc42d7",
        "region_names": "md5:9101c7a0f8e248c9ffe9c07869da5635",
        "tile_urls": "md5:5717c8379d453cf4b11a5f5775f5fb6e",
    },
    urls={  # File URLs: Mapping between the filenames for local storage and the URLs to download from
        "fans": "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip",
        "blotches": "P4_catalog_v1.1_L1C_cut_0.5_blotch.csv.zip",
        "metadata": "P4_catalog_v1.1_metadata.csv.zip",
        "tile_coords": "P4_catalog_v1.1_tile_coords_final.csv.zip",
        "raw_data": "P4_catalog_v1.0_raw_classifications.hdf.zip",
        "intermediate": "P4_catalog_v1.0_pipeline_products.zip",
        "region_names": "region_names.zip",
        "tile_urls": "tile_urls.csv.zip",
    }
)

fans = v1.fetch("fans")
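For the registry-file variant mentioned above, a minimal sketch of generating such a file could look like this. The helper name `registry_lines` and the output file name `registry.txt` are made up for illustration, and the sketch assumes full download URLs in the third column and that the `md5:` prefix is accepted in registry files the same way it is in the dictionary form.

```python
# Hypothetical helper: build registry-file lines of the form
# "<local name> <hash> <url>", one line per file.
def registry_lines(hashes, urls):
    return [f"{name} {hashes[name]} {urls[name]}" for name in sorted(hashes)]


BASE_URL = "https://zenodo.org/record/8102805/files/"

# Hashes keyed by alias (two entries only, to keep the sketch short).
hashes = {
    "fans": "md5:71ff51ff79d6e975f704f19b1996d8ea",
    "blotches": "md5:f4d0c101f65abbaf34e092620133d56e",
}

# Full download URLs keyed by the same aliases.
urls = {
    "fans": BASE_URL + "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip",
    "blotches": BASE_URL + "P4_catalog_v1.1_L1C_cut_0.5_blotch.csv.zip",
}

registry_text = "\n".join(registry_lines(hashes, urls)) + "\n"

# To use it, write registry_text to a file (for example "registry.txt")
# and load it with v1.load_registry("registry.txt").
```

The same caveat applies as for the dictionary version: the aliases become the local file names.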

remrama avatar May 16 '25 16:05 remrama

ok, that kinda works. the only thing it clashes with is the possibility to download the registry from the doi server, e.g. zenodo. but one can mangle these around with a bit of scripting...

michaelaye avatar May 23 '25 16:05 michaelaye

hm, it doesn't work:

Image

michaelaye avatar May 23 '25 16:05 michaelaye

maybe the urls are not allowed to be partial?

michaelaye avatar May 23 '25 16:05 michaelaye

yep, they need to be complete URLs:

Image
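A rough sketch of what that could look like, assuming the files all live under the same base URL: the helper `alias_registry` is a made-up name, not part of Pooch, and it simply re-keys the hashes by alias and joins the base URL with each file name to get complete URLs.

```python
# Hypothetical helper (not part of Pooch): turn a registry keyed by real
# file names into one keyed by aliases, plus a matching dict of full URLs.
def alias_registry(base_url, registry, aliases):
    alias_hashes = {alias: registry[fname] for alias, fname in aliases.items()}
    alias_urls = {alias: base_url + fname for alias, fname in aliases.items()}
    return alias_hashes, alias_urls


BASE_URL = "https://zenodo.org/record/8102805/files/"

# Two entries only, to keep the sketch short.
registry = {
    "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip": "md5:71ff51ff79d6e975f704f19b1996d8ea",
    "region_names.zip": "md5:9101c7a0f8e248c9ffe9c07869da5635",
}
file_aliases = {
    "fans": "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip",
    "region_names": "region_names.zip",
}

alias_hashes, alias_urls = alias_registry(BASE_URL, registry, file_aliases)


def make_pooch():
    """Create the Pooch instance using the alias-keyed registry and full URLs."""
    import pooch  # third-party; imported here so the helper above stays stdlib-only

    return pooch.create(
        path=pooch.os_cache("p4tools"),
        base_url=BASE_URL,
        registry=alias_hashes,
        urls=alias_urls,
    )
```

With this, `make_pooch().fetch("fans")` should download from the complete URL, though the file is still stored locally under the alias.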

michaelaye avatar May 23 '25 17:05 michaelaye

a further disadvantage of this hack is that the aliases are used as file names on the local filesystem, which is quite inelegant and goes against the convention of having an extension that indicates the file type and/or compression status.

michaelaye avatar May 23 '25 22:05 michaelaye

Yes, indeed this is most definitely a hack, and for the reasons you pointed out it has limited use cases. Thanks for showing these limitations.

remrama avatar May 25 '25 20:05 remrama

@michaelaye thanks for opening the issue! I've thought about this but decided against adding the feature. Mostly because we want to keep Pooch quite simple and small since it's being used as a dependency of other larger packages. Our idea is that end users of packages wouldn't interact with Pooch directly and instead packages offer more friendly functions that don't require the file name. To be honest, I regret the way we implemented the registry but we're pretty much stuck with it now.

That's to say that I'm not in favor of adding the alias feature here since it would be a challenge to make it completely backwards compatible without being clunky. It could be done on your end, though, in 2 ways:

  1. Use a dictionary to map between alias and file name: v1.fetch(fnames["fans"])
  2. Implement a wrapper function and don't expose users to Pooch:
def fetch_fans():
    return v1.fetch("P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip")

Option 2 is what most projects do and what I'd recommend.
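Both options can be combined on the package side. The following is only a sketch: `FNAMES`, `fetch_alias`, and the `pup` argument are illustrative names, and `pup` is assumed to be the object returned by pooch.create (anything with a `fetch(fname)` method works).

```python
# Alias -> real file name, kept in the package rather than in Pooch.
FNAMES = {
    "fans": "P4_catalog_v1.1_L1C_cut_0.5_fan.csv.zip",
    "blotches": "P4_catalog_v1.1_L1C_cut_0.5_blotch.csv.zip",
    "metadata": "P4_catalog_v1.1_metadata.csv.zip",
}


def fetch_alias(pup, alias):
    """Option 1: resolve the alias, then fetch the real file name."""
    try:
        fname = FNAMES[alias]
    except KeyError:
        known = ", ".join(sorted(FNAMES))
        raise ValueError(f"Unknown alias {alias!r}; known aliases: {known}") from None
    return pup.fetch(fname)


def fetch_fans(pup):
    """Option 2: a dedicated wrapper, so users never see the file name."""
    return fetch_alias(pup, "fans")
```

Because the real file names are what Pooch sees, the local files keep their extensions and a registry keyed by the original names stays usable as is.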

leouieda avatar Aug 26 '25 20:08 leouieda