Rethinking Zarr's core dependencies
I'd like to open the conversation about what Zarr's core dependencies are for 3.0. Currently, this looks like:
https://github.com/zarr-developers/zarr-python/blob/11312534ebe683d73cbbcc2da9e88933cb00cc14/pyproject.toml#L25-L34
Some of these are not used anymore (asciitree and fasteners) so those can safely go.
Then there is fsspec and crc32c. These are only needed for the RemoteStore and ShardingCodec, respectively. What do we think about making these optional?
One proposed diff to our dependencies would look something like:
dependencies = [
- 'asciitree',
'numpy>=1.25',
- 'fasteners',
- 'numcodecs>=0.10.2',
- 'fsspec>2024',
- 'crc32c',
+ 'numcodecs>=0.12',
'typing_extensions',
'donfig',
]
[project.optional-dependencies]
+remote = [
+ "fsspec",
+]
+sharding = [
+ "crc32c",
+]
Notes:
- fsspec is pure Python with no required dependencies, so it is not a particularly heavy dependency.
- crc32c could potentially move into numcodecs, right?
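If fsspec and crc32c did become optional extras, the code paths that need them would have to fail with a clear message when the package is missing. Below is a minimal sketch of one way that could look. The helper name `require_optional` is hypothetical and not part of zarr-python, and the extra names `remote` and `sharding` are taken from the proposed diff above.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise with an install hint.

    Hypothetical helper for illustration only; `extra` is the name of the
    optional-dependency group (e.g. "remote" or "sharding") assumed to
    provide `module_name`.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Re-raise with a message that tells the user which extra to install.
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with `pip install 'zarr[{extra}]'`"
        ) from err
```

With something like this, a store or codec would call `require_optional("fsspec", "remote")` only at the point where it is actually used, so `import zarr` itself would keep working without the extra installed.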
👍 this seems good to me.
I think sharding is a big enough part of what zarr v3 promises, that it's worth having crc32c as part of the default dependencies. Looking at their files on PyPI the package is very light (~40kB), and it doesn't have any other requirements.
fsspec is also small (200kB), so I wonder if it's worth keeping as a default too, so users don't have to jump through extra hoops to open remote arrays? Given that a major use case of zarr is as a format for large data, a lot of the time users will be accessing it remotely.
What are the reasons for removing these? Definitely open to considering it, but given they're lightweight deps at the moment I'm thinking we should keep them as default.
> I think sharding is a big enough part of what zarr v3 promises, that it's worth having crc32c as part of the default dependencies. Looking at their files on PyPI the package is very light (~40kB), and it doesn't have any other requirements.
Is there a reason why we shouldn't put sharding in numcodecs? Then the crc32c dependency would live there.
👍 for that
Here's my thought on fsspec. While I agree that the package dependency is not particularly large, it also doesn't come with batteries included: you still need s3fs, gcsfs, adlfs, etc. to use the RemoteStore. I imagine we're all aligned on keeping each of the individual implementations out of the required dependency tree. I guess my perspective is that if all of those are optional, and they all depend on fsspec, then we don't gain much by requiring fsspec.
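To make the "batteries not included" point concrete, here is a rough sketch of checking which packages a given remote protocol would need. The function `missing_packages` and the `BACKENDS` table are hypothetical illustrations, not zarr or fsspec API; the table only lists the three backend packages mentioned in this thread.

```python
from importlib.util import find_spec
from typing import Callable

# Protocol -> package that provides the fsspec implementation.
# Illustrative subset only; not an exhaustive or authoritative list.
BACKENDS = {"s3": "s3fs", "gs": "gcsfs", "abfs": "adlfs"}


def _is_installed(name: str) -> bool:
    # find_spec returns None when a top-level module cannot be found.
    return find_spec(name) is not None


def missing_packages(
    protocol: str,
    is_installed: Callable[[str], bool] = _is_installed,
) -> list[str]:
    """Return the packages still needed to open a URL with `protocol`."""
    needed = ["fsspec"]
    if protocol in BACKENDS:
        needed.append(BACKENDS[protocol])
    return [name for name in needed if not is_installed(name)]
```

On an environment with only fsspec installed, `missing_packages("s3")` would still report `["s3fs"]`, which is the sense in which requiring fsspec alone does not get a user to a working remote store.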
@d-v-b and/or @dstansby - can one of you open an issue on crc32c in numcodecs?
That makes sense to me on fsspec - would be good to add some docs if it's optional, I'll stick a request on https://github.com/zarr-developers/zarr-python/pull/2395.
I opened an issue for crc32c at https://github.com/zarr-developers/numcodecs/issues/610
I also think that we should only drop crc32c as a core zarr dependency once it is part of numcodecs. It would suck if people had to install additional groups to be able to use sharding.
This also would fix #1370 👍