feat: fsspec source implementation
This is looking good. But how will an fsspec-capable Source be enabled? The other Sources are identified by URI scheme, but a URI with https:// or root:// could be handled by either FSSpecSource or the non-fsspec ones. (s3:// would be unambiguous.)
The backend that runs can't be determined by what packages are installed (i.e. attempt to import fsspec and then run different code, depending on whether it raises MissingModuleException) because that would make debugging user issues more complicated.
If it's opt-in (only used if FSSpecSource is explicitly selected in uproot.open), then there will be people who would have wanted to use it that never find out about it.
If it's something that Coffea or coffea-casa triggers when the time is right, that's fine.
One option is to simply make uproot depend on fsspec. Then fsspec is responsible for picking the right implementation for a given protocol. (we could provide an override option) Of course, this means that we have to make sure the fsspec http and xrootd implementations are as robust and performant as the current implementations. This is something @ScottDemarest was interested in looking into.
I know one potential (possibly premature) optimization is to notify downstream executors as soon as some portion of data is available. We can do this within the constraints of the fsspec public API by using async filesystem implementation with an event loop created and held as a resource by uproot (essentially copying the sync wrapper implementation that fsspec uses internally). If we were to do that, it would also be nice to drop the custom executor and futures implementation here and adopt the standard library one, now that python 2 is not supported.
If we want uproot to be "batteries included" we should add these to the regular install requirements if we adopt FSSpec as the standard source.
I get pulled in both directions about whether to include batteries or not.
On the one hand,
- "Easy install" was the top reason that people gave for why they use Uproot in the first place. Adding dependencies jeopardizes that, though the case against fsspec is overstated. fsspec is a very easy dependency, and it's pure Python and everything.
- @agoose77 is WASM-building Awkward for use in Thebe-powered tutorials. We need to keep dependencies minimal for the web (mostly pure Python and not too large to download). Again, using this as a case against fsspec, even if we decide to do a similar in-browser tutorial for Uproot (haven't talked about it), would be overstating it. fsspec is small (141 kB).
On the other hand,
- It's annoying to install a package, try to use it, and immediately be told that you have to install another package. It's even worse if you aren't using the package interactively (or directly: it's a hidden dependency of the one you are using). It could even be a multi-pass cycle of install, try to use, install more, try to use again, install yet more.
- The Python world has well-established dependency resolution tools, but optional dependencies require us to effectively reimplement that, including minimum versions.
- Some of these dependencies implement rather core features. In particular, I'm recently swayed by the fact that we're now saying Awkward's primary on-disk format is Parquet, but to use Parquet you have to install pyarrow (and fsspec). I'm only being held back on that one by the WASM target: pyarrow is huge and a monster to compile.
Currently, the PyPI versions of Uproot and Awkward Array have minimal dependencies, while the conda-forge recipes are batteries included. We've also considered
pip install uproot[complete] # and
pip install awkward[complete]
but then the [complete] labels would basically be synonyms of [dev] and [test]. I don't know what should be in one label and not the others (except maybe [test] should include pytest, the others not).
Enlarging the scope of this, and perhaps answering my previous question: how about if fsspec-enabled remote access replaces the current HTTPSource and XRootDSource?
The HTTPSource was written using only the Python standard library in an extremely dependency-averse mood. It should have been based on requests. (I wasn't familiar with how basic that is to the ecosystem, at least on par with NumPy.) I think we would have had fewer issues with HTTP redirects if we had gone this route from the start.
I have had so much help from user-developers fixing up the XRootDSource, which I'm grateful for, but it indicates how tricky it is. By now, there's a lot of code in XRootDSource that I don't understand. If that was used as a model to develop xrootd-fsspec, then maybe we should discontinue this implementation in favor of that one.
The story for users would be simpler: "If you want to access remote files (any non-file URI), you need fsspec (or install uproot[complete])." And then that comes with its optional dependencies, like s3fs. Then Uproot is out of the business of remote access and relies on the same remote access libraries as everybody else.
We could make this part of the Uproot 4 → 5 switch. What do you think?
(Our comments crossed. It sounds like we have similar thoughts on this.)
The story for users would be simpler: "If you want to access remote files (any non-file URI), you need fsspec (or install
uproot[complete])."
One option is to simply make uproot depend on fsspec.
Yes, Uproot could strictly depend on fsspec, since that's pure Python and small. I don't know if this slippery slope extends to lz4 and xxhash (not pure Python, but small), since LZ4 is a not-uncommon ROOT compression setting. Or ZStandard. Maybe it does, since these are things a user might not even know they need: they just open a TTree and it fails. Dependencies for extra user-visible features like hist and Dask are different, so the slippery slope does stop at some point.
I do sometimes wonder if "easy to install" means simply that pip worked, or that the user was happy to see that pip did not bring in more than one package. But that's a bit beside the point
It should be noted that fsspec is lightweight because it is very much not batteries-included. For example, in a fresh environment, if you try to pip install fsspec and then run
with fsspec.open("http://google.com") as f:
print(f.read(10))
you will get an error:
ImportError: HTTPFileSystem requires "requests" and "aiohttp" to be installed
and if you go install those, then you end up with 12 packages!
Installing collected packages: urllib3, multidict, idna, frozenlist, charset-normalizer, certifi, attrs, async-timeout, yarl, requests, aiosignal, aiohttp
Successfully installed aiohttp-3.8.1 aiosignal-1.2.0 async-timeout-4.0.2 attrs-22.1.0 certifi-2022.6.15 charset-normalizer-2.1.1 frozenlist-1.3.1 idna-3.3 multidict-6.0.2 requests-2.28.1 urllib3-1.26.12 yarl-1.8.1
The same goes for fsspec-xrootd, but doubly-so: the first error will ask you to install fsspec-xrootd, then immediately afterwards it will again ask you to figure out how to install xrootd (because it still is not pip installable). In that spirit, I would say adding fsspec and fsspec-xrootd to the explicit dependencies of uproot is OK because they are very lightweight and themselves will ask the user to install further packages.
I do sometimes wonder if "easy to install" means simply that pip worked, or that the user was happy to see that pip did not bring in more than one package.
I think that's true, so having a tree of transitive dependencies is not off the table. (Earlier, I had been considering cases of a hard-to-access DAQ-or-whatever, without outside network/pip access, but not anymore. There are freezing options.)
It is necessary for all of those transitive dependencies to successfully install, however. Some cases of lz4 installation have broken because they assume liblz4.so is available on the OS somewhere. I don't know if that's still true.
Having Uproot strictly depend on fsspec, and then fsspec asks the user to install requests and aiohttp, would still be an improvement over not having Uproot strictly depend on fsspec, since it would only be one round of telling the user to go install something else, rather than two rounds. I can imagine that getting very annoying.
Edit: Got it: you said all of that, but include both fsspec and fsspec-xrootd, since they're at the same level.
I'm personally +1 on making fsspec a dependency:
- IO is very important in uproot
- fsspec is lightweight RE dependencies
- fsspec is near "standard" for IO in Python ecosystem
- we can drop code :partying_face:
Hello @jpivarski, allow me a quick comment - these are important matters and something that IMO you should raise more widely, nice with org admins and most usefully with users via Gitter/... I would be a bit scared myself if you were going to by default, meaning for standard releases, to increase significantly the dependencies. Indeed, as you said, the success of uproot has been "small and easy". I won't go into big thoughts here, so making a single one at this point: ask yourself the percentage of (the very large number of) uproot users who actually need ffspec ... or have even just heard of it! Adding something that brings in effectively an order magnitude more dependencies for a permil user base seems awkward.
To be clear, adding fsspec as a dependency will bring just 1 new package in by default (namely, fsspec) What one then needs to do is bring in additional packages depending on their remote IO needs, all in an opt-in fashion. But this is not really so much about users wanting fsspec: this is a library that makes uproot development easier by abstracting storage. With this dependency, it would be much easier to, for example, implement the much-desired glob feature, or to add headers (token auth!) to http IO, etc. And we would not be rolling our own http, as the standard library is not nearly sufficient to handle everything, case in point #121 #176 etc.
Of course we still should solicit a lot more feedback on this. @chrisburr in particular may have thoughts
Remaining test failures seem unrelated?
FAILED tests/test_0652_dask-for-awkward.py::test_dask_concatenation - TypeError: could not convert data into an ak.Array
FAILED tests/test_0652_dask-for-awkward.py::test_delay_open - TypeError: could not convert data into an ak.Array
@kkothari2001, are the failing Dask tests ones that require a particular version of dask-awkward, and should therefore be protected by pytest.importorskip or maybe pytest.mark.skipif(dask_awkward.version)?
I had a look, it is breaking due to either changes made in dask-awkward code itself, or from updating to awkward v1.9.0rc5 to v1.9.0rc10 in commit 7b5ca3 in dask-awkward.
On my computer, I tried backporting to dask-awkward version 2022.7a0 it passes, ig for now we can work with that, then figure the reason for the error.
On second thought, there is known to be some discrepancy between my personal dev setup and the PR tests (because I've been installing the latest dask-awkward/main) let's just skip the tests with now with pytest.mark.skip and I'll resolve this issue in my currently open PR #679.
The failing test can be fixed by merging in https://github.com/scikit-hep/uproot5/pull/694.
We have not published fsspec for python 3.6. Should we do this or will uproot soon drop support for 3.6?
Uproot will drop support for Python 3.6: #687. Only inertia has kept this from happening so far: if it's the slightest bit easier for the fsspec source, that would definitely push it over the edge.
Then we're getting on a schedule of dropping Python versions when they're end-of-lifed (can't find the GitHub reference where that was stated, but it's the plan).
@nsmith-, I'm marking this PR as "inactive." I would like to follow up on it someday, though it may get to the point where you'd want to re-apply these changes to a new PR, branched off of main.
Closed in favor of #967