feat: Implement read coalescing algorithm
Tuning to be investigated
The removal of the stateful handle is motivated by https://github.com/scikit-hep/uproot5/issues/1157#issuecomment-1981832282
Only in python 3.11+ do we see
tests/test_0692_fsspec_reading.py::test_open_fsspec_s3[FSSpecSource] FAILED [ 70%]
...
E ValueError: not a ROOT file: first four bytes are b'\xec_wu'
E in file s3://pivarski-princeton/pythia_ppZee_run17emb.picoDst.root
looking into it...
edit: because in other versions the test is skipped:
tests/test_0692_fsspec_reading.py::test_open_fsspec_s3[FSSpecSource] SKIPPED [ 70%]
tests/test_0692_fsspec_reading.py::test_open_fsspec_s3[S3Source] SKIPPED [ 70%]
tests/test_0692_fsspec_reading.py::test_open_fsspec_s3[None] SKIPPED [ 70%]
disabled in versions less than 3.11 in https://github.com/scikit-hep/uproot5/pull/1012
The problem is the signature of fs.cat_file. For the file
import fsspec
fs, path = fsspec.url_to_fs("s3://pivarski-princeton/pythia_ppZee_run17emb.picoDst.root", anon=True)
If I create a file-object handle and read the first 4 bytes, I get:
fh = fs.open(path)
fh.seek(0)
assert fh.read(4) == b"root"
if I instead use cat_file, I find:
data = fs.cat_file(path, 0, 4)
assert len(data) == 4 # fails, len(data) is 978299156
assert data == b"root" # also fails, data[:4] is b'\x00\x00\xf1l'
So there is some mismatch in the behavior that I was initially puzzled by. Then I realized that, depending on the implementation, fs.cat_file can have different signatures and the s3fs one has a different order of keyword arguments. I made a UX issue in https://github.com/fsspec/filesystem_spec/issues/1600 and used keywords here.
The defaults are in spirit similar to the CMSSW defaults, though the use case is certainly different. I was hoping to gain some experience in the wild how different these are.
As for implementing it in pre-fsspec sources, indeed that is an option, and one I think may be useful. For example, even in the case of vector reads (which fsspec is now doing if the protocol supports, through the cat_ranges interface) there may be value in issuing separate reasonably-sized vector read requests rather than one large one.
It's Thursday; I think we should go ahead with this PR. I'll update to main and then merge.