uproot5 icon indicating copy to clipboard operation
uproot5 copied to clipboard

feat: Implement read coalescing algorithm

Open nsmith- opened this issue 1 year ago • 2 comments

Tuning to be investigated

nsmith- avatar Apr 17 '24 05:04 nsmith-

The removal of the stateful handle is motivated by https://github.com/scikit-hep/uproot5/issues/1157#issuecomment-1981832282

nsmith- avatar Apr 17 '24 05:04 nsmith-

Only in python 3.11+ do we see

tests/test_0692_fsspec_reading.py::test_open_fsspec_s3[FSSpecSource] FAILED [ 70%]
...
E               ValueError: not a ROOT file: first four bytes are b'\xec_wu'
E               in file s3://pivarski-princeton/pythia_ppZee_run17emb.picoDst.root

looking into it...

edit: because in other versions the test is skipped:

tests/test_0692_fsspec_reading.py::test_open_fsspec_s3[FSSpecSource] SKIPPED [ 70%]
tests/test_0692_fsspec_reading.py::test_open_fsspec_s3[S3Source] SKIPPED [ 70%]
tests/test_0692_fsspec_reading.py::test_open_fsspec_s3[None] SKIPPED     [ 70%]

disabled in versions less than 3.11 in https://github.com/scikit-hep/uproot5/pull/1012

nsmith- avatar Apr 22 '24 02:04 nsmith-

The problem is the signature of fs.cat_file. For the file

import fsspec

fs, path = fsspec.url_to_fs("s3://pivarski-princeton/pythia_ppZee_run17emb.picoDst.root", anon=True)

If I create a file-object handle and read the first 4 bytes, I get:

fh = fs.open(path)
fh.seek(0)
assert fh.read(4) == b"root"

if I instead use cat_file, I find:

data = fs.cat_file(path, 0, 4)
assert len(data) == 4  # fails, len(data) is 978299156
assert data == b"root" # also fails, data[:4] is b'\x00\x00\xf1l'

So there is some mismatch in the behavior that I was initially puzzled by. Then I realized that, depending on the implementation, fs.cat_file can have different signatures and the s3fs one has a different order of keyword arguments. I made a UX issue in https://github.com/fsspec/filesystem_spec/issues/1600 and used keywords here.

nsmith- avatar May 06 '24 23:05 nsmith-

The defaults are in spirit similar to the CMSSW defaults, though the use case is certainly different. I was hoping to gain some experience in the wild how different these are.

As for implementing it in pre-fsspec sources, indeed that is an option, and one I think may be useful. For example, even in the case of vector reads (which fsspec is now doing if the protocol supports, through the cat_ranges interface) there may be value in issuing separate reasonably-sized vector read requests rather than one large one.

nsmith- avatar May 07 '24 18:05 nsmith-

It's Thursday; I think we should go ahead with this PR. I'll update to main and then merge.

jpivarski avatar May 10 '24 00:05 jpivarski