Make the `ByteRangeRequest` more literate
Continuing a conversation from #1661
We have IO operations that can fetch ranges of bytes. That byte range is parameterized as an (optional) 2-tuple of nullable ints, i.e. `tuple[int | None, int | None] | None`. The tuple part is ambiguous between two similar specifications of a range of integers -- (start, length) and (start, stop); this ambiguity is bad.
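As a quick illustration (not code from zarr-python; `buffer` is just a stand-in bytes object), the same tuple supports both readings:

```python
# Purely illustrative: the same 2-tuple read under the two competing conventions.
buffer = bytes(range(256))
byte_range = (10, 100)

# Reading 1: (start, stop) -- bytes 10 through 99, i.e. 90 bytes
as_start_stop = buffer[byte_range[0] : byte_range[1]]

# Reading 2: (start, length) -- bytes 10 through 109, i.e. 100 bytes
as_start_length = buffer[byte_range[0] : byte_range[0] + byte_range[1]]

assert len(as_start_stop) == 90
assert len(as_start_length) == 100
```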
We should rework this type so that it's harder to confuse these two possibilities. One option, suggested by @kylebarron, would be a dataclass, e.g.:
```python
import dataclasses

@dataclasses.dataclass
class ByteRangeRequest:
    start: int | None = 0
    stop: int | None = None
```
It would also be nice if there were an instance of this class (e.g., the default) that conveyed "fetch the whole thing", and then `byte_range` wouldn't need to be optional in all the store methods. I'm also not sure we get much from `None` in this type, since there are integers that can express "the end" of a range, e.g. -1. Curious to hear other ideas here.
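To make the "fetch the whole thing" idea concrete, a frozen dataclass with a shared default instance could look roughly like this (a sketch only; the field names and defaults are my assumptions, not a settled design):

```python
from dataclasses import dataclass

# Hypothetical sketch: a default instance that means "fetch the whole object".
@dataclass(frozen=True)
class ByteRangeRequest:
    start: int = 0
    stop: int | None = None  # None = end of the object

FULL_RANGE = ByteRangeRequest()

# With a shared default, store methods no longer need `byte_range: ... | None`.
def read(data: bytes, byte_range: ByteRangeRequest = FULL_RANGE) -> bytes:
    return data[byte_range.start : byte_range.stop]

assert read(b"abcdef") == b"abcdef"
assert read(b"abcdef", ByteRangeRequest(1, 3)) == b"bc"
```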
This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.
> This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

Today the elements of `ByteRangeRequest` can be negative, so I think `(-1, -100)` would express "read the last 100 bytes". Maybe @normanrz can correct me if this is wrong.
@paraseba Suffix requests are already used (https://github.com/zarr-developers/zarr-python/pull/1661#issuecomment-2435452928); we just want to ensure the type is well defined so that the semantics are clear.
> Most (all?) object stores, for example, allow that type of query.

All except Azure.
> Today the elements of `ByteRangeRequest` can be negative, so I think `(-1, -100)` would express "read the last 100 bytes".

From here it's `(-100, None)` that would express the last 100 bytes:
https://github.com/zarr-developers/zarr-python/blob/bc588a760a804f783c4242d4435863a43a5f3f9f/src/zarr/codecs/sharding.py#L700-L702
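In plain slicing terms, that convention looks roughly like this (an illustrative sketch, not the sharding code itself):

```python
# Illustrative only: how a (-100, None) byte range maps onto a suffix read.
data = bytes(range(256))

start, stop = (-100, None)
suffix = data[start:stop]  # equivalent to data[-100:]

assert len(suffix) == 100
assert suffix == data[-100:]
```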
Yes, it is `(-100, None)`.
That is great; then making this type clearer is even more important.
So, the dataclass would be like this?
```python
from dataclasses import dataclass

@dataclass
class ByteRangeRequest:
    start: int          # can be negative for indexing from the back
    length: int | None  # None = read to the end
```
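For illustration, resolving such a request against a known object size might look like this (a hypothetical helper; the dataclass is repeated so the snippet stands alone):

```python
from dataclasses import dataclass

@dataclass
class ByteRangeRequest:
    start: int          # can be negative for indexing from the back
    length: int | None  # None = read to the end

# Hypothetical helper: turn the request into concrete (start, stop) offsets.
def resolve(req: ByteRangeRequest, object_size: int) -> tuple[int, int]:
    start = req.start if req.start >= 0 else object_size + req.start
    stop = object_size if req.length is None else start + req.length
    return start, stop

assert resolve(ByteRangeRequest(start=-100, length=None), object_size=1000) == (900, 1000)
assert resolve(ByteRangeRequest(start=0, length=10), object_size=1000) == (0, 10)
```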
No, it would be more like a union:
```python
from typing import TypedDict, Union

class RangeRequest(TypedDict):
    """A request with a start and end in the file."""

    start: int
    """The first byte of the range"""
    end: int
    """The end of the byte range (exclusive)"""


class SuffixRequest(TypedDict):
    """A range that fetches the last N bytes from a file."""

    size: int
    """The number of suffix bytes to fetch"""


class OffsetRequest(TypedDict):
    """A request for the first N bytes from a file."""

    size: int
    """The number of bytes to fetch"""


ByteRangeRequest = Union[RangeRequest, SuffixRequest, OffsetRequest]
```
I think here the semantics are very clear, even if the union makes it slightly verbose. And here I used TypedDicts, but that wouldn't fully work as shown because there's no way to distinguish between the suffix and offset requests (if they both were to use the "size" dict key), but that's easily fixed.
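One easy fix, for example (a sketch; the key names here are my own, not necessarily what zarr or obstore settled on), is to give each variant a distinct key so the shapes can be told apart at runtime:

```python
from typing import TypedDict, Union

class RangeRequest(TypedDict):
    start: int
    end: int  # exclusive

class SuffixRequest(TypedDict):
    suffix: int  # number of trailing bytes to fetch

class OffsetRequest(TypedDict):
    offset: int  # number of leading bytes to fetch

ByteRangeRequest = Union[RangeRequest, SuffixRequest, OffsetRequest]

# The distinct keys make the variants distinguishable without extra tags.
def kind(req: ByteRangeRequest) -> str:
    if "suffix" in req:
        return "suffix"
    if "offset" in req:
        return "offset"
    return "range"

assert kind({"start": 0, "end": 100}) == "range"
assert kind({"suffix": 100}) == "suffix"
```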
In case you're curious, in https://github.com/developmentseed/obstore/pull/67 I updated obstore to allow either a tuple of two non-negative integers, or a dict with `offset` or `suffix` keys.
Dataclasses are heavier than tuples. Is there any significant performance consequence (memory and/or runtime) of making this change? What if we are fetching 100_000 chunks?
In obstore I'm still using a tuple of two numbers for the most common case of a range with a known, positive start and end. I think the semantics there are pretty clear (same as `range(start, end)`). It's only for suffix and offset requests that I'm requiring dict input for clarity.
```
In [1]: %timeit (100, 1000)
4.07 ns ± 0.429 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [3]: %timeit {"suffix": 100}
31.2 ns ± 0.2 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```
If you made 100,000 suffix requests, it would be about 2.7 ms slower from this: (31.2 - 4.07) ns is roughly 27 ns of overhead per request, and 27 ns × 100,000 ≈ 2.7 ms.
I figure that suffix requests make up a small part of normal usage anyway, and that this makes the API a lot clearer. But if people open issues in obstore in the future saying that this is a perf obstacle, I could add more supported inputs.
Should we close this following https://github.com/zarr-developers/zarr-python/pull/2585?