zarr-python icon indicating copy to clipboard operation
zarr-python copied to clipboard

Make the `ByteRangeRequest` more literate

Open d-v-b opened this issue 1 year ago • 6 comments

Continuing a conversation from #1661

We have IO operations that can fetch ranges of bytes. That byte range parameter is parameterized by an (optional) 2-tuple of nullable ints, i.e. tuple[int | None, int | None] | None. The tuple part is ambiguous between two similar specifications of a range of integers -- (start, step) and (start, stop); this ambiguity is bad.

We should rework this type so that it's harder to be confused about these two possibilities. One option suggested by @kylebarron would be a dataclass, e.g. :

@dataclasses.dataclass
class ByteRangeRequest:
  start: int | None = 0
  step: int | None = None

It would also be nice if there was an instance of this class (e.g., the default) that conveyed "fetch the whole thing", and then byte_range wouldn't need to be optional in all the store methods. I'm also not sure we get much from None in this type, since there are integers that can express "the end" of a range, e.g. -1. Curious to hear other ideas here.

d-v-b avatar Oct 24 '24 14:10 d-v-b

This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

paraseba avatar Oct 24 '24 14:10 paraseba

This would be a good opportunity to think if we should also consider a type that allows doing things like "last N bytes" (without knowing the size beforehand). Most (all?) object stores, for example, allow that type of query.

Today the elements of ByteRangeRequest can be negative, so I think (-1, -100) would express "read the the last 100 bytes". Maybe @normanrz can correct me if this is wrong.

d-v-b avatar Oct 24 '24 14:10 d-v-b

@paraseba Suffix requests are already used (https://github.com/zarr-developers/zarr-python/pull/1661#issuecomment-2435452928), we just want to ensure the type is well defined so that semantics are clear.

Most (all?) object stores, for example, allow that type of query.

All except Azure.

Today the elements of ByteRangeRequest can be negative, so I think (-1, -100) would express "read the the last 100 bytes".

From here it's (-100, None) that would express the last 100 bytes: https://github.com/zarr-developers/zarr-python/blob/bc588a760a804f783c4242d4435863a43a5f3f9f/src/zarr/codecs/sharding.py#L700-L702

kylebarron avatar Oct 24 '24 14:10 kylebarron

Yes, it is (-100, None)

normanrz avatar Oct 24 '24 14:10 normanrz

That is great, then making this type more clear is even more important.

paraseba avatar Oct 24 '24 14:10 paraseba

So, the dataclass would be like this?

@dataclass
class ByteRangeRequest:
  start: int # can be negative for indexing from the back
  length: int | None # None = read to the end

normanrz avatar Oct 24 '24 14:10 normanrz

No it would be more like a union:

class RangeRequest(TypedDict):
	"""A request with a start and end in the file"
	start: int
	"""The first byte of the range"""

	end: int
	"""The end of the byte range (exclusive)"""

class SuffixRequest(TypedDict):
	"""A range that fetches the last N bytes from a file"""

	size: int
	"""The number of suffix bytes to fetch"""

class OffsetRequest(TypedDict):
	"""A request for the first N bytes from a file"""

	size: int
	"""The number of bytes to fetch"""

ByteRangeRequest = Union[RangeRequest, SuffixRequest, OffsetRequest]

I think here the semantics are very clear, even if the union makes it slightly verbose. And here I used typed dicts but that wouldn't fully work as shown because there's no way to decipher between the suffix and offset request (if they both were to use the "size" dict key), but that's easily fixed.

kylebarron avatar Oct 25 '24 01:10 kylebarron

In case you're curious, in https://github.com/developmentseed/obstore/pull/67 I updated obstore to allow either a tuple of two non-negative integers, or a dict with offset or suffix keys: image

kylebarron avatar Nov 01 '24 20:11 kylebarron

Dataclasses are heavier than tuples. Is there any significant performance consequence (memory and / or runtime) of making this change? What if we are fetching 100_000 chunks?

rabernat avatar Nov 01 '24 22:11 rabernat

In obstore I'm still using a tuple of two numbers for the most common case of a range with a known, positive start and end. I think the semantics there are pretty clear (same as range(start, end)). It's only for suffix and offset requests that I'm requiring dict input for clarity.

In [1]: %timeit (100, 1000)
4.07 ns ± 0.429 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

In [3]: %timeit {"suffix": 100}
31.2 ns ± 0.2 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

If you made 100,000 suffix requests, it would be 2.7ms slower from this.

I figure that suffix requests make up a small part of normal usage anyways, and that this makes the API a lot clearer. But if people make issues in obstore in the future saying that this is a perf obstacle I could add more supported inputs.

kylebarron avatar Nov 02 '24 00:11 kylebarron

Should we close this following https://github.com/zarr-developers/zarr-python/pull/2585?

jhamman avatar Jan 10 '25 05:01 jhamman