Support HTTP Range Requests in MNRead.get

Open robyngit opened this issue 2 years ago • 3 comments

Detect and handle HTTP range requests to enable clients to retrieve a portion of a file without the need to download the entire content. This feature would allow MetacatUI and other clients to preview data files before downloading them. It would also allow clients to resume downloads in the event of a network interruption.

  • @mbjones mentioned the possibility of implementing this without making changes to the DataONE API: A range request could be made via HTTP headers, leaving the request body unchanged and having Metacat only handle the range request headers.
  • It would be the client's responsibility to generate the range request, for example:
curl -H "Range: bytes=0-1000" https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A24b85258-3e86-40cb-accc-28153513dea8
  • The feature could be a non-mandatory enhancement, such that the existing behavior remains consistent for repositories not making use of range requests.
  • Apache Tomcat and the Servlet API might provide built-in support for HTTP range requests.
  • A discussion is needed on how this feature interacts with event metrics:
    • Is a range request categorized as a download or a view/read?
    • Does this require a new event type, e.g. "partial read", "preview"?

Note: https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object/urn%3Auuid%3A24b85258-3e86-40cb-accc-28153513dea8 gives a 100,000 line CSV file that could be useful for testing
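For discussion, here is a minimal sketch of the header handling described above, written in Python for brevity rather than as Metacat code. The names `parse_byte_range`, `content_range`, and `RangeNotSatisfiable` are hypothetical. It assumes only a single `bytes=` range is supported: anything else is ignored so the caller falls back to the existing full `200` response, and an out-of-bounds range would map to `416`.

```python
import re

class RangeNotSatisfiable(Exception):
    """Hypothetical error; a real handler would answer 416 Range Not Satisfiable."""

# Accepts one range only: "bytes=first-last", "bytes=first-", or "bytes=-suffix".
_SINGLE_RANGE = re.compile(r"^bytes=([0-9]*)-([0-9]*)$")

def parse_byte_range(header, size):
    """Return (start, end), inclusive, for an object of `size` bytes.

    Returns None when the header is absent, malformed, or multi-range,
    meaning "ignore it and serve the whole object with 200".
    """
    if not header:
        return None
    match = _SINGLE_RANGE.match(header.strip())
    if match is None:
        return None
    first, last = match.groups()
    if first == "" and last == "":
        return None
    if first == "":
        # Suffix form: the final `last` bytes of the object.
        suffix = int(last)
        if suffix == 0 or size == 0:
            raise RangeNotSatisfiable(header)
        return (max(size - suffix, 0), size - 1)
    start = int(first)
    if start >= size:
        raise RangeNotSatisfiable(header)
    # An open-ended or oversized end is clamped to the last byte.
    end = size - 1 if last == "" else min(int(last), size - 1)
    if end < start:
        return None  # invalid spec such as "bytes=5-2": ignore the header
    return (start, end)

def content_range(start, end, size):
    """Value of the Content-Range header for a 206 response."""
    return f"bytes {start}-{end}/{size}"
```

Both ends of a range are inclusive, so `bytes=0-1000` in the curl example asks for 1001 bytes, not 1000.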

robyngit avatar Oct 02 '23 21:10 robyngit

@robyngit Two questions:

  1. Does this feature support only text data files (e.g., CSV)? What about Excel files?
  2. What are the units of the range? Lines or bytes or both?

taojing2002 avatar Oct 02 '23 22:10 taojing2002

@taojing2002 good questions. Range requests are byte-based requests, basically specifying a byte range to be requested. They are application-agnostic, and assume that the client knows what to do with the bytes. Tools like curl use range requests to allow resuming downloads if a network connection is interrupted. Data systems use range requests to retrieve chunks of data from inside a data file, but that is of course only useful if the data files are organized in such a way that contiguous byte ranges produce meaningful chunks. So, for text files, getting the first few KB is a good way to get a preview, but the client would need to be aware that the byte boundary is unlikely to correspond with the end-of-line delimiter used in that format. In contrast, netCDF, HDF5, and Zarr are binary formats that allow byte range requests that can get specific segments of data that correspond to specific scientifically meaningful chunks (e.g., a single image out of a time series, or a specific spatial window out of a larger extent). Hope that's all helpful.
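To illustrate the end-of-line caveat for text previews, here is a small Python sketch of what a client might do with the first chunk of a text file. The helper name `complete_lines` is hypothetical, and it assumes the range started at byte 0 and that the file uses `\n` (or `\r\n`) line endings.

```python
def complete_lines(chunk: bytes, delimiter: bytes = b"\n") -> bytes:
    """Drop the trailing partial line from a chunk that began at byte 0.

    A byte range usually ends mid-line, so everything after the last
    delimiter is discarded. Returns b"" if no full line was received.
    """
    cut = chunk.rfind(delimiter)
    if cut == -1:
        return b""
    return chunk[: cut + len(delimiter)]
```

For example, `complete_lines(b"a,b\n1,2\n3,")` keeps the header and the first row and drops the dangling `3,`. This is only line-aware, not CSV-aware: a quoted field containing a newline could still be cut in the middle of a record.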

mbjones avatar Oct 02 '23 23:10 mbjones

@mbjones Thanks! So I think we will use bytes as the range unit for all formats. The clients are responsible for parsing the bytes.
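A rough client-side sketch of that responsibility, in Python with only the standard library. The function names are hypothetical. It assumes that a server without range support simply ignores the `Range` header and answers `200` with the full object, so the client has to check the status code instead of assuming it received a partial body.

```python
import urllib.request

def build_range_request(url, first, last):
    """Build a GET request for bytes first..last (both inclusive)."""
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={first}-{last}")
    return request

def select_preview(status, body, first, last):
    """Pick the requested bytes out of a response body.

    206 means the server honored the range and the body is the range itself.
    200 means the header was ignored and the body starts at byte 0.
    """
    if status == 206:
        return body
    if status == 200:
        return body[first:last + 1]
    raise ValueError(f"unexpected status {status}")

def fetch_preview(url, first=0, last=1000):
    """Fetch a byte range, reading at most last + 1 bytes from the socket."""
    with urllib.request.urlopen(build_range_request(url, first, last)) as response:
        return select_preview(response.status, response.read(last + 1), first, last)
```

Capping the read at `last + 1` bytes means a `200` fallback does not pull the whole file into memory just to produce a preview.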

taojing2002 avatar Oct 03 '23 00:10 taojing2002