databroker
Allow slicing in stream(), events(), and table()
As in, you should be able to ask for all rows starting with row N so that you don't need to load and throw away the first N-1 rows.
There is a way forward on this -- likely adding a parameter that accepts a slice object -- but no time to implement it properly before 0.9.0. Punting to 0.10.0.
Hasty notes on a design conversation; I will expand this into a DBEP and/or a PR later.
Slicing and Chunking
- `data()`, `table()`, and `events()` should each pick up `slice` and `chunks` (or should we call it `chunk_size`?) parameters.
- `table()` will return one DataFrame if `chunks=None` (its current behavior). It will return a generator of DataFrames of length `chunks` if `chunks` is an integer.
- `events()` will return a generator of dictionaries (its current behavior) if `chunks=None`, but a DataFrame of length `chunks` if `chunks` is an integer. Thus, in the case where `chunks` is not `None`, `events()` is the same as `table()`.
- `data()` will return a generator of things that have length `chunks` over the first axis. There will be some logic that figures out which type of thing this should be: numpy and dask arrays will be automatically stacked for convenience. The general case might be a simple `list`.
- The `slice` parameter will accept a `slice` object or a list of positions. (If `chunks > 1` these will be interpreted as starting positions for the chunks. Requests might result in overlapping chunks; that's fine.)
- For now we are only addressing 1D slicing along the Event axis. Slicing along arbitrary data axes is on our roadmap but out of scope for this particular work.
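To make the proposed semantics concrete, here is a small sketch of how `slice` and `chunks` could interact. `iter_chunks` and its argument names are hypothetical stand-ins for illustration, not the real databroker API:

```python
def iter_chunks(rows, chunks, positions):
    """Hypothetical sketch of the proposed slice/chunks semantics.

    ``rows`` is a materialized sequence of events; ``chunks`` is the
    chunk length; ``positions`` plays the role of the proposed ``slice``
    parameter and may be a slice object or a list of positions.
    """
    rows = list(rows)
    if isinstance(positions, slice):
        # A slice selects a contiguous region, which is then chunked.
        selected = rows[positions]
        for i in range(0, len(selected), chunks):
            yield selected[i:i + chunks]
    else:
        # A list of positions: each is the starting index of a chunk.
        # Chunks may overlap; per the notes above, that's fine.
        for start in positions:
            yield rows[start:start + chunks]
```

For example, `list(iter_chunks(range(10), 3, [0, 2]))` yields the overlapping chunks `[0, 1, 2]` and `[2, 3, 4]`.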
API Cleanup
- `db.get_*` methods should be deprecated, vastly reducing the surface area of the `Broker` API.
- Headers never need to call back to their header source of origin, only their event sources. Therefore, `db` should give the Header only its event sources, not itself. (So, remove `Header.db`.)
- The configuration file should list the handler registries per event source, so that an event source can know exactly what its specific handlers are.
- Resource and Datum are considered part of an Event Source. (Why? You can't interpret a Datum without a Descriptor and a Resource.)
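A minimal sketch of the proposed ownership, with a Header holding only its event sources and no back-reference to the Broker. The `Header` and `EventSource` classes here are illustrations under that assumption, not the actual databroker implementations:

```python
class EventSource:
    """Hypothetical event source: owns Events, Descriptors, Resources, Datums."""

    def __init__(self, name, events):
        self.name = name
        self._events = events

    def events(self, header):
        # A real implementation would filter by the header's run;
        # here we return everything, purely for illustration.
        return iter(self._events)


class Header:
    """Sketch: a Header keeps only its event sources, never the Broker.

    This mirrors the proposed removal of ``Header.db``.
    """

    def __init__(self, start, event_sources):
        self.start = start
        self._event_sources = event_sources  # no back-reference to db

    def events(self):
        for source in self._event_sources:
            yield from source.events(self)
```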
Better handler configuration
- The fetching methods (`events`, `table`, `data`, `documents`, ???) should accept a `handlers` argument with a custom handler registry.
- Usages like this allow the user to explicitly destroy a registry (and the potentially heavy cache therein):

  ```python
  with handler_registry as r:
      header.data('thing', handlers=r)
  ```

- If the `handlers` parameter is not given, the Header will fall back to the Broker's "default" handler registry with an LRU cache. (Does this mean Headers need a reference to `db` after all?)
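One way to support that context-manager usage is a registry that clears its cache on exit. This is a sketch with invented names (`HandlerRegistry`, `get_handler`), not the real registry:

```python
class HandlerRegistry:
    """Hypothetical handler registry whose cache can be explicitly destroyed."""

    def __init__(self, handlers):
        self._handlers = dict(handlers)  # spec name -> handler class
        self._cache = {}                 # resource uid -> handler instance

    def get_handler(self, spec, resource_uid, *args, **kwargs):
        # Instantiate lazily and cache; handler instances can be heavy.
        if resource_uid not in self._cache:
            self._cache[resource_uid] = self._handlers[spec](*args, **kwargs)
        return self._cache[resource_uid]

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        # Explicitly destroy the potentially heavy cache.
        self._cache.clear()
        return False
```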
Does this mean that headers will no longer know where they came from?
As we have now factored the databroker into the notions of a "header source" and an "event source", the phrase "know where they came from" isn't precise enough to have a yes-or-no answer. Headers may hold a reference to their event sources but (perhaps) not to their header source.
I think a better phrasing would be "headers know how to get their children"
Relying on xarray+dask gets us part of the way there: cheaply get a lazy "shell" (an xarray of dask arrays), slice it in potentially complex ways, and then load only the relevant chunks.
I think we still need support for pulling a slice of document stream, though, so this issue remains open. Adding an index to Event and Datum will help.
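As a toy illustration of why lazy chunked storage gets us part of the way there, here is a dask-free sketch in which slicing touches only the chunks it needs; the class and the `loader` callable are hypothetical:

```python
class LazyChunkedStore:
    """Hypothetical lazy store: ``loader(i)`` fetches chunk ``i`` on demand.

    Nothing is read until a slice asks for it, so requesting rows N..M
    never loads the first N-1 rows.
    """

    def __init__(self, loader, n_chunks, chunk_size):
        self._loader = loader
        self._chunk_size = chunk_size
        self._length = n_chunks * chunk_size
        self.loaded = set()  # record which chunks were actually fetched

    def __getitem__(self, s):
        out = []
        for i in range(*s.indices(self._length)):
            chunk_index = i // self._chunk_size
            self.loaded.add(chunk_index)
            chunk = self._loader(chunk_index)
            out.append(chunk[i % self._chunk_size])
        return out
```

With 5 chunks of 4 rows each, asking for `store[10:14]` fetches only chunks 2 and 3; rows 0-9 are never loaded.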