
Allow slicing in stream(), events(), and table()

Open danielballan opened this issue 8 years ago • 6 comments

As in, you should be able to ask for all rows starting with row N so that you don't need to load and throw away the first N-1 rows.
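For illustration, the workaround today is to consume and discard the leading events client-side, which still loads them. A minimal sketch (`events()` here is a stand-in generator, not the real databroker method):

```python
from itertools import islice

def events():
    """Stand-in for a document generator like databroker's events()."""
    for i in range(10):
        yield {"seq_num": i + 1, "data": {"x": i * 2}}

# Today: skip the first N events client-side; they are still produced/loaded.
N = 7
wanted = list(islice(events(), N, None))
assert [e["seq_num"] for e in wanted] == [8, 9, 10]
```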

danielballan avatar Aug 02 '17 19:08 danielballan

There is a way forward on this -- likely adding a parameter that accepts a slice object -- but no time to implement it properly before 0.9.0. Punting to 0.10.0.

danielballan avatar Aug 11 '17 15:08 danielballan

Hasty notes on a design conversation, will expand this into a DBEP and/or a PR later.

Slicing and Chunking

  • data(), table(), and events() should each pick up slice and chunks (or should we call it chunk_size?) parameters.
  • table() will return one DataFrame if chunks=None (its current behavior). It will return a generator of DataFrames of length chunks if chunks is an integer.
  • events() will return a generator of dictionaries (its current behavior) if chunks=None, but a generator of DataFrames of length chunks if chunks is an integer. Thus, in the case where chunks is not None, events() behaves the same as table().
  • data() will return a generator of things that have length chunks over the first axis. There will be some logic that figures out which type of thing this should be: numpy and dask arrays will be automatically stacked for convenience. The general case might be a simple list.
  • The slice parameter will accept a slice object or a list of positions. (If chunks > 1 these will be interpreted as starting positions for the chunk. Requests might result in overlapping chunks; that's fine.)
  • For now we are only addressing 1D slicing along the Event axis. Slicing along arbitrary data axes is on our roadmap but out of scope for this particular work.
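A rough sketch of the proposed `table()` semantics under these bullets. All names are assumptions: the `slice` parameter is spelled `slc` here only to avoid shadowing the builtin, and `rows` stands in for the documents behind a Header:

```python
import pandas as pd

def table(rows, chunks=None, slc=slice(None)):
    """Hypothetical sketch: chunks=None -> one DataFrame (current behavior);
    chunks=int -> a generator of DataFrames of length ``chunks``."""
    # Apply the 1D slice along the Event axis before chunking.
    df = pd.DataFrame(rows)[slc].reset_index(drop=True)
    if chunks is None:
        return df
    return (df.iloc[i:i + chunks] for i in range(0, len(df), chunks))

rows = [{"x": i} for i in range(10)]
whole = table(rows)                                 # one DataFrame
parts = list(table(rows, chunks=4, slc=slice(2, None)))  # chunked generator
assert len(whole) == 10
assert len(parts) == 2
assert parts[0]["x"].tolist() == [2, 3, 4, 5]
```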

API Cleanup

  • db.get_* methods should be deprecated, vastly reducing the surface area of the Broker API.
  • Headers never need to call back to their header source of origin, only event sources. Therefore, db should give Header only its event sources, not itself. (So, remove Header.db.)
  • The configuration file should list the handler registries per event source, so that an event source can know exactly what its specific handlers are.
  • Resource and Datum are considered part of an Event Source. (Why? You can't interpret a Datum without a Descriptor and a Resource.)
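To make the second bullet concrete, here is a toy sketch of a Header that is handed only its event sources at construction time, with no `Header.db` back-reference. The class shapes and method names (`docs_for`, `DummyEventSource`) are hypothetical:

```python
class Header:
    """Hypothetical sketch: holds event sources, not the Broker itself."""
    def __init__(self, start, stop, event_sources):
        self.start = start
        self.stop = stop
        self._event_sources = list(event_sources)

    def events(self):
        # After construction, documents come only from event sources;
        # the header source is never consulted again.
        for source in self._event_sources:
            yield from source.docs_for(self.start["uid"])

class DummyEventSource:
    def __init__(self, docs):
        self._docs = docs
    def docs_for(self, uid):
        return iter(self._docs)

h = Header({"uid": "abc"}, {"uid": "abc-stop"},
           [DummyEventSource([{"seq_num": 1}, {"seq_num": 2}])])
assert [e["seq_num"] for e in h.events()] == [1, 2]
```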

Better handler configuration

  • The fetching methods (events, table, data, documents, ???) should accept a handlers argument with a custom handler registry.

  • Usages like this allow the user to explicitly destroy a registry (and the potentially heavy cache therein):

    with handler_registry as r:
        header.data('thing', handlers=r)
    
  • If the handlers parameter is not given, the Header will fall back to the Broker's "default" handler registry with an LRU cache. (Does this mean Headers need a reference to db after all?)
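A sketch of what a registry supporting the context-manager usage above might look like. Everything here (`HandlerRegistry`, `register`, `get`) is a hypothetical shape, not databroker's actual API:

```python
class HandlerRegistry:
    """Hypothetical sketch: maps spec -> handler class, caches constructed
    handlers, and destroys the (potentially heavy) cache on __exit__."""
    def __init__(self):
        self._handlers = {}
        self._cache = {}

    def register(self, spec, handler_factory):
        self._handlers[spec] = handler_factory

    def get(self, spec, resource_path):
        key = (spec, resource_path)
        if key not in self._cache:
            self._cache[key] = self._handlers[spec](resource_path)
        return self._cache[key]

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._cache.clear()  # explicitly drop the cache
        return False

reg = HandlerRegistry()
reg.register("NPY", lambda path: f"loaded:{path}")  # toy "handler"
with reg as r:
    value = r.get("NPY", "/tmp/a.npy")
assert value == "loaded:/tmp/a.npy"
assert reg._cache == {}  # cache destroyed on exit
```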

danielballan avatar May 25 '18 21:05 danielballan

Does this mean that headers will no longer know where they came from?

CJ-Wright avatar May 27 '18 21:05 CJ-Wright

As we have now factored the databroker into a notion of a "header source" and an "event source", the phrase "know where they came from" isn't precise enough to have a yes or no answer. Headers may hold a reference to their event sources but (perhaps) not to their header source.

danielballan avatar May 30 '18 15:05 danielballan

I think a better phrasing would be "headers know how to get their children"

tacaswell avatar May 30 '18 15:05 tacaswell

Relying on xarray+dask gets us part of the way there: cheaply get a lazy "shell" (an xarray backed by dask arrays), slice on it in potentially complex ways, and then load only the relevant chunks.

I think we still need support for pulling a slice of document stream, though, so this issue remains open. Adding an index to Event and Datum will help.
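To show the "lazy shell" idea without pulling in xarray or dask, here is a toy dependency-free stand-in that records which chunks are actually read when a slice is requested:

```python
class LazyStack:
    """Toy stand-in for an xarray-of-dask "shell": slicing is cheap, and
    only the chunks covering the requested rows get loaded."""
    def __init__(self, n_rows, chunk_size):
        self.n_rows = n_rows
        self.chunk_size = chunk_size
        self.loaded_chunks = []  # record which chunks were actually read

    def _load_chunk(self, i):
        self.loaded_chunks.append(i)
        start = i * self.chunk_size
        return list(range(start, min(start + self.chunk_size, self.n_rows)))

    def read(self, slc):
        start, stop, _ = slc.indices(self.n_rows)
        first, last = start // self.chunk_size, (stop - 1) // self.chunk_size
        rows = []
        for i in range(first, last + 1):
            rows.extend(self._load_chunk(i))
        offset = first * self.chunk_size
        return rows[start - offset:stop - offset]

stack = LazyStack(n_rows=100, chunk_size=10)
rows = stack.read(slice(25, 35))
assert rows == list(range(25, 35))
assert stack.loaded_chunks == [2, 3]  # only the two relevant chunks touched
```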

danielballan avatar Mar 06 '20 20:03 danielballan