Proposal: epoch / randomness tracking or auditing, and callback API
- Sometimes it's useful to know when an "epoch" has been completed. Certain libraries (*cough*keras*cough*) have this notion explicitly.
- But also, it might be useful when using a Mux to know how often certain streams have been sampled after the fact
I suspect that some sort of callback or query on the Mux object is the right solution here, but I figured it would be best to initiate a discussion first.
> I suspect that some sort of callback or query on the Mux object is the right solution here
I like this idea in the abstract. I'm not sure how exactly it should look though -- a full keras-style callback infrastructure seems overkill, since we don't really have a top-level controller to trigger events.
Thinking a little more about it, this seems like two issues to me.
- Book-keeping in mux. We should definitely add this, and provide accessor methods to report statistics (# samples drawn, # times active, etc.)
- Epoch callbacks. I think this might be best implemented as a separate controller class that you can stick in front of any iterable/streamer, and which triggers a callback after every `n` steps. Something like:
def my_callback(step):
    # do some stuff
    ...

epoch = EpochCallback(n=1000, callbacks=[my_callback])
for item in epoch(my_streamer):
    # do some other stuff
    ...
Then it's just a matter of specifying the interface for callback functions.
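For concreteness, here's a rough sketch of what such a controller could look like. The class name, the `step` argument passed to callbacks, and the wrapper protocol are all assumptions for illustration, not existing pescador API:

```python
class EpochCallback:
    """Wrap any iterable and fire callbacks every `n` items."""

    def __init__(self, n, callbacks=None):
        self.n = n
        self.callbacks = callbacks or []

    def __call__(self, iterable):
        for step, item in enumerate(iterable, 1):
            yield item
            if step % self.n == 0:
                # One "epoch" has elapsed; notify all listeners.
                for cb in self.callbacks:
                    cb(step)


def my_callback(step):
    print("epoch boundary at step", step)

epoch = EpochCallback(n=1000, callbacks=[my_callback])
# for item in epoch(my_streamer): ...
```

The generator wrapper keeps the controller agnostic to what it wraps (a Streamer, a Mux, or a plain list), which is the main appeal of doing this outside the Mux itself.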
Thoughts from the peanut gallery - is this still in for 1.1.0, or is this really a 2.0 feature? #92 is labelled as 2.0. Trying to clean up what really needs to be done for 1.1 so I can prioritize.
I'd say ... if the right design is a callback, then 2.0; if it's setting some counters in the object, then maybe 1.1. maybe. thoughts?
I think the callback is probably better / more future-proof.
on the auditing front, I was just wrestling with my design of "how" I wanted to sample some data for training, and decided I wanted to log my samplers so I could parse things out later and check my statistics. The important parts look like this..
but first! this is entirely proof-of-concept, though I am curious to subsequently discuss better designs, impact on efficiency, etc.
When I'm writing research packages, I like to have my data stream machinery defined in the same submodule. After my import preamble (which includes logging), I set a global stream_logger for all "samplers" (the generator that produces observations from a bag of observations) and a method to create file handlers later:
stream_logger = logging.getLogger("stream_logger")
# Nothing crazy here.
def init_stream_logging(log_file):
hdlr = logging.FileHandler(log_file)
formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
hdlr.setFormatter(formatter)
stream_logger.addHandler(hdlr)
stream_logger.setLevel(logging.INFO)
Then, say I have a sampler that plucks observations from an NPZ file ... I add a logging statement on index selection, before yielding data:
def my_sampler(feature_dir, key):
with np.load(os.path.join(feature_dir, "{}.npz".format(key))) as data:
X = data['x_in']
Y = data['y_true']
N = len(X)
while True:
n = np.random.randint(0, N)
stream_logger.info(json.dumps(dict(key=key, n=n, y_true=int(Y[n]))))  # cast: numpy scalars aren't JSON-serializable
yield dict(x_in=X[n], y_true=Y[n])
Then, for completeness, I'll init a file handler and mux a stream. Assume I've got a bunch of files that are like '/path/to/features/a.npz', etc...
init_stream_logging("samples.log")
stream = pescador.Mux(
[pescador.Streamer(my_sampler, "/path/to/features", key) for key in 'abcdefg'],
k=5, rate=10, revive=True, with_replacement=False, prune_empty_streams=True)
list(stream.iterate(max_iter=1000))
Now we can go pop open that log file and look at some stats!
from collections import Counter
samples = [json.loads(l.strip().split("INFO ")[-1]) for l in open("samples.log")]
Counter([x['y_true'] for x in samples])
# Produces something like...
Counter({0: 110, 1: 99, 2: 101, 3: 92, 4: 100, 5: 101, 6: 111, 7: 92, 8: 94, 9: 100})
Counter([x['key'] for x in samples])
# Produces something like...
Counter({'a': 166, 'b': 127, 'c': 154, 'd': 136, 'e': 152, 'f': 122, 'g': 143})
or whatever.
I haven't had much time to mull this over, so I'm sure I'll have more ideas / opinions later, but thought this was worth sharing given the discussion in #104 (assuming @stefan-balke, @cjacoby may care). In particular, though, I'm somewhat worried about the time this would lose to JSON serialization (there will be so many samples...), and the log-parsing after the fact is quite gross. I'm not sure whether logging each sample atomically, or building a cache, would be worth background threading, but ... I'm guessing.
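On the serialization / background-threading worry: one stdlib option is `logging.handlers.QueueHandler` + `QueueListener`, which moves formatting and file IO onto a separate thread so the sampler's hot loop only pays for an enqueue. A minimal sketch, where the file name, logger name, and format string just mirror the `init_stream_logging` setup above:

```python
import json
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded; each record is a small dict
stream_logger = logging.getLogger("stream_logger")
stream_logger.setLevel(logging.INFO)
# Samplers log to the queue; enqueueing is cheap.
stream_logger.addHandler(logging.handlers.QueueHandler(log_queue))

# The listener owns the (slow) file handler and runs in its own thread.
file_handler = logging.FileHandler("samples.log")
file_handler.setFormatter(
    logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

# ... the sampling loop calls stream_logger.info(json.dumps(...)) as before ...
stream_logger.info(json.dumps(dict(key="a", n=3)))

listener.stop()  # join the thread and flush pending records at shutdown
```

Note the `json.dumps` cost itself still lands on the sampler's thread with this scheme; only the formatting and disk IO move off it.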
Also, I kinda like the idea of using logging (rather than keras history) to track training loss / error, but maybe this is out of scope. I'd also be keen to "type" the logs so I can filter on, say, samples versus other events, but didn't come across any docs on this (nor did I look very hard).
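On "typing" the logs: one stdlib way is to hang child loggers off a common parent and attach a `logging.Filter` by name, so a given handler passes only, say, sample records while other events go elsewhere. Logger names here are illustrative, not anything pescador defines:

```python
import logging

root = logging.getLogger("stream_logger")
sample_log = logging.getLogger("stream_logger.samples")  # sample records
event_log = logging.getLogger("stream_logger.events")    # everything else

handler = logging.StreamHandler()
# A name-based Filter passes only records from "stream_logger.samples"
# (and its descendants); other children of the parent are rejected.
handler.addFilter(logging.Filter("stream_logger.samples"))
root.addHandler(handler)
root.setLevel(logging.INFO)

sample_log.info("this reaches the handler")
event_log.info("this is filtered out")
```

Both children propagate up to the parent's handlers, so you can attach one filtered handler per record "type" without touching the sampler code beyond picking the right logger.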
also also, I looked around and logging still seems to be the best logging library out there ... a few things wrap it (daiquiri, pygogo), but nothing replaces it.
Can this be a 2.1 feature? It should only add to the API, not change existing functionality.
:+1:
@cjacoby got any cycles to look into this? I'd like to get 2.1 off my stack in the short term.
Given the radio silence on this, and a lack of clear picture of what exactly the API should be, how do you all feel about dropping this feature @ejhumphrey @cjacoby ? I can see its utility, but it also makes things much more complicated.
Sorry! I could make time to work on this this week if you think it's useful. (Reading emails tho, that I am bad at making time for ;) )
I think it might make sense to punt to next version just to at least clarify the API/approach. Or, we make it "provisional"?
I have an approach in my head, though don't know if it's the best one.
> I think it might make sense to punt to next version just to at least clarify the API/approach. Or, we make it "provisional"?
Ok. How about we punt it to some yet-to-be-determined 3.x release then?