
node that is always re-run

gravesee opened this issue 2 years ago • 25 comments

Is your feature request related to a problem? Please describe. For a documentation project I am using Hamilton to generate Word output with the python-docx package. I have nodes representing sections of the Word document, organized into their own modules. Each section's function takes a docx.Document instance that should not be shared with other sections. Once the sections are all computed, I use a GraphAdapter to combine them into a single Word document. Currently, I am instantiating a new instance of docx.Document in each section. I would like to put this setup in a single node that gets re-computed each time it is used as a dependency.
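
For illustration, a minimal sketch of the current shape of such a section module (the section names and template.docx are made up here; only the fresh-docx.Document-per-section pattern comes from the description above):

import docx


def summary_section() -> docx.Document:
    # These two lines are what gets repeated in every section node today.
    document = docx.Document("template.docx")
    document.add_heading("Summary", level=1)
    return document


def detail_section() -> docx.Document:
    document = docx.Document("template.docx")  # a fresh instance, never shared
    document.add_heading("Details", level=1)
    return document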

Describe the solution you'd like A decorator that instructs the driver to always recompute the node.

Describe alternatives you've considered I have tried creating an infinite generator that returns the instantiated docx.Document, but its usage within other nodes is ugly (requiring a call to next).

Additional context

gravesee avatar Mar 02 '23 17:03 gravesee

Ok, I think I get what's happening. It's a side-effect, so I think we want to model it as such... I think getting it to recompute every time might be tricky (the memoization is a pretty core feature), but IMO we have quite a few options to make the code clean. Before I dig in I want to make sure that I understand it properly -- any chance you could post a little code/pseudocode to validate?

Also happy to jump on a call and walk through -- might make the back and forth faster!

elijahbenizzy avatar Mar 02 '23 19:03 elijahbenizzy

Here is a simple use case that should pass the tests if the requested feature is added to Hamilton:

def no_memo() -> dict:
    return {}

def func1(no_memo: dict) -> dict:
    no_memo['func1'] = 'val1'
    return no_memo

def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo


if __name__ == "__main__":
    import __main__ as main
    
    from hamilton.driver import Driver
    from hamilton.base import SimplePythonGraphAdapter, DictResult
    
    dr = Driver({}, main, adapter=SimplePythonGraphAdapter(DictResult))
    res = dr.execute(["func1", "func2"])
    
    print(res)
    # {'func1': {'func1': 'val1', 'func2': 'val2'},
    #  'func2': {'func1': 'val1', 'func2': 'val2'}}
    assert res['func1'] == {"func1": "val1"}
    assert res['func2'] == {"func2": "val2"}

gravesee avatar Mar 02 '23 21:03 gravesee

Yep -- got it.

Ok, so I get what you're doing/asking, but I'm not 100% convinced this is the right solution to the original problem. Of two minds:

Con:

  • This is only useful in the case in which nodes have side-effects (if they didn't, they wouldn't need to be un-memoized), which, IMO, should be contained as much as possible in the DAG. The more common side-effects become, the fewer optimizations we can do with Hamilton.
  • There are a bunch of other ways to solve this that are cleaner than the original generator you had (which I actually like :)):
    1. Use @parameterize to create a bunch of these, each individually labeled, then refer to those as the upstream.
    2. Add a wrapper to the object you want that makes it a one-time-use call, e.g. a class that delegates to the underlying object and instantiates it on .get(). This is like the .next() call but slightly cleaner (see the sketch after this list).
    3. Use @subdag to create a bunch of them + the downstream functions (messy, but kind of nice if you're doing similar transformations in each module and want to share/not repeat logic -- a more advanced version of (1)).
    4. Add a non-Hamilton decorator that injects a parameter into the function when it is called (a little more verbose).
    5. Do a variant of (1), but with a full wrapper that delegates to a new object on a "finalize" call -- e.g. when performing the side-effect. Think copy-on-write, but a little more tailored to the use-case.

Honestly? Not sure if any of these makes you happy -- happy to walk you through them in more detail/prototype them. Looking above, none of them would be quite as simple as supporting it the way you want :) I'm partial to (1), as these objects are supposed to be separate, and I'm not convinced the DAG shouldn't think of them as such. I also like (4), if you don't mind the instantiation of the object being shared with the function.
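
A minimal sketch of (2), assuming a hypothetical OneTimeUse wrapper (none of these names are real Hamilton APIs): the memoized node value is the factory, and each consumer builds its own fresh object via .get().

from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class OneTimeUse(Generic[T]):
    def __init__(self, factory: Callable[[], T]):
        self._factory = factory

    def get(self) -> T:
        return self._factory()  # a brand-new instance on every call


def no_memo() -> OneTimeUse[dict]:
    # For the docx use-case: OneTimeUse(lambda: docx.Document("template.docx"))
    return OneTimeUse(dict)


def func1(no_memo: OneTimeUse[dict]) -> dict:
    d = no_memo.get()  # safe to mutate -- func2's .get() returns a different object
    d['func1'] = 'val1'
    return d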

Pro:

  • Maybe we should support side-effects, especially if we have insight/visibility (@forget() is a nice way to tell the framework what's a side-effect and what's not).

Thoughts?

elijahbenizzy avatar Mar 02 '23 21:03 elijahbenizzy

My use-case isn't a dataframe problem so I understand if it doesn't fit the Hamilton philosophy. Specifically, I have to instantiate a docx.Document and pass it a template from my local filesystem. It's two lines of code that I end up repeating in a lot of node functions. It's kind of a side-effect, but I think a similar use-case would be a node that returns the current time every time it is passed as a dependency to, say, understand the execution order and clock time of the DAG. Is that a side-effect?

I think option 2 could solve my problem fairly cleanly. The dependencies would be typed correctly so that information would be there for the user to see. And calling .get isn't that bad. But if this is still under consideration, I like the @forget decorator.

gravesee avatar Mar 03 '23 12:03 gravesee

Yep -- I don't think yours is too far out of the ordinary. Hamilton was built initially for dataframes, but we've seen a wide array of use-cases here. Now that I'm thinking about it more, instead of @forget, I'd propose @resource that takes in some sort of policy. Again, drawing from pytest fixtures :)

@resource(policy=FORGET)
def document() -> docx.Document:
    # how you want it to behave
    ...
    
@resource(policy=CACHE)
def document() -> docx.Document:
    # how it currently behaves
    ...

Re: side effects, I think yours barely counts -- it's more state. In which case, it's just a way to make it easier to produce something of the given state; after the dependent node gets executed, the state becomes immutable. Overall it's side-effect free, so maybe it's a reasonable way to approach it?

I'd say that the DAG-tracking example should probably be some other sort of hook (e.g. through graph adapters/whatnot) rather than a dependency -- IMO it has the potential to confuse the user when visualizing the DAG... We've prototyped/are working with stuff that does exactly this internally, if you're curious.
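
For example, a rough sketch of that hook style using the graph adapter's execute_node extension point (TimingAdapter is a made-up name, and this is a sketch of the idea rather than the internal prototype mentioned above):

import time
from typing import Any, Dict

from hamilton import node
from hamilton.base import SimplePythonGraphAdapter


class TimingAdapter(SimplePythonGraphAdapter):
    """Records per-node wall-clock time as a side channel -- not as DAG nodes."""

    def __init__(self, result_builder):
        super().__init__(result_builder)
        self.timings: Dict[str, float] = {}

    def execute_node(self, node_: node.Node, kwargs: Dict[str, Any]) -> Any:
        start = time.time()
        result = super().execute_node(node_, kwargs)
        self.timings[node_.name] = time.time() - start
        return result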

@skrawcz thoughts on the resource decorator?

elijahbenizzy avatar Mar 03 '23 17:03 elijahbenizzy

Just thinking through all the ways and what the code would look like -- here's the infinite generator way you describe (sorry had to write it out):

from typing import Generator


def no_memo() -> Generator[dict, None, None]:
    def dict_generator():
        while True:
            yield {}  # a fresh dict on every next() call
    return dict_generator()


def func1(no_memo: Generator[dict, None, None]) -> dict:
    _no_memo = next(no_memo)
    _no_memo['func1'] = 'val1'
    return _no_memo


def func2(no_memo: Generator[dict, None, None]) -> dict:
    _no_memo = next(no_memo)
    _no_memo['func2'] = 'val2'
    return _no_memo


if __name__ == "__main__":
    import __main__ as main

    from hamilton.driver import Driver
    from hamilton.base import SimplePythonGraphAdapter, DictResult

    dr = Driver({}, main, adapter=SimplePythonGraphAdapter(DictResult))
    res = dr.execute(["func1", "func2"])

    print(res)
    assert res['func1'] == {"func1": "val1"}
    assert res['func2'] == {"func2": "val2"}

We could have some smarts around knowing the types: if it's a (basic) generator, call next() and pass the result in -- effectively enabling this API:

from typing import Generator


def no_memo() -> Generator[dict, None, None]:
    def dict_generator():
        while True:
            yield {}
    return dict_generator()


def func1(no_memo: dict) -> dict: 
    no_memo['func1'] = 'val1'
    return no_memo


def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo

Thoughts? I.e. if the function takes in the generator, then don't do anything; else check that the annotation matches the yielded type on the function, and if so, call next() in the framework? I like the generator approach because I think it makes it clear to anyone reading that it's going to be called multiple times... (though I need to figure out a way to annotate things so my static analyzer doesn't complain).

skrawcz avatar Mar 03 '23 22:03 skrawcz

Intriguing, but this feels like a messy thing to put on the framework. I don't really agree that it's clear that something is getting called multiple times (maybe this is because the Generator type parameters are super confusing in Python). We don't have any other instances where the type of a function determines how it should be called -- it's all been in the decorators.

IMO if we're going to start doing more magical meta-programming, we want to limit the space to reduce cognitive burden. Decorators are meant to do that, but if we also change the output type and do something special dependent on that, it's just more moving pieces to figure out how a dataflow should work.

elijahbenizzy avatar Mar 04 '23 20:03 elijahbenizzy

Intriguing, but this feels like a messy thing to put on the framework.

It's no messier than the alternatives for having the framework deal with this.

Decorators are meant to do that,

Good idea. This looks simpler:

@generator
def no_memo() -> dict:
    return {} # or perhaps this should be `yield {}` ?

def func1(no_memo: dict) -> dict: 
    no_memo['func1'] = 'val1'
    return no_memo

def func2(no_memo: dict) -> dict:
    no_memo['func2'] = 'val2'
    return no_memo
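
One possible (purely illustrative) shape for such a decorator -- a marker attribute that the framework would check before deciding whether to memoize. Neither the attribute nor the framework hook exists today:

def generator(fn):
    # Hypothetical marker the framework would read to skip memoization for this node.
    fn._hamilton_recompute = True
    return fn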

skrawcz avatar Mar 04 '23 21:03 skrawcz

FWIW I had this same need in my own framework. I attached a handle to the generator for each method field, and it autowired itself with upstream function arguments the same way you have done. E.g. func2.generator() would directly invoke the created generator, which in turn would pull in output from the no_memo generator to satisfy its named arguments, etc.

In my version, all functions behaved more or less like generators. Standard functions that returned values would just have those values cached and re-used on subsequent calls. Those repeated objects were read only.

jdonaldson avatar Mar 07 '23 18:03 jdonaldson

Oh awesome -- curious, is your framework open-source? Would love to compare notes. And I like that idea of all functions behaving like generators -- it has a clean feel to it, and naturally implements memoization. I guess the interesting thing is that these are generators with two modes:

  1. Return a fixed value (e.g. cached: an infinite generator returning the same thing)
  2. Return a new value each time (uncached: an infinite generator whose factory is re-called)

Whereas the generator abstraction allows for a lot more (different values every time, etc.). So I would argue that implementing them using generators might make sense, but exposing that to the user might be a little confusing? But yeah! Would love to see what you came up with!
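
Concretely, the two modes written out as infinite generators (just a sketch of the framing above):

def cached(value):
    while True:
        yield value  # memoized: the same object every time


def uncached(factory):
    while True:
        yield factory()  # recomputed: a fresh object every time


fresh_dicts = uncached(dict)
assert next(fresh_dicts) is not next(fresh_dicts)  # two distinct objects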

elijahbenizzy avatar Mar 07 '23 19:03 elijahbenizzy

In my version, all functions behaved more or less like generators. Standard functions that returned values would just have those values cached and re-used on subsequent calls. Those repeated objects were read only.

@jdonaldson Did you end up having a chain of them? If so, how did that work and what kind of application did that enable?

skrawcz avatar Mar 07 '23 19:03 skrawcz

Yeah, it's a chain.

It worked by precomputing the DAG from the field names (as you have done). However, it would also augment each decorated method with its own generator function that resolved the specific path through the DAG, and then yielded data through the chain of functions itself as an iterator.
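
A toy version of that yield-through-the-chain behavior (the details of the actual implementation are assumed, not shown):

def source():
    for i in range(3):
        yield {"raw": i}


def transform(source):
    # Pulls from the upstream generator one item at a time.
    for item in source:
        item["squared"] = item["raw"] ** 2
        yield item


# Resolving the path source -> transform and streaming through it:
for row in transform(source()):
    print(row)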

I'm checking to see if you see a way to integrate this functionality at a high level. If one can stream in this fashion, it's possible to build really sophisticated pipelines for training and serving from a single method. This is totally my jam for doing my modeling work, and I've seen so many people "almost" get it right (dagster). The streaming/yielding capability is key. I'm stubborn on that, and totally flexible on pretty much everything else.

I'm not partial in how one exposes the generator from a decorated DAG node. I just think it makes sense to have a "generator" function field, so it's clear what its basic behavior is. If you have better ways of peeling the banana here, I'm all ears.

jdonaldson avatar Mar 07 '23 23:03 jdonaldson

The application is basically any kind of composable DAG structure for ML pipelines. I've used them to sketch out complex DS transformations, generate notebooks, investigate ML model behaviors, and then deploy training/inference endpoints. It's my one-stop shop for all this. I'd really love to find a home for the technique that is not a totally new project.

jdonaldson avatar Mar 07 '23 23:03 jdonaldson

Also, I'm sure this is named after Hamiltonian paths -- a traversal that visits every node in a graph exactly once. However, it's not specifically DAG-related, is it? I might be missing a reference there.

jdonaldson avatar Mar 07 '23 23:03 jdonaldson

We have a bunch of flexibility with how Hamilton is set up. We also have some plans to refactor some internals, which we'd need to do if we want to enable chaining generators. So, happy to think about how to approach this. Otherwise, our style of development is to design an API we'd want users to use, see how that looks, and work from there. Do you have a toy use case we could try to code up?

Also, I'm sure this is named after Hamiltonian paths -- a traversal that visits every node in a graph exactly once. However, it's not specifically DAG-related, is it? I might be missing a reference there.

The origin story of the name is threefold:

  • We were working with the FED team at Stitch Fix -- i.e. the forecasting, estimation, and demand team. This was going to be key for them, and what's a "foundational" association with the FED? Alexander Hamilton.
  • The FED team modeled the Stitch Fix business, and physics has "Hamiltonian" notions to help model the world.
  • We're doing some basic graph theory, which has associations with "Hamiltonian" things too.

skrawcz avatar Mar 07 '23 23:03 skrawcz

Ok, I'm sold on the name if it has anything to do with Alexander Hamilton. Graph connections are a bonus. :)

I'll put the code together in a repo here soon with a notebook showing its use. I'll post a link to it here.

jdonaldson avatar Mar 08 '23 16:03 jdonaldson

Awesome! Oh man, I'm going to have to add this story to the documentation, but...

When we were coming up with Hamilton, I prototyped a framework (that looked a lot like dagster/burr) -- e.g. where you defined transformations and wired them together separately, as opposed to Hamilton where they're the same. I called it "burr", and the team we were presenting to liked "hamilton" (@skrawcz's design) a lot better :)

elijahbenizzy avatar Mar 08 '23 16:03 elijahbenizzy

LOL, I'm sure methods will need to be written a certain way with this approach... Instead of Ruby on Rails, I propose we call the approach Hamilton on Ice.

jdonaldson avatar Mar 08 '23 19:03 jdonaldson

Hahahaha love it!

elijahbenizzy avatar Mar 08 '23 23:03 elijahbenizzy

Ok, here's a simple example showing how the generator might work : https://github.com/jdonaldson/hamilton-on-ice/blob/main/notebooks/Streaming%20Example.ipynb

jdonaldson avatar Mar 09 '23 23:03 jdonaldson

@artifacts are basically functions with a single return; those values get repeated each iteration. I use them for things like config and model artifacts.
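
A self-contained sketch of that split, with a stand-in artifact decorator (the real implementation is in the linked notebook; everything here is illustrative):

import itertools


def artifact(fn):
    # Hypothetical decorator: compute once, then replay the same value forever.
    value = fn()
    return lambda: itertools.repeat(value)


@artifact
def config():
    return {"batch_size": 4}


def batches(config_stream):
    # A streaming node: a fresh batch each iteration, the same config replayed.
    for i, cfg in enumerate(config_stream):
        yield list(range(i * cfg["batch_size"], (i + 1) * cfg["batch_size"]))


for batch in itertools.islice(batches(config()), 3):
    print(batch)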

jdonaldson avatar Mar 09 '23 23:03 jdonaldson

I'll do a proper streaming example next; I just wanted to show baby steps. I added a small utility to render a DAG there at the bottom.

jdonaldson avatar Mar 09 '23 23:03 jdonaldson

@jdonaldson took a look. Looks quite interesting!

It seems like you're using classes like Hamilton uses modules?

Things I'm thinking about:

  • with Hamilton, we can "inject" decorators as we walk the DAG, so would we need to annotate e.g. @jsonl directly?
  • what's the "driver" API that we'd want here?
  • we should set up time to chat more about the API and experience we'd want someone to have -- and importantly gotchas to avoid/ways to help people better debug/understand errors.

Otherwise a more realistic example would help me grok usage better.

skrawcz avatar Mar 14 '23 03:03 skrawcz

Thanks for checking it out! It's definitely more convoluted for a basic training/transform use-case, and that definitely raises a legitimate concern. The payoff I'm arguing for here is that streaming interfaces for DAGs can be used for both training and inference pipelines, as long as one accepts a parameterized minibatch technique for training and serving scenarios. The rest of the implementation here tries to serve that goal, and who knows? Maybe you know a better way to serve the underlying need. I'm not really married to my implementation at all. It just works, more or less, in a useful way for me for now.

I have some thoughts/suggestions here on your questions:

with Hamilton, we can "inject" decorators as we walk the DAG, so would we need to annotate e.g. @jsonl directly?

The choice of decoration there gives one more control over what specifically happens at that step. For instance, one may stream the results of one transformation to a series of Feather-serialized pandas files on S3, while another transformation may need to go to a database. Then, when the pipeline is deployed, everything needs to be serialized in a different way, with credentials located in the ENV. I really don't want a framework to solve everything here for me. I'm happy with a decorator I can sprinkle onto methods, wiring the complex behaviors with custom python code. I run into so much weird stuff sometimes interacting with different CRUD services, and this is a great way of dealing with it. That said, the class-oriented foundation of the code here permits hierarchical configuration with decorators. One can decorate methods, or simply decorate the class, which decorates all of its methods.
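
A sketch of that per-node sink idea (all names here are hypothetical -- the point is that the decorator, not the framework, decides where each stream gets persisted):

import json


def sink(write):
    # Hypothetical decorator factory: tee each yielded item to a writer.
    def wrap(fn):
        def wrapped(*args, **kwargs):
            for item in fn(*args, **kwargs):
                write(item)  # e.g. append a JSONL line, push to S3, hit a database...
                yield item
        return wrapped
    return wrap


@sink(lambda item: print(json.dumps(item)))  # stand-in for a real @jsonl writer
def transformed(rows):
    for row in rows:
        yield {"doubled": row * 2}


results = list(transformed([1, 2, 3]))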

what's the "driver" API that we'd want here.

I think I know what you mean, but there are a number of potentially different meanings for "driver". Let's chat.

we should set up time to chat more about the API and experience we'd want someone to have -- and importantly gotchas to avoid/ways to help people better debug/understand errors.

Exactly! Happy to do so.

jdonaldson avatar Mar 17 '23 21:03 jdonaldson

OK, I had some thoughts about this, but I wanted to move issues, as we've moved a bit beyond this one :) I do think this issue could be solved by the implementation in #19, but that's a slightly more general case, so I'm going to pen my thoughts there. It's just one API -- I think there are other approaches as well...

elijahbenizzy avatar Mar 25 '23 23:03 elijahbenizzy

I think we got some of this functionality with the Parallelizable / Collect constructs. Please re-open if not.
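
For reference, a hedged sketch of how the original use-case might look with those constructs (the section list and combine step are made up; running it also requires a driver built with enable_dynamic_execution):

from hamilton.htypes import Collect, Parallelizable


def section_name() -> Parallelizable[str]:
    for name in ["intro", "methods", "results"]:  # made-up section list
        yield name


def section_doc(section_name: str) -> dict:
    # Each branch gets its own fresh object -- in the original use-case,
    # this would be docx.Document("template.docx").
    return {section_name: "..."}


def combined(section_doc: Collect[dict]) -> list:
    return list(section_doc)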

skrawcz avatar Jul 18 '24 18:07 skrawcz

Yeah, I need to play with the new stuff there. I think you more or less have it with some of your recent features; I just need to pull it together again at some point.

jdonaldson avatar Jul 18 '24 23:07 jdonaldson