tags for externally provided nodes
Is your feature request related to a problem? Please describe.
In many applications we expect the user to provide inputs to change the behavior of the DAG, e.g. config.when.... The Hamilton driver will display errors when these external dependencies aren't provided, but, AFAIK, there is no way to document them for the user or to query their metadata later, for example to build a config file skeleton.
Describe the solution you'd like
The ability to add tags to externally provided inputs. That is, dependencies of node functions that aren't defined anywhere but in the function signatures. Something that lets me signify in code that these are expected inputs of node functions and here is some information about them.
@tag({...})
@external
def input_data_path():
    pass
Describe alternatives you've considered
I thought about something like this, but there is still a disconnect between the tagged node and the input the user needs to provide.
@tag({...})
def input(raw_input: str) -> str:
    return raw_input
Additional context
My team wants to expose Hamilton by pre-creating a variety of configurable DAGs. The DAGs would depend on some inputs from the user. We would like to query the nodes-without-module-tags (which all seem to be external) to generate helpful documentation and context about what is expected from the user.
OK, this is interesting. Let me make sure I get your use-case:
- you want to be able to know, programmatically, which nodes are already in the DAG and which nodes should be supplied externally
- currently you're thinking about using the module tag/querying for it (which is suboptimal)
- you might also want to do this for @config.when/configuration-driven DAG shaping
- And you want to be able to process this and give it to your users
Right?
As a first note, I think you can do what you're trying to do with the module tag now, although it's a little ugly -- from the hello-world:
[item for item in driver.Driver({}, my_functions).list_available_variables() if 'module' not in item.tags]
[Variable(name='spend', type=<class 'pandas.core.series.Series'>, tags={}), Variable(name='signups', type=<class 'pandas.core.series.Series'>, tags={})]
My thinking is that there shouldn't be a need to do this in code, unless you want to say "this input should always come from the outside world". Instead, you should be able to query the driver for nodes/configuration items that are needed. This would basically be adopting this list_available_variables functionality and adding an additional item to say the "source" of this (which is information we already track, and would just need to wire through). E.G.
dr = driver.Driver(config={...}, modules=...)
external_inputs = [var for var in dr.list_available_variables() if var.is_external_node]
So, thoughts on that type of API?
the following is less likely to be relevant but I'm just thinking through some edge-cases
Note that there's a caveat -- the external inputs that config needs are of a different class, as they're used to shape the DAG. E.G. you can choose between two different functions, which could have two different sets of inputs:
def load_file_from_s3(s3_url: str) -> ...:
    ...

def load_file_from_disk(disk_path: str) -> ...:
    ...
That said, I think you probably don't want to expose conditional inputs to your users, and if you do, you'll likely want them as two different sets of inputs (rather than "if you use X config item then choose between Y and Z").
Thanks for the response. I may have muddied the waters by including a reference to config.when. I included it as an example because often, in my usage anyway, the values tested for in config.when are provided as input by the user (usually a literal, in my case).
dr = driver.Driver(config={...}, modules=...)
external_inputs = [var for var in dr.list_available_variables() if var.is_external_node]
Your example above is very close to what I am requesting, with the added feature of querying the tags of the returned list of nodes. Right now, AFAIK, there is no way to add tag data to such nodes. What I'd like to do is something like this:
help = {}
for n in external_inputs:
    help[n.name] = n.tags['about']
Hopefully that makes the question more clear.
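To make the request concrete, here is a minimal, self-contained sketch of how such tags could be consumed. It does not use Hamilton: the `Variable` dataclass is a stand-in, and both the `is_external_input` field and the `about` tag key are assumptions taken from this discussion, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Variable:
    """Stand-in for a driver variable; NOT Hamilton's actual class."""
    name: str
    type: Any
    tags: Dict[str, str] = field(default_factory=dict)
    is_external_input: bool = False  # assumed field, as proposed in this thread


def collect_help(variables: List[Variable]) -> Dict[str, str]:
    """Map each external input's name to its 'about' tag, if it has one."""
    return {
        var.name: var.tags.get("about", "(undocumented)")
        for var in variables
        if var.is_external_input
    }


variables = [
    Variable("input_data_path", str, {"about": "Path to the raw data file"}, True),
    Variable("model_name", str, {}, True),
    Variable("loaded_data", dict, {"module": "my_functions"}, False),
]
help_text = collect_help(variables)
```

Using `.get` with a placeholder keeps untagged inputs visible in the result instead of raising a `KeyError`.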
@gravesee If I understand correctly, you just want to add metadata?
E.g. do you want to do something like this -- well this is one option -- directly encode it and have Hamilton pull it for metadata only:
@tag({...})
@external
def input_data_path():
    """This describes some input that's required"""
    pass

def load_data(input_data_path: str) -> ...:
    """This function uses `input_data_path`"""
Or, the other option -- build a DAG and then attach metadata to it, much like you and Elijah have demonstrated.
Curious, how would that then get consumed? Is one of these better for that than the other?
At a higher level, it seems like you just want to have a documented "input schema" essentially? Could also perhaps use the python doc string for the parameter, get more information out for it too?
E.g. something like...
some_schema = SomeObjectSchema("... making this up .. but this could list valid values, a function to run beforehand, maybe some docs")

@expect_input(input_data_path=some_schema)
def load_data(input_data_path: str) -> ...:
    """Doc string for load_data

    :input_data_path: documentation for `input_data_path` that we could pull out for metadata for it?
    """

# and then in the driver we'd know to pull the above information to annotate the "external" i.e. "user defined" node?
Obviously if you need to attach things programmatically then this isn't the solution for that. But just a thought.
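As a rough illustration of the docstring idea, the sketch below pulls per-parameter documentation out of a function's docstring using only the standard library. The `:name: text` line format mirrors the example above, and `param_docs` is a hypothetical helper, not something Hamilton provides.

```python
import inspect
import re
from typing import Callable, Dict


def load_data(input_data_path: str) -> str:
    """Doc string for load_data

    :input_data_path: documentation for `input_data_path`
    """
    return input_data_path


def param_docs(fn: Callable) -> Dict[str, str]:
    """Return {parameter name: doc text} for `:name: text` docstring lines."""
    doc = inspect.getdoc(fn) or ""
    params = inspect.signature(fn).parameters
    found = {}
    for line in doc.splitlines():
        match = re.match(r":(\w+):\s*(.*)", line.strip())
        # Only keep entries that name a real parameter of the function.
        if match and match.group(1) in params:
            found[match.group(1)] = match.group(2)
    return found


docs = param_docs(load_data)
```

Checking against the signature means a stray `:returns:` style line is ignored rather than reported as an input.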
Riffing some more -- combining the above two:
@tag({...})
@external  # says this is a validation function for this particular input
def input_data_path(value: str) -> str:  # types should match expected type input
    """This describes in more detail what is or isn't required -- and sphinx docs could expose"""
    assert value in ['foo', 'bar', 'baz'], f"invalid value passed {value}"
    return value

def load_data(input_data_path: str) -> ...:
    """This function uses `input_data_path`"""
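Until something like `@external` exists, the validation half of this idea can be approximated outside Hamilton. The sketch below is an assumption-laden illustration: `VALIDATORS` and `validate_inputs` are hypothetical names, and it raises `ValueError` instead of using `assert` so the check survives `python -O`.

```python
from typing import Any, Callable, Dict


def input_data_path(value: str) -> str:
    """Which dataset to load; must be one of 'foo', 'bar', 'baz'."""
    if value not in ("foo", "bar", "baz"):
        raise ValueError(f"invalid value passed {value}")
    return value


# Hypothetical registry: input name -> validation function.
VALIDATORS: Dict[str, Callable[[Any], Any]] = {"input_data_path": input_data_path}


def validate_inputs(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Run each input through its validator, if one is registered."""
    return {
        name: VALIDATORS[name](value) if name in VALIDATORS else value
        for name, value in inputs.items()
    }


checked = validate_inputs({"input_data_path": "foo", "other": 1})
```

The validated dict could then be passed as the `inputs` argument when executing the driver.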
Yes! This is succinctly put and what I am looking for: How to document the fields that are expected to be provided by the user. I can get a list of upstream nodes, but all I get is a name and a type. How do I understand the intent of what should be provided to make the system work?
Cool. @gravesee how do you want to expose that information to your users? (Just thinking if there's more functionality needed, etc.)
More for developers to be able to access this information to do things like generate documentation. I could create a driver and pass in my modules, Hamilton figures out which nodes are external, and as a developer, I can use that information to create documentation or generate a yaml skeleton that a user could edit for later ingestion. I don't think Hamilton would have to do anything special with the metadata other than provide a mechanism for attaching it to dependencies that aren't defined as regular Hamilton functions.
For more context on this specific project, I am generating Word documents for model governance. I'm using Hamilton to stitch together the various sections of this Word document and manage the common sets of dependencies that are found in these sections (model name, performance tag, datasets, etc.). We have several classes of document that can be rendered and they require different subsets of inputs. I plan on exposing these different DAGs to the end-users (non-developers) by wrapping the Hamilton driver in a user-facing class:
class ModelGovernanceXYZ:
    def __init__(self):
        self.final_vars = [....]
        self.dr = driver.Driver({}, *modules, adapter=adapter)

    def execute(self, inputs: dict):
        # Driver.execute takes the list of final variables as its first argument
        return self.dr.execute(self.final_vars, inputs=inputs)

    def generate_yaml(self):
        """use external nodes list to generate yaml skeleton"""
        pass

    def generate_docs(self):
        """use external nodes list to generate docs from tags, e.g. markdown"""
        pass

    @classmethod
    def from_yaml(cls, yaml):
        """instantiate this class from a yaml file"""
        pass
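As one possible shape for `generate_yaml`, the sketch below renders a YAML skeleton from a list of external inputs. It is standalone and makes assumptions: `ExternalInput` is a hypothetical record (in practice the names and types would come from the driver, and the descriptions from wherever the tags end up living), and the YAML is written by hand to avoid a third-party dependency.

```python
from typing import List, NamedTuple


class ExternalInput(NamedTuple):
    """Hypothetical record of one externally provided input."""
    name: str
    type_name: str
    about: str


def yaml_skeleton(inputs: List[ExternalInput]) -> str:
    """Render a YAML skeleton with one commented, empty key per input."""
    lines = []
    for item in inputs:
        lines.append(f"# {item.about} ({item.type_name})")
        lines.append(f"{item.name}:")  # empty value, to be filled in by the user
    return "\n".join(lines) + "\n"


skeleton = yaml_skeleton([
    ExternalInput("model_name", "str", "Name of the model being documented"),
    ExternalInput("performance_tag", "str", "Label of the performance run"),
])
```

A key with no value parses as null in YAML, so a loader can detect which inputs the user has not filled in yet.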
Cool use-case! I think this is a pretty reasonable ask; it's more a question of the API we want to expose. I'm thinking something like a schema argument to driver -- E.G.
# Functions containing metadata on superset of possible inputs
import schema
# Add schema arg
dr = driver.Driver({}, *modules, schema=schema)
# Add `is_external_input` to var
print([item for item in dr.list_available_variables() if item.is_external_input()])
Then TBD on how the schema would look, but your approach makes a lot of sense. One extra thought could be to add the capability to do validation into it (either with check_output or maybe as part of the function body...).
For now, however, I think we can unblock you within the confines of the current framework + a quick addition.
- See https://github.com/DAGWorks-Inc/hamilton/pull/99 for giving that field to nodes -- this is possible now but this enables the API above.
- For attaching metadata to those "undefined/user-defined" nodes -- before we build that out, I would recommend storing a json/yaml/some other file with a map of all possible inputs/names -> whatever metadata you want to store (tags?), and you can use that in your class. This would effectively be a hand-rolled version of what we're suggesting here, that would get replaced by it.
Sounds reasonable as a start? I'm fine with #99 so if you like the approach we can probably release -- it's a very simple change, but it's slightly cleaner IMO than using the module tag, and is fully backwards compatible.
Yeah perhaps #99 + a custom map of tags to inputs you maintain is a way to unblock things?
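A minimal sketch of that hand-rolled map, under stated assumptions: the JSON layout, the `about` key, and the `attach_metadata` helper are all hypothetical, and the list of external input names is hard-coded here where it would normally come from the driver.

```python
import json
from typing import Dict, Iterable

# Hypothetical metadata file contents: input name -> tags.
INPUT_METADATA_JSON = """
{
  "model_name": {"about": "Name of the model being documented"},
  "datasets": {"about": "Datasets used to train and validate the model"}
}
"""


def attach_metadata(
    external_names: Iterable[str], metadata: Dict[str, dict]
) -> Dict[str, dict]:
    """Pair each external input name with its stored tags (empty if missing)."""
    return {name: metadata.get(name, {}) for name in external_names}


metadata = json.loads(INPUT_METADATA_JSON)
# In practice these names would come from the driver; hard-coded for the sketch.
tagged = attach_metadata(["model_name", "performance_tag"], metadata)
```

Inputs missing from the file map to an empty dict, which makes undocumented inputs easy to spot.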
From a requirements perspective I'd like to round out what exactly we need:
1. The ability to provide documentation for inputs (perhaps we already have this with doc string of functions that need it).
2. The ability to provide tags for inputs.
3. The ability to validate a schema for inputs.
4. The ability to easily get the "input" nodes to a DAG. (#99)
5. The ability to generate documentation, given a DAG, or some object container to do so -- e.g. generate .rst or .md files? (would this be helpful?)
Anything else?
It sounds like (2) and (4) are the minimum that you need @gravesee ?
Yes, that is correct. Providing functionality for (2) and (4) would let me do a lot.
For attaching metadata to those "undefined/user-defined" nodes -- before we build that out, I would recommend storing a json/yaml/some other file with a map of all possible inputs/names -> whatever metadata you want to store (tags?), and you can use that in your class. This would effectively be a hand-rolled version of what we're suggesting here, that would get replaced by it.
This is the approach I was going down, but it was a lot of duplication.
Got it -- another approach you could do to unblock is build exactly that schema -- E.G. schema.py, with functions that have tags, etc... as well as the correct types.
This is a module you'd need just for documentation. Then you can instantiate a second instance of the driver, and gather all the metadata using that one -- filtering where the module field in the tags is equal to schema. You join that with the external inputs from the first driver...
Not ideal, but it offers a few advantages:
- Specifies metadata for the input in the form you want (hamilton functions)
- Prepares you/gives us a spec for what we're thinking about implementing here
Does that make sense/seem reasonable?
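The schema-module workaround could look roughly like the sketch below. It runs without Hamilton and rests on assumptions: `tag` here is a stand-in decorator that only stores keyword tags on the function (not Hamilton's decorator), the two functions play the role of a documentation-only schema.py, and the external input names are hard-coded instead of coming from a first driver.

```python
import inspect
from typing import Callable, Dict, Iterable, get_type_hints


def tag(**tags: str) -> Callable:
    """Stand-in decorator (hypothetical): stores tags on the function."""
    def decorate(fn: Callable) -> Callable:
        fn.tags = tags
        return fn
    return decorate


@tag(about="Name of the model being documented")
def model_name() -> str:
    """Human-readable model name."""


@tag(about="Label of the performance run")
def performance_tag() -> str:
    """Performance tag to report on."""


def schema_metadata(functions: Iterable[Callable]) -> Dict[str, dict]:
    """Collect tags, docstring and return type for each schema function."""
    return {
        fn.__name__: {
            **fn.tags,
            "doc": inspect.getdoc(fn),
            "type": get_type_hints(fn)["return"].__name__,
        }
        for fn in functions
    }


schema = schema_metadata([model_name, performance_tag])
# Join with the external input names reported by the "real" driver.
external_inputs = ["model_name"]
documented = {name: schema[name] for name in external_inputs if name in schema}
```

Because the schema functions carry real type annotations, the same module can double as a check that documented types match what the DAG expects.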
@gravesee we are going to get (4) out this week. Still thinking about how to best do (2).
@gravesee how are things -- we haven't done tags on inputs yet -- did you manage to find a solution here? or?