hamilton
hamilton copied to clipboard
Enable one to mark a function that should be recreated for each parallel block
Is your feature request related to a problem? Please describe. The S3 client below will not serialize with Ray (and probably other task executors)
def s3_client() -> boto.client:
return boto3.client("s3") # somethign like that
def task(file_path: str) -> Parallelizeable[pd.Series]:
for idx, row in pd.read_csv(file_path):
yield row
def image(task: pd.Series, s3_client: boto.client) -> img:
_image = s3_client.get(task.url)
return _image
def finalize(image: Collect[img]) -> ...:
return ....
We should probably be able to mark the function to know that its result shouldn't be serialized for parallelization.
Describe the solution you'd like We should be able to mark it in a way and therefore know that it should now become part of the graph within the task...
@unserializeable
def s3_client() -> boto.client:
return boto3.client("s3") # somethign like that
def task(file_path: str) -> Parallelizeable[pd.Series]:
for idx, row in pd.read_csv(file_path):
yield row
def image(task: pd.Series, s3_client: boto.client) -> img:
_image = s3_client.get(task.url)
return _image
def finalize(image: Collect[img]) -> ...:
return ....
Describe alternatives you've considered You make it part of the parallel block:
def task(file_path: str) -> Parallelizeable[pd.Series]:
for idx, row in pd.read_csv(file_path):
yield row
def s3_client(task: pd.Series) -> boto.client:
"""Do nothing with the input"""
return boto3.client("s3") # something like that
def image(task: pd.Series, s3_client: boto.client) -> img:
_image = s3_client.get(task.url)
return _image
def finalize(image: Collect[img]) -> ...:
return ....
Additional context Not a burning desire, but it would aesthetically look cleaner...
This is pretty similar to this issue: https://github.com/DAGWorks-Inc/hamilton/issues/90