flowpipe
flowpipe copied to clipboard
Slow call `get_hash` in `NodeEncoder`
Is your feature request related to a problem? Please describe.
In our project some graphs are time critical, when profiling the code the NodeEncoder
arose as a bottleneck.
Describe the solution you'd like
Can we think of something else to serialize nodes without using the sha256
of the hashlib
?
Describe alternatives you've considered
Maybe we can use the __repr__
, this method describe the object in a single line.
Do you guys have alternatives?
It should not be too hard to make the hashing function used there configurable. There is no specific reason I picked the sha256
except for it being quite common and available in the python standard library. To me, one advantage of the neat representations of flowpipe graphs is to easily tell when to graphs are the same (after execution), so a simple __repr__
which falls back to the in-memory location of python objects that didn't define it would not fit the bill. But it could make sense as a default, since every object supports it!
The NodeEncoder
is used to provide something json compatible from generic python objects. A quick look at the code shows me that it should only get called in the Node.list_repr()
or Graph.list_repr()
methods. I might have missed something, though.
Can you maybe elaborate a little bit on the circumstance in which you find the slowdown from the NodeEncoder? Maybe this is a pointer to something else that needs improvement.
I agree on why you picked the sha256
and I also agree that this is a fine way because it indeed shows if objects are the same.
If it should only be called by Node.list_repr()
and Graph.list_repr()
than indeed something is going on. I inly use the function when I'm debugging, but thereafter I never call them again. However the NodeEncoder
shows up many times in my profiler when running whitout these calls.
I will look why they are called and where :) I keep you updated!
So I profiled my code and saw that it's being called in:
https://github.com/PaulSchweizer/flowpipe/blob/125e5bddc7e2edcd14a28af20334bdf69f54db1b/flowpipe/plug.py#L60-L66
This function is called every time a value is updated. I don't think this is expected behavior isn't it?
Oh, yes it is. This piece of logic makes sure that a plug/node stays clean if you set the same input again. Thay way, when you re-run the graph and only parts of the inputs change, you can simply set all of the values, but only the altered part of the graph re-runs.
Would a way to override the hash function used here help you out? Alternatively, I think you might be able to get a dirty quick fix by monkey-patching the utilities.get_hash function before importing anything else from fowpipe
import flowpipe.utilities
def null_hash(*args, **kwargs):
return None
flowpipe.utilities.get_hash = null_hash
import flowpipe....
I haven't tried it though, it might not work at all.
Hmm, again, you're right :p I don't use it that frequently (updating an existing graph), so that's why I didn't thought about it I think.. Maybe it's due to my other issue #135, were I frequently re-create Graphs because they are inside Nodes.
It would be easy to make that hash function a constant in the flowpipe.utilities
package and make use of python's lazy evaluation for convenient overriding. I'm talking about
## flowpipe/utilitied.py
HASH_FUNC = hash_func
....
def hash_func(...):
...
and
## flowpipe/plug.py
import .utilities
...
def _update_value(self, value):
old_hash = utilities.HASH_FUNC(self.value)
new_hash = utilities.HASH_FUNC(value)
...
This way, users like you, that are slowed down by this evaluation (or work with objects where the current hashing function doesn't work), can override the hash function conveniently, by just setting
flowpipe.utilities.HASH_FUNC = my_hash_func
... # rest of the code
Do you think that will help you? It feels pythonic to me, since it's easy to explain and explicit. Dynamically generating graphs is fine, and should be fine, too. Not nesting them in nodes will not help with this issue, at least not as far as I can see.