hdmf
hdmf copied to clipboard
[perf] Increase test performance by 274% with this one weird trick...
Motivation
- pynwb tests take half an hour to run
- importing pynwb takes a long time because it has to load namespaces on import (see https://github.com/NeurodataWithoutBorders/pynwb/pull/1931 )
How to test the behavior?
import pynwb
(run the nwb unit tests)
Description
Profiling the pynwb tests, it turns out that a single deepcopy call accounted for almost 2/3 of the run time - the one within hdmf.spec.spec.ConstructableDict.build_const_args.
This method is called in three places:
- https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/spec/spec.py#L93
- https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/spec/spec.py#L149
- https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/spec/spec.py#L630
When running the tests, I inserted an inspect call to capture the stack traces to see what the higher-order things that call this were, and the only methods that call this one are:
NamespaceCatalog.__load_namespace: https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/spec/namespace.py#L433NamespaceToBuilderHelper.__copy_spec: https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/backends/utils.py#L83
load_namespace is only called when specs are loaded from a file, so a deepcopy is unnecessary here: the deepcopy protects the object being copy from mutation downstream, but the upstream object is discarded immediately and comes from a file (which is not mutated)
the call tree above __copy_spec is linear with one fork at the end:
NamespaceToBuilderHelper.convert_namespaceHDF5IO.__cache_specHDF5IO.exportHDF5IO.write
neither of which mutate the object after the deepcopy operation should get called, as far as I can tell.
The thing that we would want to protect is some downstream consumer of the spec from mutating the spec (ie. it should always be a veridical reflect the spec source), but the deepcopy doesn't protect from that since it's only called when instantiating these spec objects.
When running the tests, however, i got a bunch of errors that i couldn't quite understand, but it seemed like it must be some confused equality comparison, and I noticed that the Spec.__hash__ method was set as being just the id of the object - usually dicts aren't hashable bc they are mutable, but that's usually not a problem, except it is here because it seems like these Spec objects are often used as keys in a dictionary.
it seems like the primary thing that deepcopy was doing is giving the object a new id - which is confirmed by simply using copy instead of deepcopy or hash(str(self)) as the __hash__ function. So it seems like in some places we are forwarding the objects around and getting "hash collisions" or "hash misses" when using id.
SO i tried replacing it with the cheapest hash function money can buy, the aforementioned hash(str(self)) (which is not ideal lol because it violates the other requirement of __hash__), but just as a demo of what the problem is and what a potential solution is bc y'all know better how these classes are used...
Results
So anyway, when running pynwb tests we go from this:
to this:
which is fun. (the errors are because i didn't have pytest installed)
this impact is way more dramatic in something like pynwb where we are loading a ton of complex specs over and over again than it is in the hdmf tests, but it is a pretty dramatic perf improvement for basically free, thought y'all woudl be interested.
Checklist
I have to run out the door rn but i'll return later to finish up any quality checks, but anyway mostly just a "hey here's free perf" pull request as a question bc i figure there are reasons this is not good to do (or you might want a "real hash function")
- [ ] Did you update
CHANGELOG.mdwith your changes? - [x] Does the PR clearly describe the problem and the solution?
- [ ] Have you reviewed our Contributing Guide?
- [ ] Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?
ttyl xoxo <3
oh lmao is ee this exact thing was discussed previously SORRY. always search first
https://github.com/hdmf-dev/hdmf/pull/1103
@sneakers-the-rat Can I close this since this is discussed in another PR?
Its got a bit of a different diagnosis of the problem that might be useful in deciding the route forward, but idrc its like 3 lines :)
Maybe a positive control test to do is to swap out the base class with a frozendict or something to test if the spec classes are being mutated in the way that would be a problem, but I did run the pynwb tests against this PR and the tests passed, and I assume we would see the problem of unexpected mutation pop up somewhere in there if it was happening.
This can probably get closed since #1103 was merged, but ya just fyi there is something funny going on with that hash function that might be causing other problems, since the need for a copy operation seems to be directly related to using id() as hash()
Discussion note: I will take a closer look. We're not sure whether this is necessary, but it shouldn't hurt to add this in.