hdmf [perf] Increase test performance by 274% with this one weird trick...

Motivation

pynwb tests take half an hour to run
importing pynwb takes a long time because it has to load namespaces on import (see https://github.com/NeurodataWithoutBorders/pynwb/pull/1931 )

How to test the behavior?

import pynwb

(run the nwb unit tests)

Description

Profiling the pynwb tests, it turns out that a single deepcopy call accounted for almost 2/3 of the run time - the one within hdmf.spec.spec.ConstructableDict.build_const_args.

This method is called in three places:

https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/spec/spec.py#L93
https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/spec/spec.py#L149
https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/spec/spec.py#L630

When running the tests, I inserted an inspect call to capture the stack traces to see what the higher-order things that call this were, and the only methods that call this one are:

NamespaceCatalog.__load_namespace: https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/spec/namespace.py#L433
NamespaceToBuilderHelper.__copy_spec: https://github.com/hdmf-dev/hdmf/blob/dfb1df79bce3e34c52af7996429c3f561d96390a/src/hdmf/backends/utils.py#L83

load_namespace is only called when specs are loaded from a file, so a deepcopy is unnecessary here: the deepcopy protects the object being copy from mutation downstream, but the upstream object is discarded immediately and comes from a file (which is not mutated)

the call tree above __copy_spec is linear with one fork at the end:

NamespaceToBuilderHelper.convert_namespace
HDF5IO.__cache_spec
- HDF5IO.export
- HDF5IO.write

neither of which mutate the object after the deepcopy operation should get called, as far as I can tell.

The thing that we would want to protect is some downstream consumer of the spec from mutating the spec (ie. it should always be a veridical reflect the spec source), but the deepcopy doesn't protect from that since it's only called when instantiating these spec objects.

When running the tests, however, i got a bunch of errors that i couldn't quite understand, but it seemed like it must be some confused equality comparison, and I noticed that the Spec.__hash__ method was set as being just the id of the object - usually dicts aren't hashable bc they are mutable, but that's usually not a problem, except it is here because it seems like these Spec objects are often used as keys in a dictionary.

it seems like the primary thing that deepcopy was doing is giving the object a new id - which is confirmed by simply using copy instead of deepcopy or hash(str(self)) as the __hash__ function. So it seems like in some places we are forwarding the objects around and getting "hash collisions" or "hash misses" when using id.

SO i tried replacing it with the cheapest hash function money can buy, the aforementioned hash(str(self)) (which is not ideal lol because it violates the other requirement of __hash__), but just as a demo of what the problem is and what a potential solution is bc y'all know better how these classes are used...

Results

So anyway, when running pynwb tests we go from this:

to this:

which is fun. (the errors are because i didn't have pytest installed)

this impact is way more dramatic in something like pynwb where we are loading a ton of complex specs over and over again than it is in the hdmf tests, but it is a pretty dramatic perf improvement for basically free, thought y'all woudl be interested.

Checklist

I have to run out the door rn but i'll return later to finish up any quality checks, but anyway mostly just a "hey here's free perf" pull request as a question bc i figure there are reasons this is not good to do (or you might want a "real hash function")

[ ] Did you update CHANGELOG.md with your changes?
[x] Does the PR clearly describe the problem and the solution?
[ ] Have you reviewed our Contributing Guide?
[ ] Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

ttyl xoxo <3

Jul 16 '24 23:07 sneakers-the-rat

oh lmao is ee this exact thing was discussed previously SORRY. always search first

https://github.com/hdmf-dev/hdmf/pull/1103

Jul 16 '24 23:07 sneakers-the-rat

@sneakers-the-rat Can I close this since this is discussed in another PR?

Jul 22 '24 15:07 mavaylon1

Its got a bit of a different diagnosis of the problem that might be useful in deciding the route forward, but idrc its like 3 lines :)

Jul 22 '24 18:07 sneakers-the-rat

Maybe a positive control test to do is to swap out the base class with a frozendict or something to test if the spec classes are being mutated in the way that would be a problem, but I did run the pynwb tests against this PR and the tests passed, and I assume we would see the problem of unexpected mutation pop up somewhere in there if it was happening.

Jul 22 '24 18:07 sneakers-the-rat

This can probably get closed since #1103 was merged, but ya just fyi there is something funny going on with that hash function that might be causing other problems, since the need for a copy operation seems to be directly related to using id() as hash()

Aug 19 '24 20:08 sneakers-the-rat

Discussion note: I will take a closer look. We're not sure whether this is necessary, but it shouldn't hurt to add this in.

Sep 05 '24 20:09 rly

hdmf hdmf copied to clipboard

[perf] Increase test performance by 274% with this one weird trick...

Motivation

How to test the behavior?

Description

Results

Checklist

hdmf
hdmf copied to clipboard