
test(nns): Analyze list_neurons performance, and look for promising areas for optimization.

daniel-wong-dfinity-org opened this issue on Sep 26, 2024 · 0 comments

Usage

bazel test \
    --test_env=SSH_AUTH_SOCK \
    --test_output=streamed \
    --test_arg=--nocapture \
    //rs/nns/integration_tests:why_list_neurons_expensive \
&& cp -i \
  bazel-bin/rs/nns/integration_tests/test-127765624/why_list_neurons_expensive.runfiles/ic/list_neurons_*.svg \
  ./

(You need to be able to scp from [email protected].)

Limitations

  1. The method that you want to measure must be an update, not a query. You can temporarily change a query to an update to measure it (see the sketch after this list); however, make sure that you do not merge such a change, because it is NOT backwards compatible!

  2. You need to disable heartbeat. Otherwise, your flame graphs will have extraneous garbage in them.
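
For limitation 1, here is a minimal sketch of the kind of temporary, local-only change this implies. The import path and the do_list_neurons helper are assumptions for illustration, not the actual governance code:

    use ic_cdk::update;
    use ic_nns_governance::pb::v1::{ListNeurons, ListNeuronsResponse}; // import path assumed

    // Temporary, local-only change: list_neurons is normally a #[query];
    // re-exposing it as an #[update] lets the instrumentation (which only
    // measures update calls) see it. Do NOT merge this.
    #[update]
    fn list_neurons(request: ListNeurons) -> ListNeuronsResponse {
        // `do_list_neurons` is a hypothetical stand-in for the real handler body.
        do_list_neurons(request)
    }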

Next Steps

  1. Prime Candid's cache by calling ListNeuronsResponse::ty() in post_upgrade, as suggested by Yan. (See the first sketch after this list.)

  2. I'm not using all the space that I allocated to ic-wasm instrument. Previously, when I tried to use all of it, I think it stomped on some other VirtualMemory instances. (The ideal solution is probably for ic-wasm instrument to support MemoryManager + VirtualMemory, rather than asking for raw stable memory space, but this might be difficult. See the second sketch after this list.)

  3. Check the thread in the #dev channel where I am mainly talking to Yan Chen about getting this to work. I might be able to mine that thread for additional tasks to put into this "Next Steps" section...

  4. Make as much of this sharable as possible, because every canister developer wants this (not just NNS).
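
A minimal sketch of step 1, assuming the usual ic_cdk post_upgrade hook (the import path for ListNeuronsResponse is an assumption):

    use candid::CandidType;
    use ic_cdk::post_upgrade;
    use ic_nns_governance::pb::v1::ListNeuronsResponse; // import path assumed

    #[post_upgrade]
    fn post_upgrade() {
        // ... restore canister state as usual ...

        // Building the type tree once here memoizes it inside candid, so the
        // first list_neurons call after an upgrade does not pay that cost.
        let _ = ListNeuronsResponse::ty();
    }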
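
And for step 2, this is roughly how MemoryManager + VirtualMemory partition raw stable memory into non-overlapping regions; the MemoryId assignments are illustrative, not the actual governance layout:

    use ic_stable_structures::memory_manager::{MemoryId, MemoryManager};
    use ic_stable_structures::DefaultMemoryImpl;

    fn init_memories() {
        // The MemoryManager hands out independent virtual memories backed by
        // disjoint pages of the same underlying stable memory, so one user
        // cannot stomp on another.
        let memory_manager = MemoryManager::init(DefaultMemoryImpl::default());

        let _neurons_memory = memory_manager.get(MemoryId::new(0));
        // Hypothetical slot that ic-wasm instrument could claim, if it
        // supported this scheme instead of requiring a raw address range.
        let _instrument_memory = memory_manager.get(MemoryId::new(1));
    }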

Analysis

One of the things that the bazel test command outputs is instruction counts vs. neuron counts. Using linear regression, I fit a very strong model (R^2 = 0.973):

Instructions = (1.03 * HeapNeurons + 5.08 * StableMemoryNeurons + 47.5) * 1e5

This tells us that the incremental cost of fetching and serving a neuron from stable memory is about 5x that of a neuron in heap.

This also tells us that the fixed cost/overhead of list_neurons is about 4.75 million instructions.
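
For example, a principal with 100 neurons in heap and 100 in stable memory would be predicted to cost (1.03 * 100 + 5.08 * 100 + 47.5) * 1e5 ≈ 65.9 million instructions, of which 4.75 million is the fixed overhead.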

See attached Jupyter notebook for source code: list_neurons_instructions_vs_neuron_count.ipynb.txt

Observations

When the model underestimates, the principal tends to have many neurons in stable memory; conversely, when the model overestimates, the principal tends to have very few neurons in stable memory. I'm not sure why that is; it does not really make sense to me... (Warning: I'm not a data scientist.) Possibly it is related to the large variance in stable-memory load cost noted below.

Anecdotally

It seems like there can be BIG variance in the number of instructions needed to load a neuron from stable memory: in some cases, it seems to take 100k instructions; in others, 1.5M. What seems to be happening is that for "fast" stable memory neurons, no time is spent fetching auxiliary data, while for "slow" stable memory neurons, most of the time is spent loading auxiliary data. See, e.g., the neurons of principal ID an2fu-wmt2w-7fhvx-2g6qp-pq5pc-gdxu3-jfcr5-epiid-qhdzc-xnv3a-mae. We might want to dig into this, but I don't think there's time for that if we want to wrap up this investigation this week...

I think principal ID rxnbx-mhkez-7ckve-yboel-dnguy-dos76-hpdlg-fz5bf-xm5ou-72fre-wae is a good example of some (unsurprising) phenomena that seem to generalize:

A large chunk of instructions is spent in a few areas:

  1. Looking for the caller's neurons. This requires a StableBTreeMap range scan (see the sketch after this list). We could perhaps speed this up if some of the index were in heap. It might make sense to split the index so that neurons in heap are also indexed in heap; however, this would introduce complexity.
  2. Loading neurons from stable memory. A neuron in stable memory takes much longer to load than one in heap (already noted earlier when discussing the linear model).
  3. Candid serialization of the response. This is significant when most of a principal's neurons are in heap. (This was hinted at earlier, when the fixed overhead of list_neurons was discussed in the context of the linear model.)
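
To make area 1 concrete, here is a rough sketch of what the caller-neuron lookup amounts to. The key shape (principal bytes, neuron id), the type names, and the exact iterator API are assumptions for illustration, not the actual governance code:

    use ic_stable_structures::memory_manager::VirtualMemory;
    use ic_stable_structures::storable::Blob;
    use ic_stable_structures::{DefaultMemoryImpl, StableBTreeMap};

    type Memory = VirtualMemory<DefaultMemoryImpl>;

    // Assumed index shape: (principal bytes, neuron id) -> (). Entries for the
    // same principal are contiguous, so one range scan finds all of them.
    type PrincipalIndex = StableBTreeMap<(Blob<29>, u64), (), Memory>;

    fn neuron_ids_of(index: &PrincipalIndex, principal: Blob<29>) -> Vec<u64> {
        index
            // Every step of this scan reads stable memory, which is what makes
            // it expensive relative to the same scan over a heap BTreeMap.
            .range((principal.clone(), u64::MIN)..=(principal, u64::MAX))
            .map(|((_principal, neuron_id), ())| neuron_id)
            .collect()
    }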

Possible Solutions (not necessarily recommended!)

These three areas seem to be outside the control of the NNS. We could perhaps speed up our use of stable memory by using a more efficient serialization scheme, but that would introduce more complexity.

When Rust + Wasm64 support becomes mature, that might also help us avoid the cost of stable memory by keeping more data in heap. This has an important drawback, though: we would have to "freeze dry" heap data when upgrading (i.e., write it out in pre_upgrade and restore it in post_upgrade). If there is a large amount of such data, upgrades become expensive. In any case, it does not seem like Rust + Wasm64 support is coming to the IC soon.

Examples

https://drive.google.com/file/d/1eBEuEKtKlmw4hpDkFYgrkX-h40jr9s_H/view?usp=sharing

💡 You will have to download this file to be able to click on boxes in the flame graph ☝️ and zoom in on them, because the Google Drive viewer does not seem to support that.

Related Work

Max's range_neurons optimization. IIUC, this is currently only used in background (data validation) tasks, so it would not help us here.
