What should we benchmark?
I've been working on the performance of the Rust automerge implementation and as a result have been thinking about how we want to benchmark ourselves. I want to build a benchmarking suite for automerge so that we can be sure we are not introducing performance regressions as we iterate on the algorithms and APIs we are building. I'm raising this issue to get some consensus on what we should measure.
Everything here is a strawman, please criticise strongly.
Categories of metrics
What does being performant mean for an Automerge implementation? @ept, @orion and I jumped on a call and tentatively identified three top-level categories of metrics we might care about:
- The latency of user edits. If the user is typing, their edits should be reflected immediately in the UI and thus in the frontend. "Immediately" here means comfortably within a 16ms budget on typical hardware, so that a 60 FPS UI is possible
- The latency of user-initiated backend changes. This covers things like loading or saving a document from a file or network location
- The latency of changes from other peers being reflected in the local document
The difference between these categories is that user edits must be much faster than backend changes. It's probably acceptable if changes from other peers take a few hundred milliseconds to appear in the UI, but local changes, such as inserting text or drawing a freehand path, need to be imperceptibly fast.
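To make the 16ms figure above concrete, here's a minimal sketch of how a benchmark could assert it. The `Document` type here is a hypothetical stand-in, not the real API (which is still in flux):

```rust
use std::time::{Duration, Instant};

// Hypothetical document type -- a stand-in for whatever the
// frontend API ends up looking like.
struct Document(String);

impl Document {
    fn new() -> Self {
        Document(String::new())
    }
    fn insert_char(&mut self, pos: usize, ch: char) {
        self.0.insert(pos, ch);
    }
}

fn main() {
    // One frame at 60 FPS: the budget a local edit must fit inside,
    // ideally with plenty of room to spare for rendering.
    let frame_budget = Duration::from_millis(16);
    let mut doc = Document::new();

    for (i, ch) in "hello".chars().enumerate() {
        let start = Instant::now();
        doc.insert_char(i, ch);
        let elapsed = start.elapsed();
        assert!(
            elapsed < frame_budget,
            "local edit took {:?}, blowing the 16ms frame budget",
            elapsed
        );
    }
}
```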
For any metric there are two cases we care about: multithreaded and single-threaded. The multithreaded case is in general where we expect to see better performance, but the single-threaded case is still important because we expect the typical developer experience to be starting with the single-threaded API (it is, after all, much simpler to avoid threads entirely) and moving to the multithreaded one only if performance is not acceptable.
Scenarios
Based on these categories I've tried to come up with some scenarios which I think capture the kinds of things we want to be fast at. Quite a few are inspired by the crdt-benchmarks work done by Kevin Jahns of y.js. For each scenario we should measure three things:
- The time taken to apply each change to the frontend and generate a corresponding local change to send to the backend (if applicable).
- The time taken to apply each resulting patch to the frontend
- The total time from starting the scenario to the final patch being applied
Separating things this way allows us to measure the end-to-end latency, as well as ensuring that we never block the UI thread.
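As a sketch of what such a harness could look like (the `Frontend`/`Backend` types and method names below are stubs I've made up for illustration, not the actual Automerge API):

```rust
use std::time::Instant;

// Stub types standing in for the real frontend/backend split; the
// method names are hypothetical placeholders.
struct Edit;
struct Change;
struct Patch;
struct Frontend;
struct Backend;

impl Frontend {
    fn apply_local_edit(&mut self, _edit: Edit) -> Change { Change }
    fn apply_patch(&mut self, _patch: Patch) {}
}

impl Backend {
    fn apply_change(&mut self, _change: Change) -> Patch { Patch }
}

fn run_scenario(edits: Vec<Edit>) {
    let mut frontend = Frontend;
    let mut backend = Backend;
    let scenario_start = Instant::now();

    for edit in edits {
        // 1. Time to apply the edit to the frontend and produce the
        //    local change that gets sent to the backend.
        let t = Instant::now();
        let change = frontend.apply_local_edit(edit);
        let edit_latency = t.elapsed();

        let patch = backend.apply_change(change);

        // 2. Time to apply the resulting patch back to the frontend.
        let t = Instant::now();
        frontend.apply_patch(patch);
        let patch_latency = t.elapsed();

        println!("edit: {:?}, patch: {:?}", edit_latency, patch_latency);
    }

    // 3. Total time from starting the scenario to the final patch.
    println!("end to end: {:?}", scenario_start.elapsed());
}

fn main() {
    run_scenario(vec![Edit, Edit, Edit]);
}
```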
I also think each scenario should be measured both for top-level and deeply nested keys, on the basis that traversing the document could easily become an expensive operation if we are not careful, so we should make sure we are testing for that.
| Scenario | Proxied user behaviour |
|---|---|
| Inserting N characters 1 at a time | User typing text into a document |
| Inserting N characters all at once | User pasting text into a document |
| Deleting N consecutive characters all at once | User selecting and deleting some text |
| Inserting N characters at random points in a text object | User typing text |
| Inserting or deleting at a random point in a text object N times | User typing text |
| Appending a map to a list of maps N times | User adding a complex value to a document in a continuous fashion, such as when adding the (x,y) coordinates of a free hand drawn path |
| Inserting a map at random points in a list | User interacting with an application that manipulates complex data in a list, such as a todo list |
| Inserting or deleting at a random point in a list N times | User modifying data in a complex application such as a todo list |
| Inserting a timestamp N times | User generating timestamped data in a continuous fashion |
| Loading a compressed document with N operations in it | User loading a saved document |
| Loading a sequence of N changes, each with 1 operation in it | User loading a series of uncompressed, saved changes |
| Loading a sequence of N conflicting changes to the same property, each with M operations in it | Receiving changes from peers over the network |
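To give a flavour of how a row in this table could be driven, here's a sketch of a generator for the "inserting N characters at random points in a text object" scenario. The function name is made up and the `rand` crate is an assumption; a fixed seed keeps runs reproducible, so results are comparable between revisions:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng}; // rand = "0.8"

// Generates the edit sequence for the "inserting N characters at
// random points in a text object" scenario. The (position, char)
// pairs would then be fed through the timing harness above.
fn random_insert_scenario(n: usize) -> Vec<(usize, char)> {
    let mut rng = StdRng::seed_from_u64(42); // deterministic runs
    let mut text_len = 0;
    let mut edits = Vec::with_capacity(n);
    for _ in 0..n {
        // Insert anywhere in the text as it exists so far.
        let pos = rng.gen_range(0..=text_len);
        edits.push((pos, 'x'));
        text_len += 1;
    }
    edits
}

fn main() {
    let edits = random_insert_scenario(10_000);
    println!("generated {} edits", edits.len());
}
```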
I would love to hear other people's ideas about what we should be measuring!
This looks great, thanks for putting it together! One more suggestion for a scenario would be "deleting N consecutive characters all at once". This is an interesting case because it requires sending the IDs of all the deleted characters.
Besides the CPU time taken, it might also be interesting to record the size of the message sent over the network for each user operation. However, this is unlikely to change much once we've frozen the data format.
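For illustration, a minimal sketch of recording that size, assuming a hypothetical `Change` type whose `encode` method returns the wire bytes (the real serialisation API may differ):

```rust
// Hypothetical Change type: encode() stands in for whatever the
// frozen data format's serialisation entry point turns out to be.
struct Change {
    bytes: Vec<u8>,
}

impl Change {
    fn encode(&self) -> &[u8] {
        &self.bytes
    }
}

// The encoded bytes are what travels over the network, so their
// length is the per-operation message size to record alongside
// CPU time.
fn message_size(change: &Change) -> usize {
    change.encode().len()
}

fn main() {
    let change = Change { bytes: vec![0; 120] };
    println!("message size: {} bytes", message_size(&change));
}
```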
> adding the (x,y) coordinates of a free hand drawn path

For such an application, rather than representing it as a list of maps with `{x, y}` coordinates, I would suggest representing it as a flat list of numbers, alternating between x and y coordinates. That will probably be more compact and faster.
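To illustrate the two shapes (plain Rust stand-ins for what would really be Automerge list elements):

```rust
// Two ways of modelling a freehand path, shown as plain Rust data
// for illustration only.
struct Point {
    x: f64,
    y: f64,
}

fn main() {
    // 1. List of maps: one map with "x" and "y" keys per point.
    //    JSON shape: [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
    let as_maps = vec![Point { x: 1.0, y: 2.0 }, Point { x: 3.0, y: 4.0 }];

    // 2. Flat list of numbers alternating x and y -- two plain list
    //    inserts per point instead of a map plus two keyed values,
    //    so fewer objects and fewer operations.
    //    JSON shape: [1.0, 2.0, 3.0, 4.0]
    let as_flat_list: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0];

    assert_eq!(as_maps.len() * 2, as_flat_list.len());
}
```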
Good shout on the delete benchmark, I'll add that.
Re. the (x,y) coordinates: I can see that representing them as a flat list would be more efficient. Part of the reason I put that benchmark in was to capture the idea of complex continuous actions of some kind (such as drawing, or recording sound). In general we can't predict what data people might need to store, so I figured a map was a good general-purpose benchmark. I can see an argument that we shouldn't be optimising for this (probably quite niche) use case, and that just advising users to use a flat list approach makes sense. Is that your thinking?
I suppose my other thought about a flat list is that it doesn't really support concurrent modifications of the path very nicely. Although I guess inserting and removing pairs should still commute.
Fair point — happy to have a list of maps on the benchmark. Even if we advise against it, people will probably do it anyway, so we should try to make it fast 😀