What should we benchmark?
I've been working on the performance of the Rust automerge implementation and as a result have been thinking about how we want to benchmark ourselves. I want to build a benchmarking suite for automerge so that we can be sure we are not introducing performance regressions as we iterate on the algorithms and APIs we are building. I'm raising this issue to get some consensus on what we should measure.
Everything here is a strawman, please criticise strongly.
Categories of metrics
What does being performant mean for an Automerge implementation? @ept, @orion and I jumped on a call and tentatively identified three top-level categories of metrics we might care about:
- The latency of user edits. If the user is typing, their edits should be reflected immediately in the UI and thus in the frontend. "Immediately" here means comfortably within a 16ms budget on typical hardware, so that a 60 FPS UI is possible
- The latency of user-initiated backend changes. This covers things like loading or saving a document from a file or network location
- The latency of changes from other peers being reflected in the local document
The difference between these categories is that user edits must be much faster than backend changes. It's probably acceptable if changes from other peers take a few hundred milliseconds to appear in the UI, but local changes, such as inserting text or drawing a freehand path, need to be imperceptibly fast.
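To make the 16ms figure above concrete, here's a minimal sketch of how a benchmark could assert it. The `Document` type here is a hypothetical stand-in, not the real API (which is still in flux):

```rust
use std::time::{Duration, Instant};

// Hypothetical document type -- a stand-in for whatever the
// frontend API ends up looking like.
struct Document(String);

impl Document {
    fn new() -> Self {
        Document(String::new())
    }
    fn insert_char(&mut self, pos: usize, ch: char) {
        self.0.insert(pos, ch);
    }
}

fn main() {
    // One frame at 60 FPS: the budget a local edit must fit inside,
    // ideally with plenty of room to spare for rendering.
    let frame_budget = Duration::from_millis(16);
    let mut doc = Document::new();

    for (i, ch) in "hello".chars().enumerate() {
        let start = Instant::now();
        doc.insert_char(i, ch);
        let elapsed = start.elapsed();
        assert!(
            elapsed < frame_budget,
            "local edit took {:?}, blowing the 16ms frame budget",
            elapsed
        );
    }
}
```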
For any metric there are two cases we care about: multithreaded and single-threaded. The multithreaded case is in general where we expect to see better performance, but the single-threaded case is still important because we expect the typical developer experience to be starting with the single-threaded API (it is, after all, much simpler to avoid threads entirely) and moving to the multithreaded one only if performance is not acceptable.
Scenarios
Based on these categories I've tried to come up with some scenarios which I think capture the kinds of things we want to be fast at. Quite a few are inspired by the crdt-benchmarks work done by Kevin Jahns of y.js. For each scenario we should measure three things:
- The time taken to apply each change to the frontend and generate a corresponding local change to send to the backend (if applicable).
- The time taken to apply each resulting patch to the frontend
- The total time from starting the scenario to the final patch being applied
Separating things this way allows us to measure the end-to-end latency, as well as ensuring that we never block the UI thread.
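As a sketch of what such a harness could look like (the `Frontend`/`Backend` types and method names below are stubs I've made up for illustration, not the actual Automerge API):

```rust
use std::time::Instant;

// Stub types standing in for the real frontend/backend split; the
// method names are hypothetical placeholders.
struct Edit;
struct Change;
struct Patch;
struct Frontend;
struct Backend;

impl Frontend {
    fn apply_local_edit(&mut self, _edit: Edit) -> Change { Change }
    fn apply_patch(&mut self, _patch: Patch) {}
}

impl Backend {
    fn apply_change(&mut self, _change: Change) -> Patch { Patch }
}

fn run_scenario(edits: Vec<Edit>) {
    let mut frontend = Frontend;
    let mut backend = Backend;
    let scenario_start = Instant::now();

    for edit in edits {
        // 1. Time to apply the edit to the frontend and produce the
        //    local change that gets sent to the backend.
        let t = Instant::now();
        let change = frontend.apply_local_edit(edit);
        let edit_latency = t.elapsed();

        let patch = backend.apply_change(change);

        // 2. Time to apply the resulting patch back to the frontend.
        let t = Instant::now();
        frontend.apply_patch(patch);
        let patch_latency = t.elapsed();

        println!("edit: {:?}, patch: {:?}", edit_latency, patch_latency);
    }

    // 3. Total time from starting the scenario to the final patch.
    println!("end to end: {:?}", scenario_start.elapsed());
}

fn main() {
    run_scenario(vec![Edit, Edit, Edit]);
}
```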
I also think each scenario should be measured both for top-level and deeply nested keys, on the basis that traversing the document could easily become an expensive operation if we are not careful, so we should make sure we are testing for that.
| Scenario | Proxied user behaviour |
|---|---|
| Inserting N characters 1 at a time | User typing text into a document |
| Inserting N characters all at once | User pasting text into a document |
| Deleting N consecutive characters all at once | User selecting and deleting some text |
| Inserting N characters at random points in a text object | User typing text |
| Inserting or deleting at a random point in a text object N times | User typing text |
| Appending a map to a list of maps N times | User adding a complex value to a document in a continuous fashion, such as when adding the (x,y) coordinates of a free hand drawn path |
| Inserting a map at random points in a list | User interacting with an application that manipulates complex data in a list, such as a todo list |
| Inserting or deleting at a random point in a list N times | User modifying data in a complex application such as a todo list |
| Inserting a timestamp N times | User generating timestamped data in a continuous fashion |
| Loading a compressed document with N operations in it | User loading a saved document |
| Loading a sequence of N changes, each with 1 operation in it | User loading a series of uncompressed, saved changes |
| Loading a sequence of N conflicting changes to the same property, each with M operations in it | Receiving changes from peers over the network |
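To give a flavour of how a row in this table could be driven, here's a sketch of a generator for the "inserting N characters at random points in a text object" scenario. The function name is made up and the `rand` crate is an assumption; a fixed seed keeps runs reproducible, so results are comparable between revisions:

```rust
use rand::{rngs::StdRng, Rng, SeedableRng}; // rand = "0.8"

// Generates the edit sequence for the "inserting N characters at
// random points in a text object" scenario. The (position, char)
// pairs would then be fed through the timing harness above.
fn random_insert_scenario(n: usize) -> Vec<(usize, char)> {
    let mut rng = StdRng::seed_from_u64(42); // deterministic runs
    let mut text_len = 0;
    let mut edits = Vec::with_capacity(n);
    for _ in 0..n {
        // Insert anywhere in the text as it exists so far.
        let pos = rng.gen_range(0..=text_len);
        edits.push((pos, 'x'));
        text_len += 1;
    }
    edits
}

fn main() {
    let edits = random_insert_scenario(10_000);
    println!("generated {} edits", edits.len());
}
```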
I would love to hear other people's ideas about what we should be measuring!
This looks great, thanks for putting it together! One more suggestion for a scenario would be "deleting N consecutive characters all at once". This is an interesting case because it requires sending the IDs of all the deleted characters.
Besides the CPU time taken, it might also be interesting to record the size of the message sent over the network for each user operation. However, this is unlikely to change much once we've frozen the data format.
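For illustration, a minimal sketch of recording that size, assuming a hypothetical `Change` type whose `encode` method returns the wire bytes (the real serialisation API may differ):

```rust
// Hypothetical Change type: encode() stands in for whatever the
// frozen data format's serialisation entry point turns out to be.
struct Change {
    bytes: Vec<u8>,
}

impl Change {
    fn encode(&self) -> &[u8] {
        &self.bytes
    }
}

// The encoded bytes are what travels over the network, so their
// length is the per-operation message size to record alongside
// CPU time.
fn message_size(change: &Change) -> usize {
    change.encode().len()
}

fn main() {
    let change = Change { bytes: vec![0; 120] };
    println!("message size: {} bytes", message_size(&change));
}
```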
> adding the (x,y) coordinates of a free hand drawn path

For such an application, rather than representing it as a list of maps with `{x, y}` coordinates, I would suggest representing it as a flat list of numbers, alternating between x and y coordinates. That will probably be more compact and faster.
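To illustrate the two shapes (plain Rust stand-ins for what would really be Automerge list elements):

```rust
// Two ways of modelling a freehand path, shown as plain Rust data
// for illustration only.
struct Point {
    x: f64,
    y: f64,
}

fn main() {
    // 1. List of maps: one map with "x" and "y" keys per point.
    //    JSON shape: [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]
    let as_maps = vec![Point { x: 1.0, y: 2.0 }, Point { x: 3.0, y: 4.0 }];

    // 2. Flat list of numbers alternating x and y -- two plain list
    //    inserts per point instead of a map plus two keyed values,
    //    so fewer objects and fewer operations.
    //    JSON shape: [1.0, 2.0, 3.0, 4.0]
    let as_flat_list: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0];

    assert_eq!(as_maps.len() * 2, as_flat_list.len());
}
```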
Good shout on the delete benchmark, I'll add that.
Re. the (x,y) coordinates: I can see that representing them as a flat list would be more efficient. Part of the reason I put that benchmark in was to capture the idea of complex continuous actions of some kind (such as drawing, or recording sound). In general we can't predict what data people might need to store, so I figured a map was a good general-purpose benchmark. I can see an argument that we shouldn't be optimising for this (probably quite niche) use case, and that just advising users to use a flat list approach makes sense. Is that your thinking?
I suppose my other thought about a flat list is that it doesn't really support concurrent modifications of the path very nicely. Although I guess inserting and removing pairs should still commute.
Fair point — happy to have a list of maps on the benchmark. Even if we advise against it, people will probably do it anyway, so we should try to make it fast 😀