
Improve benchmark performance

cBournhonesque opened this issue 8 months ago • 8 comments

Benchmarks show that it takes 1.3 ms to replicate 1000 entities (replicon takes 30us). Why?

With a lot of tracing spans, it's 3ms (because of the tracing overhead):

  • send_entity_spawn takes 530us

    • ReplicationSend::prepare_entity_spawn is 85us (because we are allocating; the overall memory should be re-used, but the hashmap keyed by the replication_group_id is probably very inefficient?)
  • send_component_update is 686us

    • ReplicationSend::prepare_component_insert is 178us
    • rest is probably iteration + serialization?
  • networking::send is 1.13ms

    • buffer_replication_message is 790us
      • finalize is 115us
      • then there is serializing (not tracked)
      • buffer_send_with_priority is 170us (buffering into the message manager)
    • send_packets is only 335us
      • message_manager.send_packets (that collects the messages to send from channels and builds the packets) is 190us
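
(For reference, these numbers come from tracing spans around each step. A minimal sketch of such a span, assuming the standard tracing crate rather than lightyear's exact instrumentation:)

```rust
use tracing::info_span;

fn send_entity_spawn(/* ... */) {
    // The span stays entered for the rest of this scope; the subscriber
    // records its elapsed time, which is what the numbers above measure.
    let _span = info_span!("send_entity_spawn").entered();
    // ... actual replication work ...
}
```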

Also here are the ChannelSendStats:

ChannelSendStats {
        num_single_messages_sent: 1000,
        num_fragment_messages_sent: 0,
        num_bytes_sent: 27000,
},
  • maybe it's not optimal to send a lot of individual messages, because we generate one MessageId per individual message?
  • here the stats don't even take the MessageId into account, it's just the raw message bytes. 27000 bytes / 1000 messages = 27 bytes per message, which seems pretty steep!
    • 1-2 bits for ReplicationMessage (but it shouldn't be needed because we are in the EntityActions Channel!)
    • group_id = u64 = 8 bytes
    • 1 bit for Action vs Updates
    • 2 bytes for MessageId (the "sequence id" for the replication group)
    • the length of the vec: encoded with gamma encoding so probably at most 1 byte (here 2 bits)
    • the entity: u64 so 8 bytes
    • SpawnAction: I think only 2 bits?
    • insert: 2 bits for the length, 2 bytes for the ComponentNetId, 4 bytes for the float
    • remove: 1 bit for the length of the empty hashset
    • updates: 1 bit for the length of the vec. Total: 24 bytes + 11 bits ≈ 26 bytes (I don't know where the extra 1 byte comes from). That's pretty steep. The main reason is that Entity is 8 bytes; maybe we could gamma-code it as a (index, generation) pair, since both the index and the generation should be pretty low (see the sketch below)? Another reason is that we encode both ReplicationId and Entity, which are the same here.
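
A quick sketch of that idea: encode the entity's index and generation as two separate variable-length integers, since both values are usually small. This is a hypothetical LEB128-style varint, not lightyear's actual wire format (gamma coding would be similar in spirit):

```rust
/// Hypothetical varint (LEB128) encoder: small values take 1 byte instead of 8.
fn write_varint(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // set the continuation bit
    }
}

/// Encode an entity as (index, generation) instead of one raw u64:
/// for index = 42, generation = 1 this takes 2 bytes instead of 8.
fn write_entity(out: &mut Vec<u8>, index: u32, generation: u32) {
    write_varint(out, index as u64);
    write_varint(out, generation as u64);
}
```

For 1000 freshly spawned entities the index fits in 1-2 bytes and the generation in 1 byte, so this alone could shave roughly 5 bytes off the 27-byte message.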

Potential ideas:

  • send_entity_spawn

    • uses a double hashmap to store data. In particular the allocated memory of the second hashmap cannot be recovered!
  • networking::send

    • buffer_replication_message
      • we serialize twice because of bitcode quirks currently
      • we allocate new EntityActions/EntityUpdates message instead of re-using existing ones
      • serialize directly into a cursor without intermediate data structures? I'm not sure that's possible if we want to keep the ReplicationGroup guarantees, which replicon doesn't have. Creating some entities ahead of time to re-use allocations in prepare_component_insert seems to bring a small (5%?) improvement. But since we already buffer the per-replication-group data, the final message could be written manually into a cursor (see the sketch after this list)
      • replicon writes all entity-actions into one message, which might become big and have to be split up (bad under packet loss). We have one message per ReplicationGroup. That's also why replicon can write into a cursor efficiently: it writes all the despawns (with entity), then all the removals (with entity), then all the insertions (with entity)
      • should we just update our message-packing?
      • all the component updates for an entity are iterated through sequentially (which shouldn't make a difference for this benchmark), so they can be serialized directly in order?
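
A rough sketch of the buffer re-use + cursor idea mentioned above. All names here are hypothetical (not lightyear's actual types); the point is keeping one scratch Vec per connection alive across ticks and clearing it, instead of allocating a fresh EntityActions message every time:

```rust
use std::io::{Cursor, Write};

/// Hypothetical scratch space kept per connection so the allocation
/// survives across ticks; only its contents are cleared on each send.
struct ReplicationScratch {
    buf: Vec<u8>,
}

impl ReplicationScratch {
    /// Write one replication-group message into the re-used buffer.
    /// `actions` stands for the per-group data that was already buffered.
    fn serialize_group(&mut self, group_id: u64, actions: &[u8]) -> std::io::Result<&[u8]> {
        self.buf.clear(); // keep the capacity, drop the old contents
        let mut cursor = Cursor::new(&mut self.buf);
        cursor.write_all(&group_id.to_le_bytes())?; // could be a varint as discussed above
        cursor.write_all(&(actions.len() as u16).to_le_bytes())?; // length prefix
        cursor.write_all(actions)?; // the pre-buffered actions/updates
        Ok(&self.buf)
    }
}
```

Since the per-group data is already buffered before this step, writing the final message this way avoids building an intermediate EntityActions value at all, at the cost of hand-rolling the wire layout.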

cBournhonesque · May 30 '24 04:05