
Delays with message delivery to different actor groups

Open darthur11 opened this issue 10 months ago • 4 comments

Hi @loyd,

Last time I asked about topologies and you helped a lot. I'm kind of a fan of elfo, but at the moment I'm having an issue with message delivery delays.

Let me start with a short intro. We have a topology which consists of several actor groups. Let's discuss these five:

pub fn topology(/* ... */) -> elfo::Topology {
    // ... actor group definitions elided ...

    decision_executor.route_to(&ex1_order_canceller_buy, |envelope| {
        msg!(match envelope {
            protocol::orders::OrderToCancel => true,
            _ => false,
        })
    });
    decision_executor.route_to(&ex1_order_canceller_sell, |envelope| {
        msg!(match envelope {
            protocol::orders::OrderToCancel => true,
            _ => false,
        })
    });
    decision_executor.route_to(&ex2_order_canceller_buy, |envelope| {
        msg!(match envelope {
            protocol::orders::OrderToCancel => true,
            _ => false,
        })
    });
    decision_executor.route_to(&ex2_order_canceller_sell, |envelope| {
        msg!(match envelope {
            protocol::orders::OrderToCancel => true,
            _ => false,
        })
    });

    // ... mounting of the groups and return of the topology elided ...
}

Then we're running it like this:

#[tokio::main] // runtime attribute assumed here; it was elided in the original snippet
async fn main() -> Result<()> {
    elfo::init::start(topology(/* ... */)).await;
    Ok(())
}

And here is the log at Trace level, so all messages are shown here as well:

Trace level logs:
2025-02-04 18:31:24.954678000 TRACE [7536561770660367745] decision_executor._ - > OrderToCancel { position_id: "5b7f3819-3a66-4393-87d7-f5ead2721684", order_id: "327a000083219", quantity: 0.010, price: 99023.5, is_entry: false, account_side: Buy }	_location=elfo-core-0.2.0-alpha.16/src/context.rs:429
2025-02-04 18:31:24.954789000 TRACE [7536561770660367745] exchange_1_order_canceller_sell._ -  OrderToCancel { position_id: "5b7f3819-3a66-4393-87d7-f5ead2721684", order_id: "327a000083219", quantity: 0.010, price: 99023.5, is_entry: false, account_side: Buy }	_location=elfo-core-0.2.0-alpha.16/src/context.rs:429
2025-02-04 18:31:24.955017000 TRACE [7536561770660367745] exchange_2_order_canceller_sell._ -  OrderStates { statuses: [OpenOrder(OpenOrder { exchange: exchange_1, position_id: "5b7f3819-3a66-4393-87d7-f5ead2721684", order_id: "327a000083219", symbol: "BTC_USD", side: Sell, order_type: Limit, price: 99023.5, filled_price: 99023.5, quantity: 0.010, filled_quantity: 0, is_entry: false, account_side: Buy, exchange_time: 1738693884269000 }), OpenOrder(OpenOrder { exchange: exchange_1, position_id: "5b7f3819-3a66-4393-87d7-f5ead2721684", order_id: "32754014154", symbol: "BTC_USD", side: Sell, order_type: Limit, price: 99024, filled_price: 99024, quantity: 0.010, filled_quantity: 0, is_entry: false, account_side: Buy, exchange_time: 1738693884269000 })] }	_location=elfo-core-0.2.0-alpha.16/src/context.rs:429
2025-02-04 18:31:25.044390000 TRACE [7536561770660367750] exchange_1_order_statuses_aggregator._ - > OrderStates { statuses: [OpenOrder(OpenOrder { exchange: exchange_1, position_id: "5b7f3819-3a66-4393-87d7-f5ead2721684", order_id: "327a000083219", symbol: "BTC_USD", side: Sell, order_type: Limit, price: 99023.5, filled_price: 99023.5, quantity: 0.010, filled_quantity: 0, is_entry: false, account_side: Buy, exchange_time: 1738693884269000 }), OpenOrder(OpenOrder { exchange: exchange_1, position_id: "5b7f3819-3a66-4393-87d7-f5ead2721684", order_id: "32754014154", symbol: "BTC_USD", side: Sell, order_type: Limit, price: 99024, filled_price: 99024, quantity: 0.010, filled_quantity: 0, is_entry: false, account_side: Buy, exchange_time: 1738693884269000 })] }	_location=elfo-core-0.2.0-alpha.16/src/context.rs:429
2025-02-04 18:31:25.044406000 TRACE [7536561770660367750] collectors.general - 

The main issue we're having right now is that messages reach different actor groups with a significant delay.

  • We generated the message at 2025-02-04 18:31:24.954678000
  • Then it reached the first canceller at 2025-02-04 18:31:24.954789000 (which is okay, since only 111 micros were spent)
  • The second canceller was reached at 2025-02-04 18:31:24.954886000 (which is normal, since it took 208 micros to reach the second actor group)
  • The third was reached at 2025-02-04 18:31:25.094191000 (which is bad, since it took 139513 micros, i.e. ~139 millis)
  • And finally, the fourth was reached at 2025-02-04 18:31:25.171032000 (which is disappointing, since it took 216354 micros to finally reach the fourth actor group)

I can't understand why this is happening, because in my understanding messages should reach their destination in a sub-millisecond timeframe. Could you please suggest any ideas for debugging/profiling the code?

P.S. We are not using Arc here to send only references; we're pushing the whole struct into ctx. Maybe that could be an issue.

Thanks a lot!

darthur11 avatar Feb 05 '25 13:02 darthur11

@darthur11 sorry for the late response.

Assuming you're compiling in release mode, none of your latencies are okay if the destination actors aren't overloaded. The expected latency (if the destination mailbox is empty, trace-level logging is disabled, and the mimalloc or tcmalloc allocator is used) is around one microsecond, and sub-microsecond if the destination actor is already running, so there is no need to wake up the tokio task.
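
For reference, switching the global allocator is a one-liner on the application side (a minimal sketch, assuming the mimalloc crate is added as a dependency; this is independent of elfo):

use mimalloc::MiMalloc;

// Route all heap allocations through mimalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;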

However, if the destination actor is overloaded (so there are spikes in the number of messages in the mailbox), this latency increases (Little's law).
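
Concretely, Little's law gives the average waiting time as W = L / λ, where L is the average number of messages sitting in the mailbox and λ is the actor's throughput. As a purely illustrative example (not measured from your logs): ~1,400 queued messages handled at ~10,000 messages per second already means ~140 ms of waiting, which is the order of magnitude you observed.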

Elfo provides different metrics to localize problems:

  1. elfo_message_waiting_time_seconds shows how long messages wait in mailboxes. High values usually mean that the destination actor is busy; most likely, you will see large numbers here.
  2. elfo_message_handling_time_seconds shows (with per-message labeling) how long message handling takes.
  3. sum(rate(elfo_message_handling_time_seconds_sum)) shows the saturation. Without the outer sum, you can see per-message saturation to understand which messages are worth optimizing.
  4. elfo_busy_time_seconds shows how long the task's poll() runs; rate(elfo_busy_time_seconds_sum) shows the CPU usage of an actor.

More info in the actoromicon.

So, I recommend checking these metrics to understand whether the destination actor is overloaded or not.
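
For example, queries along these lines can be a starting point (a sketch assuming a Prometheus-style setup; label names such as actor_group and message are assumptions and may differ in your configuration):

# Overall saturation of each group: fraction of wall-clock time spent handling messages.
sum(rate(elfo_message_handling_time_seconds_sum[1m])) by (actor_group)

# The same broken down per message type, to see which messages are worth optimizing.
sum(rate(elfo_message_handling_time_seconds_sum[1m])) by (actor_group, message)

# CPU usage of actors (busy time inside poll()).
sum(rate(elfo_busy_time_seconds_sum[1m])) by (actor_group)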


Also, your routing table is a bit strange. I would expect one group for all order cancellers with exchange as an actor key instead of sending the same message to multiple groups.
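
For illustration only, a single canceller group keyed by exchange could look roughly like this (a sketch assuming the 0.2-alpha API with ActorGroup, MapRouter, and Outcome; it also assumes OrderToCancel carries an exchange field, which it doesn't in the snippet above, and order_canceller is a hypothetical exec function):

use elfo::prelude::*;
use elfo::routers::{MapRouter, Outcome};

pub fn order_cancellers() -> Blueprint {
    ActorGroup::new()
        .router(MapRouter::new(|envelope| {
            msg!(match envelope {
                // Route each message to the actor responsible for its exchange;
                // the key could also be (exchange, side) if buy/sell must stay separate.
                protocol::orders::OrderToCancel { exchange, .. } =>
                    Outcome::Unicast(exchange.clone()),
                _ => Outcome::Default,
            })
        }))
        .exec(order_canceller)
}

The topology then needs a single decision_executor.route_to(&order_cancellers, ...) entry instead of four.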

loyd avatar Feb 12 '25 07:02 loyd

Wow, thank you very much for the explanation, @loyd. I will dig deeper into the metrics to localize the problem. I also thought that a reasonable time for a message to reach its destination should be in micros or even nanos, definitely not millis. As for the topology, you're absolutely right, and we're planning to use an actor group + exchange key.

If you don't mind, I will keep this issue open for a few days, in case I have more comments/questions on the topic.

darthur11 avatar Feb 13 '25 15:02 darthur11

I will keep this issue for a few days

Yes, sure, feel free to ask more; I'll be happy to help you.

should be in micros or even nanos, not millis for sure

Definitely

we're planning to use Actor group + exchange key

To preserve detailed metrics per exchange, I recommend enabling the per_actor_key telemetry option (https://docs.rs/elfo/0.2.0-alpha.17/elfo/config/system/telemetry/struct.TelemetryConfig.html#structfield.per_actor_key) to produce per-actor metrics. It's useful for small groups.
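
A sketch of what enabling it could look like in the config file, assuming a group named order_cancellers and a per-group system.telemetry section (adjust to your actual group names and config layout):

[order_cancellers.system.telemetry]
per_actor_key = true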


Also, do you use prometheus/victoriametrics to collect metrics? I can share the Grafana dashboard (part of it is shown on the main page of https://actoromicon.rs/), which is very useful for understanding what's going on with actors because it shows all these important metrics.

loyd avatar Feb 15 '25 08:02 loyd

Yeah, I use Prometheus with Grafana for basic metrics, but having your example would help a lot as well. I can get more inspiration for metrics.

Thanks!

darthur11 avatar Feb 16 '25 11:02 darthur11