warp
warp copied to clipboard
Composing `Or` filters as an imbalanced tree causes signficialy longer build times
In a project I'm working on, I was able to cut debug build times by ~65%, from 6 minutes to 2 minutes, by building a more balanced Or tree of filters. Note that I optimized this code for readability and did not make the tree as balanced as it could be, an optimally balanced tree should yield even better results.
It might be useful to others it this was documented somewhere. Another thing that could make things easier is for warp to provide an or! macro that takes a flat list of filters and automatically builds an optimally balanced tree for them, in a way that preserves the filters order priority.
Having some sort of macro would be ideal. I'm not sure it's possible to implement in a macro_rules!, unfortunately.
Right, I was also wondering if its actually possible to implement this logic with a macro or not...
But it could also be a regular function and not a macro, which should yield pretty similar results, right?
How could it be implemented as a function? Unless you're taking dyn args, you'd have to accept a Vec (which would restrict types) to allow for an arbitrary number of parameters. The only other way I can think is to implement some trait for tuples up to a certain size, where the or-ing is manually optimized for each.
The latter might work? It's not a terrible idea, honestly.
Hi, this also helps easing on compile times, check https://github.com/seanmonstar/warp/pull/273 and https://github.com/seanmonstar/warp/issues/507#issuecomment-615974062
@jhpratt nerd-sniped me with their comment above that "I'm not sure it's possible to implement in a macro_rules!, unfortunately." I made a macro to build a balanced tree of or filters. (Here is a playground link to a version that just builds node trees if you want to play with it.)
/// Via: https://gist.github.com/durka/9fc479de2555225a787f
/// Modified to handle cases with an odd number of exprs and call
/// the callback macro on left and right separately
///
/// strategy:
/// - first copy the list of exprs -- the copy will just be used as a counter
/// - then, starting with all the exprs in a "right" list and an empty "left" list:
/// - move one expr at a time from the start of "right" to the end of "left", and decrement the counter _twice_
/// - when the counter is zero, the halving is complete
/// - if there were an odd number of exprs, the left hand result will have one more expr than the right
/// the macro is in continuation passing style, so you pass in a macro invocation and it will be called
/// once with both "left" and "right" (each enclosed in square brackets) inserted as the last arguments
macro_rules! _split_evenly {
// internal rules
// these rules split the exprs into "left" and "right"
// - finished: hand off to the continuation macro
(@split [$mac:ident!($($arg:tt),*)] $left:tt $right:tt [])
=> { $mac!($($arg,)* ($left, $right)) };
// - counter = 1 or 2
(@split $mac:tt [$($left:expr),*] [$head:expr, $($tail:expr),+] [$a:expr $(,$b:expr)?])
=> { _split_evenly!(@split $mac [$($left,)* $head] [$($tail),+] []) };
// - counter > 2
(@split $mac:tt [$($left:expr),*] [$head:expr, $($tail:expr),+] [$a:expr, $b:expr, $($more:expr),+])
=> { _split_evenly!(@split $mac [$($left,)* $head] [$($tail),+] [$($more),+]) };
// external entry point rule
($mac:ident!($($arg:tt),*): $($x:expr),*)
=> { _split_evenly!(@split [$mac!($($arg),*)] [] [$($x),*] [$($x),*]) }
// ^ continuation ^ left ^ right ^ counter
}
macro_rules! balanced_or_tree {
// callback from _split_evenly
(_callback, ([$($x:expr),*], [$($y:expr),*]))
=> { (balanced_or_tree!($($x),*)).or(balanced_or_tree!($($y),*)) };
// entrypoints
// one expression (so e.g. commenting all but one arg works)
($x:expr $(,)?) => { $x };
// multiple expressions
($($x:expr),* $(,)?) => { _split_evenly!(balanced_or_tree!(_callback): $($x),*) };
}
I tested this on @shesek's repo, and this macro invocation:
let handlers = balanced_or_tree!(
hd_wallets_handler,
hd_wallet_handler,
hd_key_handler, // needs to be before spk_handler to work fpr keys that don't have any indexed history
hd_gap_handler,
hd_next_handler,
spk_handler,
spk_utxo_handler,
spk_stats_handler,
spk_txs_handler,
spk_txs_compact_handler,
tx_handler,
tx_verbose_handler,
tx_hex_handler,
tx_proof_handler,
txs_since_handler,
txs_since_compact_handler,
tx_broadcast_handler,
txo_handler,
utxos_handler,
sse_handler,
spk_sse_handler,
block_tip_handler,
block_header_handler,
block_hex_handler,
block_height_handler,
mempool_histogram_handler,
fee_estimate_handler,
dump_handler,
debug_handler,
sync_handler,
warp::any().map(|| StatusCode::NOT_FOUND)
)
.with(warp::log("bwt::http"))
.with(warp::reply::with::headers(headers));
turned into this code:
let handlers =
(((((hd_wallets_handler).or(hd_wallet_handler)).or((hd_key_handler).or(hd_gap_handler)))
.or(((hd_next_handler).or(spk_handler)).or((spk_utxo_handler).or(spk_stats_handler))))
.or((((spk_txs_handler).or(spk_txs_compact_handler))
.or((tx_handler).or(tx_verbose_handler)))
.or(((tx_hex_handler).or(tx_proof_handler))
.or((txs_since_handler).or(txs_since_compact_handler)))))
.or(
((((tx_broadcast_handler).or(txo_handler)).or((utxos_handler).or(sse_handler)))
.or(((spk_sse_handler).or(block_tip_handler))
.or((block_header_handler).or(block_hex_handler))))
.or((((block_height_handler).or(mempool_histogram_handler))
.or((fee_estimate_handler).or(dump_handler)))
.or(((debug_handler).or(sync_handler)).or(warp::any().map(|| StatusCode::NOT_FOUND)))),
)
.with(warp::log("bwt::http"))
.with(warp::reply::with::headers(headers));
Basically, it seems to work. I observed the following incremental build times on my 3900X when switching between the different approaches:
before improvement
$ time cargo +nightly build
Compiling bwt v0.1.3 (/repos/bwt)
Finished dev [unoptimized + debuginfo] target(s) in 2m 08s
real 2m8.545s
user 2m18.774s
sys 0m1.698s
after manual improvement (commit above)
$ time cargo +nightly build
Compiling bwt v0.1.3 (/repos/bwt)
Finished dev [unoptimized + debuginfo] target(s) in 47.51s
real 0m47.528s
user 0m57.800s
sys 0m1.332s
original code but converted to balanced_or_tree
$ time cargo +nightly build
Compiling bwt v0.1.3 (/repos/bwt)
Finished dev [unoptimized + debuginfo] target(s) in 35.89s
real 0m35.912s
user 0m40.395s
sys 0m1.340s
I think this is enough of an improvement that it's worth considering polishing this and bringing it into warp.
Edit: improved macro, playground link
I'll definitely take a look to attempt to understand how that works, but it certainly looks promising. Maybe someday I'll be able to write complex macro_rules! like that myself.
@seanmonstar do you think this is a sensible candidate to include in warp proper?
I'll definitely take a look to attempt to understand how that works, but it certainly looks promising. Maybe someday I'll be able to write complex macro_rules! like that myself.
For what it's worth, this the first macro I've written with this complexity, I've written a few "repeat this with some things changed" macros before but that's it. The biggest help in writing this was having a well documented "split a list" macro already shared (see the link at the top of the doc comment). I tried to keep the comments on the macro up to date with my tweaks.
Someone with more experience in macro_rules! may be able to spot some issues or improvements. For example, I bracket the first expression in the or pair - I'm not sure if there's any point in doing so, but I'm trying to play it safe for the general case.
I've just simplified the macro provided by @SafariMonkey. It only has four rules, two of which are internal. While likely having near zero impact on the actual performance, I've changed the bracket delimiters to semicolons, reducing the number of tokens. Additionally, everything is boxed in debug mode, which significantly improves compilation speed. This can be trivially disabled by removing the debug_boxed! call in the first arm.
macro_rules! balanced_or_tree {
($x:expr $(,)?) => { debug_boxed!($x) };
($($x:expr),+ $(,)?) => {
balanced_or_tree!(@internal ; $($x),+; $($x),+)
};
(@internal $($left:expr),*; $head:expr, $($tail:expr),+; $a:expr $(,$b:expr)?) => {
(balanced_or_tree!($($left,)* $head)).or(balanced_or_tree!($($tail),+))
};
(@internal $($left:expr),*; $head:expr, $($tail:expr),+; $a:expr, $b:expr, $($more:expr),+) => {
balanced_or_tree!(@internal $($left,)* $head; $($tail),+; $($more),+)
};
}
#[cfg(debug_assertions)]
macro_rules! debug_boxed {
($x:expr) => {
::warp::Filter::boxed($x)
};
}
#[cfg(not(debug_assertions))]
macro_rules! debug_boxed {
($x:expr) => {
$x
};
}
I like what you've done with it. The inlining of the splitting logic is clever, and having fewer rules is nice.
Unfortunately, I think that readability has somewhat suffered from the mixing of concerns and the lack of documentation on the macro or the individual rules. I understand them because I spent significant time working with the original, but I gained a lot from the documentation on that macro. I would strongly recommend adding some comments to this macro explaining the purpose of each rule, especially the internal rule flow, in a similar manner to how the original macro was documented.
But overall, I'm glad to see progress on this. Let me know if there's anything you'd like help with.
Of course documentation isn't present in the sample — I still hardly understand how the macro works. I basically just spent a little time hacking away to inline the splitting and reducing the number of rules. With documentation, I think this would be a sensible addition to warp.
I went ahead and added some comments from my understanding of the code. I don't know how much of the commentary is helpful/needed, so if you think any of it is unnecessary or overly verbose, feel free to remove or amend it. I thought it was probably better to braindump now and polish it later.
/// Takes a list of handler expressions and `or`s them together
/// in a balanced tree. That is, instead of `a.or(b).or(c).or(d)`,
/// it produces `(a.or(b)).or(c.or(d))`, thus nesting the types
/// less deeply, which provides improvements in compile time.
///
/// It also applies `::warp::Filter::boxed` to each handler expression
/// when in `debug_assertions` mode, improving compile time further.
//
// The basic list splitting algorithm here is based on this gist:
// https://gist.github.com/durka/9fc479de2555225a787f
// It uses a counter from which two items are removed each time,
// stopping when the counter reaches 0. At each step, one item
// is moved from the left to the right, and thus at the end,
// there will be the same number of items in each list.
//
// The flow is as follows:
// - If there is one handler expression, debug_box it and return.
// - If there is more than one handler expression:
// - First, copy the list into two: the one that will go into the
// right side of the `or`, and one that will serve as a counter.
// Recurse with these separated by semicolons, plus an empty `left`
// list before the first semicolon.
// - Then, as long as there are at least two items in the counter
// list, remove them and move the first item on the right side of
// the first semicolon (`head`) to the left side of the first semicolon.
// - Finally, when there are one or zero items left in the counter,
// move one last item to the left, make the call this macro on both the
// left and right sides, and `or` the two sides together.
//
// For example, balanced_or_tree!(a, b, c, d, e) would take the following steps:
//
// - balanced_or_tree!(a, b, c, d, e)
// - balanced_or_tree!(@internal ; a, b, c, d, e ; a, b, c, d, e) // initialise lists
// - balanced_or_tree!(@internal a ; b, c, d, e ; c, d, e) // move one elem; remove two
// - balanced_or_tree!(@internal a, b ; c, d, e ; e) // now only one elem in counter
// - balanced_or_tree!(a, b, c).or(balanced_or_tree(d, e)) // recurse on each sublist
macro_rules! balanced_or_tree {
// Base case: just a single expression, return it wrapped in `debug_boxed`
($x:expr $(,)?) => { debug_boxed!($x) };
// Multiple expressions: recurse with three lists: left, right and counter.
($($x:expr),+ $(,)?) => {
balanced_or_tree!(@internal ; $($x),+; $($x),+)
// ^ left ^ right ^ counter
};
// Counter 1 or 2; move one more item and recurse on each sublist, and or them together
(@internal $($left:expr),*; $head:expr, $($tail:expr),+; $a:expr $(,$b:expr)?) => {
(balanced_or_tree!($($left,)* $head)).or(balanced_or_tree!($($tail),+))
};
// Counter > 2; move one item from the right to the left and subtract two from the counter
(@internal $($left:expr),*; $head:expr, $($tail:expr),+; $a:expr, $b:expr, $($more:expr),+) => {
balanced_or_tree!(@internal $($left,)* $head; $($tail),+; $($more),+)
};
}
LGTM.
@SafariMonkey @jhpratt Wow, this is pretty amazing, thank you!
Awesome work!
I think it is worth putting into example directory.
Thanks, this one reduced my debug compilation times from 4+ minutes to 28 seconds. I had to extract all handlers to separate variables to understand why the macro made the compiler angry initially, it turned out that the my websocket filter cannot be boxed, not sure why. I decided to share my experience, since this is the seconds time I try it, but it's worth it.
FWIW, I stumbled across this one, I think the underlying issue here may be this rustc issue: https://github.com/rust-lang/rust/issues/68666 (the pattern of the minimal reproducer looks remarkably similar at any rate!)
I have a warp 0.2.5 app with about 20 route handlers that was suffering 30+ minute incremental compile times. I had to rearrange my app quite a bit to use the macro since it doesn't seem to work if I pass it filter functions (like the ones in the todo example). I think this is similar to the issue @sashomasho mentioned. Once I converted the filter functions to straight variables in main.rs and passed them to the macro, I got a compile time of a little over 3 minutes. 10x speed up 😀 . Thanks for sharing the macro.
I do wish I could get it to under a minute, but it's definitely bearable now.
@SafariMonkey thanks so much for this macro! It reduced my debug compile times for thoughts.page from ~3m56s to ~14s, which is a huge relief.
It would be great to get a macro like this into warp — perhaps it would be best to split it into to two macros, one to balance or trees, and the other to take a whole tree and debug_box it?