
Optimize implementation for when there are many frames

Open jonhoo opened this issue 6 years ago • 27 comments

#99 identified that performance drops significantly when there are many frames in the output (see https://github.com/jonhoo/inferno/pull/99#issuecomment-473726169). We should figure out why that is, and fix it! Maybe it's time to flamegraph inferno ;)

jonhoo avatar Mar 18 '19 23:03 jonhoo

As a start to this, I profiled and generated a flame graph of running inferno-flamegraph with --reverse to create many frames. It looks like a lot of time is spent in svg::write_str and filled_rectangle. Both of these spend most of the time adding attributes. I think implementing #50 will reduce the number of attributes written?

Very little time is spent in merge::frames. It's really just all of the attribute creation and formatting/writing the SVG tags that's taking most of the time.


flamegraph.svg.zip

jasonrhansen avatar Mar 21 '19 02:03 jasonrhansen

Hmm, that push_attribute stack looks suspicious. Specifically, it seems to spend a bunch of time reserving additional memory in Vec::extend, which I thought we got rid of in #58 (which fixed #41). @JordiChauzi any idea why that might be happening?

There's also a decent amount of time spent in number formatting there. I wonder whether we can use dtoa to significantly speed that up? Or alternatively if @bcmyers' num-format crate now has support for floating point, that may be able to take that job?

The buffered writes to output we can't do much about, though you are right that #50 should help a bit with that, since there'll just be less stuff to write overall.

jonhoo avatar Mar 21 '19 02:03 jonhoo

svg::write_str wasn't reusing a cached svg::Event like everything else. I submitted #108 to fix this.

jasonrhansen avatar Mar 21 '19 15:03 jasonrhansen

So I guess now that #108 and #110 have landed, we're probably down to number formatting being the biggest one?

jonhoo avatar Mar 24 '19 23:03 jonhoo

Yes, I think so, but I'm not sure what we should do to improve it. Even though dtoa would be faster, I don't think it will allow us to format the way we need it (with at most two fractional digits).

jasonrhansen avatar Mar 25 '19 04:03 jasonrhansen

Maybe @dtolnay or @bcmyers can give us a hint here? ^

jonhoo avatar Mar 25 '19 13:03 jonhoo

If the bottleneck is formatting floats with at most two digits after the decimal point, you will want to fork one or both of dtoa or ryu and specialize it for that representation. I don't know which one would turn out faster for this use case.

Is the input f32 or f64? Can you provide more detail about the exact output format you need? Never exponential notation?

The ryu code is more approachable -- there is an incomprehensible piece that converts a float into mantissa and base-10 exponent which you can pretty much use unchanged, and a comprehensible piece that formats to text, of which we already have two alternative representations for you to base on.

The dtoa code would be less approachable but potentially faster if you find the right place to stick an early return when the necessary precision is reached. With ryu it is less feasible to avoid generating full precision in the mantissa.

dtolnay avatar Mar 25 '19 14:03 dtolnay

@dtolnay all the inputs are f64.

I did some digging to provide some more detail, and here are the places we do float formatting in a tight loop:

https://github.com/jonhoo/inferno/blob/48506e8c9587a16885b240daac685b82c9119d35/src/flamegraph/svg.rs#L246-L247

https://github.com/jonhoo/inferno/blob/48506e8c9587a16885b240daac685b82c9119d35/src/flamegraph/mod.rs#L534

There's also this, though I'm a little less worried about this one since it's only used for differentials. We could also probably work around the :+ in a different way.

https://github.com/jonhoo/inferno/blob/48506e8c9587a16885b240daac685b82c9119d35/src/flamegraph/mod.rs#L550
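(For reference, the `:+` there is just the sign-forcing flag in Rust's format syntax, combined with the precision specifier; a minimal std-only illustration:)

```rust
fn main() {
    // `+` forces an explicit sign; `.2` is the precision specifier.
    assert_eq!(format!("{:+.2}", 0.5_f64), "+0.50");
    assert_eq!(format!("{:+.2}", -0.5_f64), "-0.50");
}
```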

So the format output is actually pretty straightforward. The numbers are also all fairly small, so I don't think we'll ever need exponential notation. Just returning at the right precision in something like dtoa seems like a great way to go about it. I guess it means we'll probably first do a write of the float into the StrStack, and then write that string into the final output.
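That two-step idea can be sketched std-only (a plain String and Vec stand in for the StrStack and the SVG writer here; this is just the shape of the plan, not inferno's actual code):

```rust
// Format each float once into a reusable scratch buffer, then write the
// resulting string to the final output as a plain byte copy.
use std::fmt::Write as _;
use std::io::{self, Write as _};

fn main() -> io::Result<()> {
    let mut scratch = String::new();
    let mut out: Vec<u8> = Vec::new();
    for &f in [1.0_f64, 2.5, 3.25].iter() {
        scratch.clear();
        // One formatting pass per float...
        write!(scratch, "{:.2}", f).unwrap();
        // ...then a cheap byte copy into the output stream.
        out.write_all(scratch.as_bytes())?;
        out.write_all(b" ")?;
    }
    assert_eq!(&out[..], &b"1.00 2.50 3.25 "[..]);
    Ok(())
}
```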

We actually also spend a bit of time formatting integers in

https://github.com/jonhoo/inferno/blob/48506e8c9587a16885b240daac685b82c9119d35/src/flamegraph/mod.rs#L772-L775

But that's probably not worth doing anything about for the time being.

jonhoo avatar Mar 25 '19 14:03 jonhoo

It does seem like dtoa (or perhaps ryu) is the way to go here.

Are all the floating point numbers pretty small (meaning not that many digits to the left of the decimal)? If that's the case, you don't really need comma separators and therefore don't really need num-format. Even if num-format could do floats (it can't right now), it would necessarily be slower than dtoa or ryu.

If my assumption is wrong and you would really like comma separators for some reason, I'd be happy to add floating point support to num-format so you could use it here instead; but, again, it can never be as fast as dtoa or ryu. It might even be significantly slower than those two if I can't come up with an algorithm that (a) is similar to either dtoa or ryu and (b) lets one slot in comma separators in the right places. I haven't had the time to dig into that question, but it doesn't appear obviously doable, meaning there's a possibility that "floating point support" for num-format ends up being something really naive, like calling to_string and then inserting commas in the right places, or something equally awful from a speed perspective.

bcmyers avatar Mar 25 '19 17:03 bcmyers

Hmm, actually, yeah, you're right, we specifically don't want comma separators, since these are used as px units in the browser. We only want the decimal point. So num-format is probably out of the question. It sounds like a modified dtoa is what we want then (or maybe we could send a PR to dtoa to make "truncate decimals" an option).

jonhoo avatar Mar 25 '19 17:03 jonhoo

Assuming truncation is all you need (not rounding), something like the write_truncated function below might get you a speed-up without having to fork / modify dtoa. It probably wouldn't be as fast as modifying dtoa, but it might be faster than the current implementation because it leverages dtoa.

use std::io;

use rand::Rng;
use str_stack::StrStack;

fn main() -> Result<(), io::Error> {
    let mut rng = rand::thread_rng();
    let mut stack = StrStack::new();
    let mut buf = Vec::new();
    for _ in 0..100 {
        let f = rng.gen::<f64>() * 1000.0;
        write_truncated(&mut stack, &mut buf, f, 2)?;
    }
    println!("{:?}", &stack);
    Ok(())
}

fn write_truncated(
    stack: &mut StrStack,
    buf: &mut Vec<u8>,
    f: f64,
    precision: usize,
) -> Result<(), io::Error> {
    buf.clear();
    // Let dtoa write the full shortest representation first.
    dtoa::write(&mut *buf, f)?;
    let decimal_position = buf
        .iter()
        .position(|b| *b == b'.')
        .expect("should always contain a '.'");
    // Keep at most `precision` digits after the '.', zero-padding if shorter.
    let end = decimal_position + precision + 1;
    while buf.len() < end {
        buf.push(b'0');
    }
    let slice = &buf[..end];
    // Safe: dtoa only ever emits ASCII.
    let s = unsafe { std::str::from_utf8_unchecked(slice) };
    stack.push(s);
    Ok(())
}

bcmyers avatar Mar 25 '19 19:03 bcmyers

That's a good question -- I don't actually know what {:.2} does -- round, truncate, or something weirdly in-between. It's also not clear to me that it matters for coordinates... @jasonrhansen?

jonhoo avatar Mar 25 '19 19:03 jonhoo

It looks like {:.2} rounds instead of truncates. I'm not familiar enough with the code base to know if we need rounding or can just get away with truncation.
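A quick std-only check of that (1.875 is chosen because it's exactly representable in f64, so the result is unambiguous):

```rust
fn main() {
    // The standard formatter rounds to the requested precision...
    assert_eq!(format!("{:.2}", 1.875_f64), "1.88");
    // ...whereas truncation would simply drop the extra digits.
    let truncated = (1.875_f64 * 100.0).trunc() / 100.0;
    assert_eq!(format!("{:.2}", truncated), "1.87");
}
```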

That said, if we do not need rounding, here's a branch that uses dtoa and truncation that should be (I haven't benchmarked it) faster than the current implementation.

It appears to pass all the tests except those that it fails because rounding is different than truncation.

Perhaps someone who knows more about the benchmark setup you guys are using could see if this does indeed give a speed-up. If so, we can clean it up a little bit and perhaps put it in as a stop-gap on the road to modifying dtoa (assuming, of course, that truncation ends up being OK).

If it's worth pursuing (it may not be), I can create a PR to start collaborating on how to clean up this "proof-of-concept."

Branch: https://github.com/bcmyers/inferno/tree/float-formatting Diff with master: https://github.com/bcmyers/inferno/commit/935fd3b265de8d332ccff6409aa7bb87fec9626b

bcmyers avatar Mar 25 '19 20:03 bcmyers

It probably doesn't really matter if we truncate instead of round. It will slightly diverge in behavior from the Perl version, but if the performance increase is large enough, I'd say it's worth it. Let me do some benchmarking.

jasonrhansen avatar Mar 25 '19 22:03 jasonrhansen

When I run the benchmarks with the default options, it doesn't look like there's much of an improvement.

flamegraph/dtrace       time:   [991.36 us 992.91 us 994.62 us]
                        thrpt:  [76.382 MiB/s 76.513 MiB/s 76.633 MiB/s]
                 change:
                        time:   [-1.6386% -1.2175% -0.7921%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7985% +1.2325% +1.6659%]
                        Change within noise threshold.

flamegraph/perf         time:   [3.8639 ms 3.8762 ms 3.8930 ms]
                        thrpt:  [158.36 MiB/s 159.05 MiB/s 159.56 MiB/s]
                 change:
                        time:   [-1.9801% -1.4496% -0.8547%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8621% +1.4709% +2.0201%]
                        Change within noise threshold.

It's a little better when I use the reverse_stack_order option, which produces many more frames.

flamegraph/dtrace       time:   [4.7432 ms 4.7517 ms 4.7625 ms]
                        thrpt:  [15.952 MiB/s 15.988 MiB/s 16.017 MiB/s]
                 change:
                        time:   [-3.1295% -2.7839% -2.4623%] (p = 0.00 < 0.05)
                        thrpt:  [+2.5245% +2.8636% +3.2306%]
                        Performance has improved.
flamegraph/perf         time:   [25.980 ms 26.151 ms 26.329 ms]
                        thrpt:  [23.416 MiB/s 23.575 MiB/s 23.731 MiB/s]
                 change:
                        time:   [-2.7176% -2.2753% -1.8368%] (p = 0.00 < 0.05)
                        thrpt:  [+1.8711% +2.3283% +2.7935%]
                        Performance has improved.

jasonrhansen avatar Mar 25 '19 22:03 jasonrhansen

@jasonrhansen that's interesting -- are you swapping out all the cases I highlighted above where we do formatting? What shows up as the primary hot-spots if you flamegraph the benchmark now?

jonhoo avatar Mar 26 '19 01:03 jonhoo

I probably missed a couple of areas where we could speed up the code; I'll do a more thorough pass sometime soon. Hopefully we can get more than a 1-2% speedup from this.

Since hastily putting together the branch above, I started writing a little standalone library with write_truncated and write_rounded functions that use dtoa so I could micro-benchmark them against write!(buf, "{:.2}", f) (i.e. against the standard library). It looks like we can speed up that specific function about 1.5-2x relative to the standard library (preliminary results: a custom write_truncated takes about 65ns, a custom write_rounded takes about 80ns, and the standard library equivalent (rounded) takes about 105ns).

I ran out of time to continue working on this today, but am happy to keep sending you what I find in the coming days.

bcmyers avatar Mar 26 '19 01:03 bcmyers

Here's the flame graph with the changes @bcmyers made.


flamegraph.svg.zip

jasonrhansen avatar Mar 26 '19 03:03 jasonrhansen

That's interesting... From what I can tell, here are the current bottlenecks:

  1. Writing to STDOUT: ~16%
  2. quick_xml::write_event: ~13%
  3. StrStack::write: ~12%
  4. Frame merging: ~12% (a decent amount here is hashing, may be sped up by fnv in #113)
  5. write_truncated (~7%) + write_truncated2 (~5%) ~= 12%
  6. BytesStart::extend_attributes: ~11% (still some allocation happening in filled_rectangle?)
  7. Formatting integers: ~7%

It certainly seems like there aren't any obvious optimization targets that would give us huge wins here; everything is around ~10%. I think the best candidates are likely 4, 5, and 7. We should mostly have dealt with 6 at this point, though there are still some allocations in filled_rectangle that I don't think should be showing up.

jonhoo avatar Mar 26 '19 14:03 jonhoo

I tried using FnvHashMap for frame merging, but it didn't seem to make a difference.
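(For context, the swap itself is a one-line type change. Here's a std-only sketch of the FNV-1a hashing that fnv provides, plugged into std's HashMap via BuildHasherDefault; the names and frame strings are illustrative, not inferno's actual code:)

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// FNV-1a; the constants are the standard 64-bit FNV offset basis and prime.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325)
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100_0000_01b3);
        }
    }
}

// Swapping hashers on a std HashMap is just a type change.
type FnvMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut counts: FnvMap<&str, u64> = FnvMap::default();
    for &frame in ["main;run", "main;run", "main;idle"].iter() {
        *counts.entry(frame).or_insert(0) += 1;
    }
    assert_eq!(counts["main;run"], 2);
    assert_eq!(counts["main;idle"], 1);
}
```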

jasonrhansen avatar Mar 26 '19 18:03 jasonrhansen

Interesting... I wonder if hashbrown + fnv would...

jonhoo avatar Mar 26 '19 19:03 jonhoo

Something is still very weird here... Looking at the performance results in https://github.com/jonhoo/inferno/pull/114, --reverse is still significantly slower than without reverse. We should probably do a flamegraph diff between them to see where the extra time is spent.

jonhoo avatar Mar 26 '19 20:03 jonhoo

This is a differential flame graph comparing with and without --reverse.

diff.svg.zip

jasonrhansen avatar Mar 26 '19 21:03 jasonrhansen

That's super weird. I don't understand why it's so much slower given that profile :/

jonhoo avatar Mar 26 '19 23:03 jonhoo

Well, both flamegraph::filled_rectangle and svg::write_str have around 20x the number of samples when using --reverse.

Function                        Samples    Samples (--reverse)
flamegraph::filled_rectangle        242                  4,528
svg::write_str                      282                  5,741

jasonrhansen avatar Mar 27 '19 01:03 jasonrhansen

Yeah, but the file is also larger. I guess there are more calls to the expensive drawing functions relative to the number of lines in the file because there is just less frame merging.

jonhoo avatar Mar 27 '19 02:03 jonhoo

I benchmarked with hashbrown + fnv, but no change in performance was detected.

jasonrhansen avatar Mar 28 '19 17:03 jasonrhansen