
Quadratic time complexity when parsing long recipient stanzas

Open · DavidBuchanan314 opened this issue 2 years ago • 5 comments

Environment

  • OS: aarch64 Arch Linux
  • rage version: rage 0.9.0

What were you trying to do

Decrypt an age file (attached, with test key) with an unusually long (~4MB) recipient stanza body, followed by a normal X25519 stanza.

Expected outcome (it is a completely valid age file, and works fine with age):

$ age -d -i testkey.txt test.age
Hello, world!

(the above completes near-instantly)

What happened

It takes an extremely long time to parse, and progress gets progressively slower. Using pv to monitor the progress like so:

$ pv test.age | rage -d -i testkey.txt
 768KiB 0:02:47 [0.00 B/s] [====>                           ] 18% ETA 0:12:17

(pv's ETA grows constantly)

Something similar also happens with a very long recipient type string, but that seems less likely to be a problem in a "real world" scenario, although you might still want to consider it a bug.

Edit: I just learnt about rage's concept of "greasing" - consider this an extreme form of grease, heh

long_recipient.zip

DavidBuchanan314 avatar Feb 19 '23 17:02 DavidBuchanan314

I see you've pulled out the high-viscosity grease with a lithium complex soap thickener 😆

The performance issue is almost certainly this parser: https://github.com/str4d/rage/blob/3ff541a0a59864b782aeaca4d33cee1c8d61c2e9/age-core/src/format.rs#L191-L218

My expectation is that full_chunks is reallocating at every power-of-two number of lines. This is because the many_till parser is unaware of the length of the input data, so it does the following (a small demonstration follows the list):

  • Creates res = Vec::new(), whose initial allocation is an implementation detail of the standard library (currently it allocates nothing).
  • While a short line has not been found:
    • Finds a full line.
    • Calls res.push(full_line), which reallocates according to the standard library's growth strategy, currently:
      • On the first push into the empty Vec, allocating a small initial capacity to avoid tiny reallocations.
      • After that, doubling the capacity at each power of two.
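
A quick standalone demo of that growth pattern (not rage code; the exact thresholds are an implementation detail of the standard library):

```rust
fn main() {
    // Push byte-slice "lines" into a Vec created with Vec::new() and report
    // every time its capacity changes; with the current std implementation
    // this prints a small initial capacity followed by repeated doublings.
    let mut lines: Vec<&[u8]> = Vec::new();
    let mut last_capacity = lines.capacity();
    println!("start: capacity {last_capacity}");
    for i in 1..=100_000u32 {
        lines.push(b"QUdFIGlzIGdyZWF0");
        if lines.capacity() != last_capacity {
            last_capacity = lines.capacity();
            println!("after line {i}: capacity grew to {last_capacity}");
        }
    }
}
```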

We do not know what comes after the stanza (so we can't just break on \n-> or \n---), because this parser is used both for age headers and for the v1 plugin IPC protocol. However, now that the age stanza format requires a final short line, we could probably write a variant of many_till that is aware of the newline-separated format: use the short-line parser to identify the last line of the wrapped data, and count the preceding full lines to know what capacity to give the Vec.
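
A rough sketch of that idea on a plain byte slice (not the actual nom parser; the 64-column line width comes from the stanza wrapping, and base64 validation is omitted):

```rust
/// Split base64-wrapped stanza data into its 64-byte full lines plus the
/// final short line, pre-sizing the Vec instead of letting it grow by
/// repeated doubling. Returns None if the terminating short line (and its
/// trailing newline) hasn't arrived yet.
fn wrapped_data_lines(input: &[u8]) -> Option<(Vec<&[u8]>, &[u8])> {
    // First pass: find the short line that terminates the wrapped data and
    // count the full lines before it, so we know the exact capacity needed.
    let mut full_lines = 0;
    let mut offset = 0;
    loop {
        let end = offset + input[offset..].iter().position(|&b| b == b'\n')?;
        if end - offset < 64 {
            break;
        }
        full_lines += 1;
        offset = end + 1;
    }

    // Second pass: collect the lines into a Vec allocated exactly once.
    let mut lines = Vec::with_capacity(full_lines + 1);
    let mut offset = 0;
    for _ in 0..=full_lines {
        let end = offset + input[offset..].iter().position(|&b| b == b'\n').unwrap();
        lines.push(&input[offset..end]);
        offset = end + 1;
    }
    Some((lines, &input[offset..]))
}
```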

str4d avatar Feb 19 '23 19:02 str4d

I captured a flamegraph, and indeed the performance issue is in wrapped_encoded_data, but it's not the allocations!

[flamegraph]

The bottleneck is this function: https://github.com/str4d/rage/blob/3ff541a0a59864b782aeaca4d33cee1c8d61c2e9/age-core/src/format.rs#L145-L152

For a given slice, is_base64_char is called on every byte twice (due to how many_till works). However, thinking about it, the deeper cause of this behaviour is my use of parser combinators, and how they interact with io::Read: https://github.com/str4d/rage/blob/3ff541a0a59864b782aeaca4d33cee1c8d61c2e9/age/src/format.rs#L64-L89

Given an io::Read type, we don't have access to any external buffering, so we cannot over-read (otherwise the data after the header would be truncated), which means we can't use a buffered reader here. Meanwhile, the combinator parsers produce the entire header at once (so we avoid holding partial or invalid header states in memory). So the way we parse files is to run the parser until it tells us how many more bytes it needs, read that many bytes from the io::Read, and repeat.
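
Roughly the shape of that loop, as a simplified sketch assuming nom 7; header_parser here is a toy placeholder that just demands a fixed amount of input, not the real header parser:

```rust
use std::io::{self, Read};

use nom::{Err, IResult, Needed};

// Toy placeholder for the real header parser: it asks for 16 bytes in total,
// purely to demonstrate the Incomplete/Needed round-trips.
fn header_parser(input: &[u8]) -> IResult<&[u8], ()> {
    if input.len() < 16 {
        Err(Err::Incomplete(Needed::new(16 - input.len())))
    } else {
        Ok((&input[16..], ()))
    }
}

fn read_header<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut data = Vec::new();
    loop {
        // Each iteration re-runs the parser over `data` from the very start,
        // so the early bytes get scanned again every time more input arrives.
        let needed = match header_parser(&data) {
            Ok(_) => return Ok(data),
            Err(Err::Incomplete(Needed::Size(n))) => n.get(),
            Err(Err::Incomplete(Needed::Unknown)) => 1,
            Err(_) => return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid header")),
        };
        // Read exactly the number of bytes the parser asked for, then retry.
        let start = data.len();
        data.resize(start + needed, 0);
        reader.read_exact(&mut data[start..])?;
    }
}
```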

The upshot is that as the header grows, the number of times we re-run is_base64_char over the early bytes of the header grows with it, which is what makes parsing quadratic in the header length.

I suspect the only ways to improve on this are:

  • Break the encapsulation of the current APIs, and make the outer header parser aware of the line-based nature of the header. Then instead of reading only the number of bytes the parser asks for, we always read to the next newline, effectively buffering lines like age does (a rough sketch of this follows the list).
  • Break the type safety of the current APIs, and store partially-parsed stanzas in memory instead of only fully-parsed stanzas. I strongly dislike this, but it's seemingly the only way to avoid re-parsing the earlier lines of very large stanzas, and it's what age does.
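
A sketch of the first option, reusing the same kind of toy placeholder parser as above (so, an illustration of the idea rather than rage's actual internals): whenever the parser reports Incomplete, read up to the next newline instead of exactly the bytes it asked for.

```rust
use std::io::{self, BufRead};

use nom::{Err, IResult, Needed};

// Same toy placeholder as in the earlier sketch: demands 16 bytes in total.
fn header_parser(input: &[u8]) -> IResult<&[u8], ()> {
    if input.len() < 16 {
        Err(Err::Incomplete(Needed::new(16 - input.len())))
    } else {
        Ok((&input[16..], ()))
    }
}

fn read_header_line_buffered<R: BufRead>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut data = Vec::new();
    loop {
        match header_parser(&data) {
            Ok(_) => return Ok(data),
            Err(Err::Incomplete(_)) => {
                // Ignore the exact byte count the parser asked for and pull in
                // the rest of the current line instead. In the real format the
                // header always ends at a newline, so reading whole lines never
                // consumes data past the header.
                if reader.read_until(b'\n', &mut data)? == 0 {
                    return Err(io::ErrorKind::UnexpectedEof.into());
                }
            }
            Err(_) => return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid header")),
        }
    }
}
```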

str4d avatar Feb 20 '23 01:02 str4d

I added a Decryptor::new_buffered API, and implemented a basic change to use BufRead::read_until(b'\n', ..) instead of reading precisely what the parser asks for. The effect on the header-parsing benchmarks (Ryzen 9 5950X, parsing a header containing 1-10 native age recipients):

Before

header/parse/1          time:   [27.608 µs 30.490 µs 33.336 µs]                            
                        thrpt:  [29.998 Kelem/s 32.798 Kelem/s 36.221 Kelem/s]
header/parse/2          time:   [55.729 µs 59.756 µs 63.719 µs]                           
                        thrpt:  [31.388 Kelem/s 33.469 Kelem/s 35.888 Kelem/s]
header/parse/3          time:   [96.203 µs 101.46 µs 106.67 µs]                           
                        thrpt:  [28.124 Kelem/s 29.569 Kelem/s 31.184 Kelem/s]
header/parse/4          time:   [152.19 µs 158.87 µs 165.37 µs]                           
                        thrpt:  [24.189 Kelem/s 25.177 Kelem/s 26.283 Kelem/s]
header/parse/5          time:   [214.20 µs 223.40 µs 232.31 µs]                           
                        thrpt:  [21.523 Kelem/s 22.381 Kelem/s 23.342 Kelem/s]
header/parse/6          time:   [290.83 µs 301.38 µs 312.14 µs]                           
                        thrpt:  [19.222 Kelem/s 19.908 Kelem/s 20.631 Kelem/s]
header/parse/7          time:   [400.78 µs 411.78 µs 423.08 µs]                           
                        thrpt:  [16.546 Kelem/s 16.999 Kelem/s 17.466 Kelem/s]
header/parse/8          time:   [514.58 µs 528.42 µs 542.48 µs]                           
                        thrpt:  [14.747 Kelem/s 15.140 Kelem/s 15.547 Kelem/s]
header/parse/9          time:   [636.98 µs 652.54 µs 667.84 µs]                           
                        thrpt:  [13.476 Kelem/s 13.792 Kelem/s 14.129 Kelem/s]

After

header/parse/1          time:   [1.8364 µs 1.9232 µs 2.0160 µs]                            
                        thrpt:  [496.04 Kelem/s 519.96 Kelem/s 544.54 Kelem/s]
                 change:
                        time:   [-94.062% -93.586% -93.044%] (p = 0.00 < 0.05)
                        thrpt:  [+1337.6% +1459.1% +1584.0%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
header/parse/2          time:   [2.8876 µs 2.9794 µs 3.0753 µs]                            
                        thrpt:  [650.34 Kelem/s 671.27 Kelem/s 692.62 Kelem/s]
                 change:
                        time:   [-95.384% -95.128% -94.844%] (p = 0.00 < 0.05)
                        thrpt:  [+1839.4% +1952.4% +2066.5%]
                        Performance has improved.
header/parse/3          time:   [4.2078 µs 4.3345 µs 4.4631 µs]                            
                        thrpt:  [672.18 Kelem/s 692.13 Kelem/s 712.95 Kelem/s]
                 change:
                        time:   [-95.887% -95.716% -95.525%] (p = 0.00 < 0.05)
                        thrpt:  [+2134.8% +2234.1% +2331.1%]
                        Performance has improved.
header/parse/4          time:   [5.8106 µs 5.9730 µs 6.1503 µs]                            
                        thrpt:  [650.38 Kelem/s 669.68 Kelem/s 688.40 Kelem/s]
                 change:
                        time:   [-96.337% -96.197% -96.043%] (p = 0.00 < 0.05)
                        thrpt:  [+2427.0% +2529.4% +2630.0%]
                        Performance has improved.
header/parse/5          time:   [7.8963 µs 8.0530 µs 8.2136 µs]                            
                        thrpt:  [608.75 Kelem/s 620.89 Kelem/s 633.21 Kelem/s]
                 change:
                        time:   [-96.446% -96.326% -96.200%] (p = 0.00 < 0.05)
                        thrpt:  [+2531.3% +2621.9% +2713.5%]
                        Performance has improved.
header/parse/6          time:   [10.423 µs 10.623 µs 10.840 µs]                            
                        thrpt:  [553.51 Kelem/s 564.82 Kelem/s 575.65 Kelem/s]
                 change:
                        time:   [-96.554% -96.456% -96.357%] (p = 0.00 < 0.05)
                        thrpt:  [+2644.7% +2721.7% +2802.3%]
                        Performance has improved.
header/parse/7          time:   [12.668 µs 12.849 µs 13.049 µs]                            
                        thrpt:  [536.44 Kelem/s 544.79 Kelem/s 552.59 Kelem/s]
                 change:
                        time:   [-96.865% -96.784% -96.702%] (p = 0.00 < 0.05)
                        thrpt:  [+2931.8% +3009.6% +3090.2%]
                        Performance has improved.
header/parse/8          time:   [15.245 µs 15.464 µs 15.709 µs]                            
                        thrpt:  [509.25 Kelem/s 517.32 Kelem/s 524.75 Kelem/s]
                 change:
                        time:   [-97.082% -97.015% -96.947%] (p = 0.00 < 0.05)
                        thrpt:  [+3175.6% +3250.1% +3326.6%]
                        Performance has improved.
header/parse/9          time:   [19.092 µs 19.343 µs 19.607 µs]                            
                        thrpt:  [459.02 Kelem/s 465.29 Kelem/s 471.40 Kelem/s]
                 change:
                        time:   [-97.074% -97.015% -96.950%] (p = 0.00 < 0.05)
                        thrpt:  [+3179.2% +3249.9% +3317.3%]
                        Performance has improved.

Soooo I guess this means I'm going to add buffered versions of my constructors. The very large header is still quadratic after this change, so I'll still need to consider breaking the type safety of the current types, or at least figure out a way to encapsulate the partial parsing.

str4d avatar Feb 20 '23 03:02 str4d

#377 implements the above, which also required impl io::BufRead for ArmoredReader in order for rage to use it. I still need to implement Decryptor::new_async_buffered and impl AsyncBufRead for ArmoredReader to provide the equivalent performance improvements for the async APIs.

I've also confirmed that this change is insufficient for the long_recipient.zip test case to work with rage. So I'll look into breaking apart the internal type safety next (and hope that I can keep this breakage internal).

str4d avatar Mar 24 '23 23:03 str4d

#393 implements Decryptor::new_async_buffered and impl AsyncBufRead for ArmoredReader to provide the equivalent performance improvements for the async APIs.

I'm going to deploy these performance improvements in rage 0.9.2, and defer the rest of this issue to later.
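
Roughly, using the buffered constructor looks like this (a sketch against my reading of the 0.9.x enum-style Decryptor API; exact signatures and error handling may differ):

```rust
use std::{fs::File, io::{self, BufReader, Read}, iter, str::FromStr};

use age::{x25519, Decryptor, Identity};

// Sketch of decrypting a file via the buffered constructor; the path and key
// handling here are illustrative only.
fn decrypt_file(path: &str, identity_str: &str) -> io::Result<Vec<u8>> {
    let identity = x25519::Identity::from_str(identity_str).expect("valid identity");

    // Any BufRead works here; BufReader<File> is the simplest choice.
    let file = BufReader::new(File::open(path)?);
    let decryptor = match Decryptor::new_buffered(file).expect("valid age header") {
        Decryptor::Recipients(d) => d,
        Decryptor::Passphrase(_) => panic!("expected a recipient-encrypted file"),
    };

    let mut plaintext = Vec::new();
    decryptor
        .decrypt(iter::once(&identity as &dyn Identity))
        .expect("an identity matched")
        .read_to_end(&mut plaintext)?;
    Ok(plaintext)
}
```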

str4d avatar Jun 12 '23 18:06 str4d