parley icon indicating copy to clipboard operation
parley copied to clipboard

[parley] Limit input text size to 2^30 bytes

Open dfrg opened this issue 1 year ago • 2 comments

1 GB of text per layout is more than generous. We're likely to see severe performance degradation long before reaching this limit anyway suggesting that applications wanting to support excessively large buffers will need to do some segmentation and caching.

Rationale: this lets us use u32 instead of usize for internal indices which can significantly reduce memory requirements for internal data structures. Using 30 bits instead of 32 (or 31) allows us to occasionally steal a bit for other info (often incredibly useful) while still preserving the use of MAX_SIZE + 1 as the exclusive end boundary.

dfrg avatar Aug 18 '24 00:08 dfrg

To be clear, are you wanting MAX_SIZE + 1 to be $$2^{30}$$, or $$2^{30}-1$$? That is, do we have bits 30 and 31 available as "scratch", or just bit 31?

DJMcNab avatar Aug 19 '24 09:08 DJMcNab

Yes :) That’s my off by one error. I think we might as well reserve the 2 bits because 1 GB per layout seems to be plenty.

dfrg avatar Aug 19 '24 14:08 dfrg

I didn't find this issue and posted #408. Here's what I came up with independently (almost exactly the same conclusions as Chad 😂; please excuse me rediscovering Atlantis here):

Laying out 2³² bytes of simple ASCII text with reasonable line lengths in Parley takes hundreds of gigabytes of memory†; and likely over an hour of time to complete; and the resulting Layout is almost unusable on account of its size.

There is almost no practical application for Parley attempting to lay out 4 GiB (or even 1 GiB) of text in a single layout; realistically nobody is going to wait half an hour to lay out a gigabyte of text (one time) just for the convenience of it being in a single layout, and be willing to spend something approaching a terabyte of memory on actually using the result of that layout.

Given that indexes, both relative and absolute, are littered throughout Parley; and given that layouts larger than a few megabytes even are not practical, let alone layouts exceeding 2³¹ bytes, I propose we look at setting a limit per layout of something like 2³⁰ − 1 bytes. I suggest one less than 31 bits because it would allow us to use the most or least significant bit for other information; a simple (but maybe not that important) example being: cursor position being i32 with the least significant bit giving affinity, such that zero is the cursor at the start of the layout, (len << 1) + 1 is the cursor at the end of the layout, and cursor >> 1 is inherently a valid byte index in the layout (either the end or the start of a cluster).

Reducing the size of the index type would also improve the memory footprint of Parley quite a bit if done correctly, and maybe by reducing the theoretical maximum length of the text, we would add (I think) 20-30% to the de facto maximum length of a layout in a given application (constrained by memory, maybe a bit less when time constrained).

And since every other length or offset in Parley pertains to one or more bytes in the input text, this means that indexes for lines, clusters, runs, etc. could also be shrunk to i32.

† The most I can do on my 128 GiB machines is 2²⁹ bytes; it takes 16 minutes to do the initial layout and then I run out of memory if I try to iterate on lines, so it's useless anyway.

xorgy avatar Aug 27 '25 16:08 xorgy

Yes, I still think this makes sense. Note that HarfRust/HarfBuzz use a u32 for the cluster index (and HarfBuzz uses int for length when filling the buffer) so we’re in some ways already limited to a 32-bit range by the shaper.

I’m not sure exactly how or where this should be enforced. I imagine things will become more clear when we start work on parley_core.

dfrg avatar Aug 28 '25 11:08 dfrg

At the moment, the practical maximum text input size is 65535, as HarfRust uses u16 for various indices. That way it's quite easy to overflow (thread 'main' panicked at .../harfrust-0.3.1/src/hb/ot_layout_gsubgpos.rs:365:9:attempt to add with overflow).

This is the same limit that I remember from the good old Harfbuzz days. In my mind shaping makes sense to limit to a single paragraph of text, that's also we dealt with it in Qt: A large document of text is broken down into paragraphs, and each paragraph is shaped separately with the equivalent of a parley::layout::Layout.

Is that a safe assumption? Does that mean that the parley editor should also deal with multiple Layouts and let the cursor and selection go across them? Or would it be cleaner if parley::layout::Layout could handle large texts by itself and broke it down further?

tronical avatar Oct 13 '25 11:10 tronical

A few thoughts:

A large document of text is broken down into paragraphs, and each paragraph is shaped separately with the equivalent of a parley::layout::Layout.

Yes, I'd recommend this. Aside from anything else, this should make it cheaper to incrementally update text if only some of the paragraphs change. Blitz (which renders HTML) is currently separating each block-level HTML tag (e.g. <p>) but is not further splitting up paragraphs by analysing line breaks (this makes sense for HTML where whitespace is collapsed so line breaks within a <p> are relatively rare anyway). For some large wikipedia pages we have >1000 parley::Layouts which seems to work fine. Bonus: you can construct separate parley Layout's in parallel (which is a nice performance win given that shaping is relatively expensive).

At the moment, the practical maximum text input size is 65535

That limit probably applies to text with a single style. Because I think we're currently shaping each style span separately.

I think Parley probably ought to internally deal with the 65535 character limit if it does get hit (presumably by shaping 65535 characters, iterating backwards through the result until it finds a point at which it's safe to break shaping and then reshaping from that point).

Does that mean that the parley editor should also deal with multiple Layouts and let the cursor and selection go across them?

That has been our working assumption of how this ought to work (discussed in person at RustNL 2024), but nobody has actually implemented this yet. Blitz currently isn't dealing with this problem (we haven't run into it yet - I guess nobody is using Blitz with large text inputs).

nicoburns avatar Oct 13 '25 12:10 nicoburns

Thanks of the quick response. You're right, such a per-run limit could and probably should be dealt with inside Layout, but indeed not for the reasons of making it scale.

Great point about the parallel building - I hadn't thought of that :)

We're still in the process of moving Slint to use parley directly (which is going very well, btw), but I'd like to eventually get to the point of using the editor, and by that point I think we can try to contribute the multi-layout setup, unless somebody beats us to it :)

tronical avatar Oct 13 '25 13:10 tronical

At the moment, the practical maximum text input size is 65535, as HarfRust uses u16 for various indices. That way it's quite easy to overflow (thread 'main' panicked at .../harfrust-0.3.1/src/hb/ot_layout_gsubgpos.rs:365:9:attempt to add with overflow).

Looking at the referenced code, this appears to be a bug. The glyph_data field is used to track glyphs within a matching context which should be limited to a 64 glyph window. If you have a repro for this, please file a HarfRust issue.

More generally, HB/HR should handle runs larger than 64k.

dfrg avatar Oct 13 '25 15:10 dfrg