
Generics usage by decoders increases code size and compilation time

Open fintelia opened this issue 7 months ago • 25 comments

This program takes 17 seconds to compile and produces a 2.4 MB binary with 1.2 MB of that being instructions in the .text segment:

use std::{hint::black_box, io::{BufRead, Cursor, Seek}};
use image::ImageReader;

fn decode<R: BufRead + Seek>(slice: R) {
    black_box(ImageReader::new(slice).decode()).unwrap();
}

fn main() {
    let mut f = Cursor::new(black_box(&[]));
    decode(&mut f);
}

Adding more instantiations of ImageReader with different generic arguments drastically increases those numbers: ten more copies push compile time to 22 seconds, and the resulting 4.8 MB binary contains a 3 MB .text section:

// ...

fn main() {
    let mut f = Cursor::new(black_box(&[]));
    decode(&mut f);
    decode(&mut &mut f);
    decode(&mut &mut &mut f);
    decode(&mut &mut &mut &mut f);
    decode(&mut &mut &mut &mut &mut f);
    decode(&mut &mut &mut &mut &mut &mut f);
    decode(&mut &mut &mut &mut &mut &mut &mut f);
    decode(&mut &mut &mut &mut &mut &mut &mut &mut f);
    decode(&mut &mut &mut &mut &mut &mut &mut &mut &mut f);
    decode(&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut f);
    decode(&mut &mut &mut &mut &mut &mut &mut &mut &mut &mut &mut f);
}

What's going on here is that because ImageReader and each of the underlying decoders are parameterized by the reader type, instantiating them with 11 different reader types causes rustc to compile 11 copies of our PNG decoder, 11 copies of our JPEG decoder, 11 copies each of TIFF, WebP, and so on. All those copies then get fed into LLVM for codegen and optimization, and the resulting compile times and output sizes speak for themselves.

What should we do?

I think this starts to call into question our strategy of parameterizing decoders by std::io::Read (and sometimes Seek or BufRead). Any methods that aren't generic get compiled only once, and in parallel while the specific format crate is being compiled. But generic code can get compiled many times.

There's been some talk about trying to move towards a "sans io" style where an encoder or decoder would operate on byte slices rather than directly pulling new input data from a reader or pushing output data into a writer. I think that's a promising point to explore. Even if we didn't do a full switchover, just moving some parts of each decoder to that style could significantly cut down on the amount of code that gets replicated.
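
For a concrete picture of that style, here is a minimal sketch using a made-up run-length format of (count, byte) pairs; the names and the format are purely illustrative, not a proposed API. The caller owns all I/O and only ever hands the decoder byte slices:

use std::iter;

/// Toy sans-io decoder for a stream of (count, byte) RLE pairs.
/// No reader or writer in sight: the caller feeds slices in any chunking.
pub struct RleDecoder {
    /// A `count` byte whose partner `byte` is still missing because the
    /// pair was split across two input chunks.
    pending_count: Option<u8>,
}

impl RleDecoder {
    pub fn new() -> Self {
        RleDecoder { pending_count: None }
    }

    /// Consume all of `input`, appending decoded bytes to `output`.
    pub fn update(&mut self, input: &[u8], output: &mut Vec<u8>) {
        for &b in input {
            match self.pending_count.take() {
                None => self.pending_count = Some(b),
                Some(count) => output.extend(iter::repeat(b).take(count as usize)),
            }
        }
    }
}

Because update is not generic over any reader type, it gets monomorphized exactly once no matter how many reader types callers use.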

I'd also be open to other thoughts and ideas, or to hearing whether there's some aspect I'm overlooking. We will have to be mindful that compile time improvements don't lead to runtime regressions if we no longer specialize decoders to the specific reader type they're working with.

PRs #2468 and #2470 stop encoders and decoders, respectively, from being fully compiled when no downstream code uses them. That's a start, but it doesn't address the underlying concern.

fintelia avatar May 24 '25 20:05 fintelia

From my perspective, this is more often handled by creating two separate APIs: one that accepts a slice, and another that accepts a dynamic stream (Box<dyn Read>).

When accepting a slice, all decoders will be monomorphized once, and everything should work fine.

If it is actually a real stream under the hood (file, network, file descriptor), then the dynamic indirection of calls through Box<dyn Read> is still much cheaper than the work the stream itself performs. Thus, if the API is used correctly, there is technically no performance loss.

This still requires compiling all the decoders twice, but it covers a huge number of possible cases.
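
A rough sketch of that two-entry-point surface (function names are hypothetical, not the current image API):

use std::io::Read;

pub struct DecodedImage; // stand-in for the real output type

/// In-memory path: monomorphized exactly once.
pub fn decode_slice(data: &[u8]) -> DecodedImage {
    let _ = data; // ... decode directly from the slice ...
    DecodedImage
}

/// Streaming path: also compiled once, with one virtual call per chunk,
/// which is tiny next to the per-chunk decode work.
pub fn decode_stream(reader: &mut dyn Read) -> DecodedImage {
    let mut chunk = [0u8; 8 * 1024];
    loop {
        match reader.read(&mut chunk) {
            Ok(0) | Err(_) => break, // EOF or error: sketch-level handling
            Ok(n) => { let _ = &chunk[..n]; /* feed to decoder state */ }
        }
    }
    DecodedImage
}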

awxkee avatar May 24 '25 21:05 awxkee

I agree with @awxkee; this might mostly be a case of bad defaults. If we're reading large chunks of data, then the overhead of a single dyn Read (deep nesting would be bad) should not be too much. Of course, users lose the ability for the compiler to specialize their code over such readers, but apart from maybe Cursor<&[u8]> that is rather unlikely to matter much, and we have _from_memory for this case.

For dyn BufRead users it's a little more complex, because code patterns could call consume a lot with small counts. This would obviously have more dramatic runtime implications.

But overall it should probably be fine to offer the default API in a way that is not monomorphized, and instead have a specialized way (i.e. going through the io module) to get this effect.

One note though is that we need to pass dyn Read + BufRead + Seek + 'a to cover the use cases; a plain dyn Read would be 'static, so a subtrait grouping those would be necessary. Extra lifetime generics do not contribute to monomorphization, though; it might just become a little more complicated than a straightforward change. And we may need to target rust-version 1.86 for trait upcasting to implement this ergonomically.
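
For concreteness, a minimal sketch of such a grouping subtrait (the trait name is invented):

use std::io::{BufRead, Seek};

/// One object-safe trait bundling all capabilities the decoders need.
trait ImageSource: BufRead + Seek {}
impl<T: BufRead + Seek + ?Sized> ImageSource for T {}

/// The lifetime on the trait object keeps borrowed readers usable; it does
/// not add any monomorphized copies.
fn decode(source: &mut (dyn ImageSource + '_)) {
    // With trait upcasting (stable since Rust 1.86), `source` can also be
    // passed on as `&mut dyn BufRead` where only buffered reading is needed.
    let _ = source;
}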


On that note, flexible-io was designed with bloat from monomorphization in mind. I don't expect us to change decoders to that interface immediately when we can do the combined trait instead, but if someone wants to take it for a spin and see how it integrates into the decoders, please do. It would also work around the upcasting problem, but the implementation depends on internals or on the stabilization of <*mut T>::with_metadata_of, so this is really experimental only.

197g avatar May 25 '25 09:05 197g

With dyn Read we'd want to avoid making lots of small read_exact calls. If anything, that's harder to avoid than frequent consume calls because they're trickier to aggregate. I sort of wonder if the two readers we should support are Cursor<&[u8]> and BufReader<Box<dyn Read + Seek>>.

fintelia avatar May 25 '25 22:05 fintelia

Calls on a dynamic method usually cost (with context saving & restoring) 15-30 CPU cycles in the worst cases.

If the method needs to be loaded from RAM into the L1 cache, it can incur an additional 50-150 cycles of waiting (it might be more, but this can also happen on the static path). This should most likely happen only once on the decoding path, but it is not guaranteed.

For example, decoding a 100 KB image in 1 KB chunks would involve roughly 100 calls. Even in a pessimistic case, the total overhead would be approximately 30 * 100 + 150 = 3150 additional CPU cycles, which is very low compared to the overall cost of image decoding.

To incur a measurable penalty, read_exact would need to be called thousands of times, which should mostly not be the case. Where that actually happens, it is more likely a sign of API misuse.

Even for very small images, a few dozen additional cycles wouldn't make any difference.

If something somewhere does a hundred buf.read_exact(1) calls, it likely needs to be fixed.

awxkee avatar May 25 '25 23:05 awxkee

If something somewhere does a hundred buf.read_exact(1) calls, it likely needs to be fixed.

Uh, about that. Straightforward implementations of RLE decoding actually involve doing tons of tiny reads. Our HDR decoder does single-byte read calls and our TIFF decoder also sometimes does single-byte read_exact calls.

More sophisticated formats don't use RLE for compression, but the entropy coding they rely on also might be implemented using small I/O operations. For instance, our WebP decoder does fill_buf+consume in 8-byte (or fewer) increments.

You could argue that these functions should instead do a single large read and then perform subsequent decompression without needing further I/O against the underlying reader. But that's unnecessary when working with a Cursor<&[u8]> and basically what you get for free if you wrap the reader in a BufReader...

fintelia avatar May 26 '25 00:05 fintelia

I've struggled with the same issues when refactoring the gif codec. The more I work on this, the more I think these are just flawed APIs, and std::io has no attractive solution.

io::Read doesn't support zero-copy access, even when it's backed by an in-memory buffer.

io::BufRead can expose slices without an extra copy, but it can't guarantee anything about a minimum slice size. Even though it typically allocates an 8KB buffer, the interface requires supporting slices that are 1 byte long. This has led some codec implementations to load 2-byte or 4-byte integers byte by byte using a state machine. That's quite a lot of code for something that could have been a single unaligned read instruction with the right API (a sketch of the problem follows below).

io::Seek supports seeking backwards without any limit, so files from the network or stdin need to be saved in full. This isn't a major issue in practice, but the needless allocations, larger buffers, and unused copies show it's another inefficiency/limitation of the APIs.

io::Error doesn't get completely optimized out, even when the concrete R: io::Read type is a slice.

Rust lacks stable specialization, so it's hard to avoid buffering data twice. Generic APIs using these traits can't have fast/lean paths for slices and Vecs, and there's no elegant way to know when wrapping generic I/O in a BufReader would help or would needlessly add double buffering.

None of that works for AsyncRead without extra threads and channels, even when the underlying codec implementation uses a state machine that could pause.

And none of that works in no-std environments.
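
To make the io::BufRead complaint above concrete, here is what reading a single u32 has to be prepared for today, since fill_buf() may legally return one byte at a time (plain std APIs, nothing hypothetical):

use std::io::{self, BufRead};

/// Reads a little-endian u32 while tolerating 1-byte fill_buf() results.
/// A resumable decoder has to encode each of these four steps as a state.
fn read_u32_le(r: &mut impl BufRead) -> io::Result<u32> {
    let mut bytes = [0u8; 4];
    for slot in &mut bytes {
        let buf = r.fill_buf()?;
        if buf.is_empty() {
            return Err(io::ErrorKind::UnexpectedEof.into());
        }
        *slot = buf[0];
        r.consume(1);
    }
    Ok(u32::from_le_bytes(bytes))
}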

kornelski avatar May 26 '25 15:05 kornelski

I'm thinking that we'd never want to have the lowest-level leaf functions, like a Huffman decompressor, operating on anything other than a slice.

Calling any I/O API for every byte or a couple of bytes is always going to be bloated. There's a code cost of having to conditionally call a refill function. It's going to be a non-inlined call, so there's code bloat of register saving, code that checks the I/O result, and branches for retrying or returning. That's a lot more compared to a buffer length check and a pointer increment, which would also optimize better due to their side-effect-free simplicity.

So if we start with an assumption that I/O can't be on the hot path, and we're going to need to use slices (buffers) anyway, then we don't need to worry about overhead of dyn calls.

For maximum performance, I think we only need two monomorphisations:

  1. decoding directly from an in-memory buffer, where I/O can't fail (other than EOF).
  2. decoding from some dyn trait, with our own custom buffering strategy. If we don't use dyn io::Read directly, then maybe we could have special handling for io::BufRead too.
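
A sketch of how that could look as a single input type (names invented):

use std::io::{BufReader, Cursor, Read};

/// Each decoder gets compiled for exactly these two cases, no matter how
/// many concrete reader types callers use.
enum DecodeInput<'a> {
    /// 1. In-memory buffer: I/O cannot fail (other than EOF).
    Slice(Cursor<&'a [u8]>),
    /// 2. Any real stream, type-erased and buffered with our own strategy.
    Stream(BufReader<Box<dyn Read + 'a>>),
}

fn decode(input: DecodeInput<'_>) {
    match input {
        DecodeInput::Slice(_cursor) => { /* fast path, pure slice access */ }
        DecodeInput::Stream(_reader) => { /* dyn calls only on refill */ }
    }
}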

kornelski avatar May 26 '25 15:05 kornelski

I think there are some tradeoffs involved in how to structure the leaf functions, particularly to minimize the amount of effort required to convert all our existing decoders.

The ideal design is probably having a decompressor implemented as a state machine that is called with a slice of input data and can return to the caller when it needs more input. But that's pretty complicated and error-prone to implement. When I did it for fdeflate, it took significant testing and fuzzing to gain confidence in the implementation.

At the same time, an I/O API doesn't have to be that bad. It could be an inlined method that does a bounds check on the remaining contents of an in-memory buffer, and only calls into the cold-path refill logic (including a dyn call) if the buffer is empty. In return, you get a much simpler programming model for writing the decoder: just a series of read_exact calls to pull data in the expected order. That's probably why most of our simpler decoders are written in this style.
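
A sketch of that shape (a hypothetical type, not the current implementation): the hot path stays inlineable, and only the refill crosses a dyn boundary:

use std::io::{self, Read};

struct OurBufReader<'a> {
    inner: &'a mut dyn Read,
    buf: Vec<u8>, // fixed size, e.g. vec![0; 8192]
    pos: usize,
    filled: usize,
}

impl OurBufReader<'_> {
    /// Hot path: one bounds check and a copy; assumes out.len() <= buf.len().
    #[inline]
    fn read_exact(&mut self, out: &mut [u8]) -> io::Result<()> {
        if self.filled - self.pos >= out.len() {
            out.copy_from_slice(&self.buf[self.pos..self.pos + out.len()]);
            self.pos += out.len();
            Ok(())
        } else {
            self.refill_then_read(out)
        }
    }

    /// Cold path: shift the unread tail down, top up via the dyn call, retry.
    #[cold]
    fn refill_then_read(&mut self, out: &mut [u8]) -> io::Result<()> {
        self.buf.copy_within(self.pos..self.filled, 0);
        self.filled -= self.pos;
        self.pos = 0;
        while self.filled < out.len() {
            match self.inner.read(&mut self.buf[self.filled..])? {
                0 => return Err(io::ErrorKind::UnexpectedEof.into()),
                n => self.filled += n,
            }
        }
        self.read_exact(out)
    }
}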

One option would be to switch each of our decoders to either the state-machine style or to only working on in-memory buffers. It would be unfortunate to lose the ability to do streaming decodes for certain formats, but I'm not sure whether folks are actually loading giant BMP / TGA / HDR / etc. files where it would make a difference?

There's also the question of what to do about Seek. The TIFF decoder in particular needs to constantly jump around to different locations in the file to load data. (Admittedly, even the current design with a BufReader doesn't handle this well: each seek causes it to dump the entire 8 KB buffer and then do a fresh 8 KB read, even if you only need a few bytes.)

fintelia avatar May 26 '25 20:05 fintelia

Better abstractions on top of sans-IO that can communicate with an underlying transformer interface would be really neat. That way only that underlying interface needs to demonstrably preserve correctness, and all the IO can be mocked with equivalent byte-slice passing. The higher-level approach may be viewed as driving an underlying effectful coroutine. That way there is a state machine transition function that is monomorphic even with different types of IO happening around that step function instead of inside it, which should have the right effect on code size.

All the difference from the std traits is inverted control: instead of calling seek or fill_buf, we instruct some caller to do that on the source stream and resume us with the updated buffer. Surely it'd be nicer to write this with proper generator support (recently discussed by the lang team) or even effects; how do we abstract it without them? With regards to async, it may also be neat if we could instruct the caller to hold multiple offsets simultaneously, which would address the tiff/Seek issue: we'd have a list of open regions of interest to be read next and get resumed whenever some of them are available, while fetches for the incomplete regions are already scheduled in the background.

But that is already quite complex. For the simpler sequential construction, weezl seems to be in a decent spot. It has a sans-IO core on which higher levels are built. I still have big trouble seeing how to demonstrate the equivalence of all input chunkings in that implementation, but at least there is no worry that the chunking may be broken for one mode of IO: either it works for all, or it doesn't. This currently comes with the trade-off of also having to deal with 1-byte slices, since communicating a required minimum length in the interface is .. hard .. without much vocabulary in the type system to express that requirement.

A sketch of effects in pseudo-code

No matter what, when we call BufRead::consume we give up the whole input slice and get a new one. This must, however, semantically preserve the tail of data that was not consumed, but since there is no borrow, that is hard to enforce. I could see us enforcing a fully filled buffer of a fixed page size, though, except for the EOF case. With effects, in pseudo-syntax, something like this would be nice to read imo:

effect EffectfulBufRead {
    // The effect context is expected to fill the error details in when
    // it was not handled, this is not parameterized over ErrorKind.
    effect consume(count: usize) -> Result<NextBuf<'_>, png::ErrorKind>;
}

// I don't actually know how to write this.. so let's focus on reading..
effect EffectfulWrite {}

enum NextBuf<'data> {
    // There is at least one more page of data.
    FullPage(&'data [u8; 4096]),
    // We have less than 4096 bytes until eof remaining, all of them here.
    Eof(&'data [u8]),
}

fn decode(effect s: impl EffectfulBufRead) -> Result<Decoded, ErrorKind> {
  let mut result = …;

  let mut consumed = 0;
  let tail = loop {
    let input: &[u8; 4096] = match s.consume(consumed).effect {
        FullPage(next_data) => next_data,
        Eof(tail) => break tail,
    };

    // … interpret part of input
    consumed = sans_io_inner(input, &mut result);
  };

  sans_io_tail(tail, &mut result);
  Ok(result)
}

Seeking could also be an effect. This is a type-safe variant of what wuffs et al. are doing; here resumption must correspond (in a type-system way) to the method by which the coroutine was suspended, but we're still 'just' handling two data buffers underneath. Can we write something similar already?

effect EffectfulSeek {
    effect seek(off: SeekOffset) -> Result<NextBuf<'_>, png::ErrorKind>;
}

While having effects in the same sense as async/await would be nice, it's 'just' sugar around a state machine. The difficult part here would be wrapping all the internals in a way that the transformation is still done by the compiler for us; but we can also use the standard synchronous step function we have in weezl. Could some internal interface be a Future that we never expose? It shouldn't be necessary to have the protocol type interface that comes with effects for standard IO over buffers. While stepping, we collect information on what we want the caller to do, while exposing an interface within the coroutine to (mutably) reference the available buffers between yields. There's probably going to be a need for some unsafe to work around the borrow checker not allowing us to pass that state on continuation, but it sounds roughly feasible to wrap?
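
For reference, the synchronous step-function shape looks roughly like this in today's stable Rust (a sketch; Demand and all other names here are made up, mirroring the request-inspection approach described further down):

/// What the decoder asks its caller to do next; the caller performs the
/// actual I/O and then calls `step` again.
enum Demand {
    MoreInput,
    SeekTo(u64),
    Done,
}

struct StepDecoder {
    bytes_seen: u64, // stand-in for real resumable decoder state
}

impl StepDecoder {
    /// Monomorphic: compiled once, however the surrounding I/O is done.
    /// Returns how much of `input` was consumed plus the next request.
    fn step(&mut self, input: &[u8], output: &mut Vec<u8>) -> (usize, Demand) {
        // Placeholder interpretation: copy input through and ask for more.
        output.extend_from_slice(input);
        self.bytes_seen += input.len() as u64;
        (input.len(), Demand::MoreInput)
    }
}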

197g avatar May 27 '25 17:05 197g

It could be an inlined method that does a bounds check on the remaining contents of an in-memory buffer, and only calls into the cold-path refill logic

That's what I had in mind. This still generates significant code bloat, because you have that refill-and-check code on every access. It's a low compute overhead per se, but the usually-not-taken code is always there. return (read(1)?, read(1)?, read(1)?) generates worse code for buffer-and-refill than a pure slice that can't refill. This is especially true in loops, where the refill strategy forces LLVM to just run the code as-is instead of rearranging the loop to amortize the checks (for a similar reason to why slice[0] + slice[1] + slice[2] is relatively slow and bloated, but if slice.len() > 2 { slice[0] + slice[1] + slice[2] } is fast and efficient).

a state machine that is called with a slice of input data […] pretty complicated and error-prone to implement

This is why I'm complaining that the minimum buffer size in BufRead is 1. This forces the state machine to support reading only 1 byte at a time, which overcomplicates things and creates too many states in too many inconvenient places. Generators (or just await) can make the code more manageable, but they don't fix the code bloat caused by excessively granular state and too many yield points.

Ability to ask for a buffered slice of at least some minimum length removes a lot of states and simplifies the code. You can just ask for 4 contiguous bytes to read an int, instead of having four state machine states capable of yielding after each byte.
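
For example, with a hypothetical fill_at_least method offering that guarantee (no such API exists in std today), the four-state byte-at-a-time machine collapses into a single bounds check:

use std::io;

/// Hypothetical buffered-reader interface with a minimum-fill guarantee.
trait MinBufRead {
    /// Returns at least `min` buffered bytes; fewer only at EOF.
    fn fill_at_least(&mut self, min: usize) -> io::Result<&[u8]>;
    fn consume(&mut self, amt: usize);
}

fn read_u32_le(r: &mut impl MinBufRead) -> io::Result<u32> {
    let buf = r.fill_at_least(4)?;
    if buf.len() < 4 {
        return Err(io::ErrorKind::UnexpectedEof.into());
    }
    let value = u32::from_le_bytes(buf[..4].try_into().unwrap());
    r.consume(4);
    Ok(value)
}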

just a series of read_exact calls

This is convenient and close to zero cost for parsing headers, but for pixel data you may need a slice instead. When you need a slice, a plain io::Read needs an extra buffer and extra copying, which is especially wasteful when reading from a slice. This brings us back to io::BufRead, which isn't an ideal API.

The TIFF decoder in particular needs to constantly jump around

Given that most codecs don't need that, and TIFF is relatively rarely used, maybe we shouldn't require Seek, and the TIFF codec should buffer the data itself? It should be possible to avoid unnecessary buffering, e.g. don't buffer if BufRead::fill_buf() gives you the whole file. It'd be even easier to control if we had our own I/O buffer abstraction. For real streaming, when buffering is needed, the codec may even save memory by keeping only the parts of the file that it will need to seek to, and freeing past buffers as soon as it's done with them. AFAIK EXIF will have offsets within the EXIF data, but it's not supposed to jump into the middle of the pixel data, so the pixel data doesn't need to be buffered for seeking.

kornelski avatar May 28 '25 11:05 kornelski

enum NextBuf<'data> {
    // There is at least one more page of data.
    FullPage(&'data [u8; 4096]),
    // We have less than 4096 bytes until eof remaining, all of them here.
    Eof(&'data [u8]),
}

This is an improvement over BufRead::fill_buf(), but I don't get why effects are needed.

I'm assuming that consume() on this memmoves the data in FullPage from the unconsumed end to the beginning of the buffer, and tops it up. Without memory-mapping cleverness (which I'm not suggesting), we can't keep the old slice between refills, so from the borrow checker perspective the current Rust behavior is okay.

The code will have to yield and refill sometimes, but the guaranteed minimum buffer size will let it consume fixed-size fields without extra yields.

However, maybe the FullPage size should be dynamic? For example, when parsing chunks of (length, [data]), I may want to get the entire data in one slice. It may be small in practice, but it may theoretically be larger than some fixed-size buffer, so the code would need the ability to do a heap allocation if streaming.

kornelski avatar May 28 '25 11:05 kornelski

Implementation sketch: https://github.com/197g/not-io/blob/sans-io/sans-io/tests/from_async.rs

Currently, handling of requests is done by inspecting a Demand enumeration that is returned from each poll of the underlying coroutine. An effect context would codify which type must be re-injected into the coroutine for each variant. It would also be more modular, as that enumeration currently needs to be written manually for all combinations of effects. But it's true, it is not strictly necessary.

Also, in the sketch I've not used the fixed array but opted for a dynamic buffer length compatible with io::BufRead. Somewhat surprisingly, std::io::BufWriter cannot be used for this purpose, as there is no way of passing the internal buffer mutably, so no zero-copy writing with that type.

Ability to ask for a buffered slice of at least some minimum length removes a lot of states and simplifies the code.

Interesting design thought, I'll try to integrate that in the sketch.

197g avatar May 28 '25 13:05 197g

Additional design nugget I stumbled onto: when we are constructing a coroutine or closure, we may move some data into that type. This data is then accessible only within the created block. If it were possible to annotate such a move in a way that makes the data also mutably accessible from the coroutine object itself¹, the code would be much nicer. I'd like this:

struct IoController { /* internals that represent the current buffer */ }

let controller: IoController = …;
let mut the_decoder: impl Future<Output = ()> = async move {
    controller.read(some_amount).await;
    controller.write(some_amount).await;
};

// In the context where we implement the await (here synchronously):

match demand {
    Demand::Write(len) => write!(f, the_decoder.controller.write_buf[..len])?,
                                               ^^^^^^^^^^^
                                               new awesome syntax
    …
}

Currently the only way to share such an attribute with both sides is by complicated pointer shuffling into the waker. (This may get better with Context::ext() -> &dyn Any, but it's still no fun.) This would also be the basis of effect handlers: when invoking an effect, the handler must be able to access and modify the state of the coroutine. Such state cannot be a local variable in an async block, since it must stay alive and not be moved from, and requiring explicit syntactic pin! calls seems .. error-prone and unteachable.

¹while it is not running. And also the annotation is there to check that the bound attribute is never moved from within the coroutine stack.

197g avatar May 28 '25 13:05 197g

Soo ... with even more complication, my implementation now passes under Miri. It would be so much easier if we could simply access the coroutine state by an attribute. The hacks of using a special Waker just so we get a special Context, just so we can pass the buffers without invalidating any of the mutable references involved, are outrageously expensive in concepts and even in code.

197g avatar May 28 '25 14:05 197g

Our current HDR decoder might be a good test case for possible designs. In particular, the RLE decoding portion currently uses read_exact to decode pixel components and currently lacks any explicit state machine logic. The state of which image row, what color component, and how many pixels have been decoded are all just tracked as local variables or implicitly based on which line of code is currently executing.

It would of course be possible to re-write the decoder in a sans-io or BufRead style using a state machine. But is there a way we can get the performance we want for our more widely used decoders, avoid code size/compile time bloat, and still avoid the effort and complexity of a complete rewrite for simple decoders like this one?

fintelia avatar May 28 '25 17:05 fintelia

How about doing the following:

  1. Since there seems to be agreement that nobody likes the current BufReader, we can copy its implementation and change the defaults to ones we prefer. This will also allow us to inline calls.
  2. For decoders that are not problematic, we'll leave them unchanged. Just feed them dyn Read, that's all.
  3. For problematic decoders, wrap the dyn Read in our implementation. Since those are mostly not very widely used codecs, it shouldn't introduce significant overhead.

I know that's not perfect, but it's straightforward to implement and test. And since the main point is that the most problematic decoders are BMP / TGA / HDR, it feels like it could be possible to introduce double-buffering just for them as a trade-off.

awxkee avatar May 28 '25 17:05 awxkee

Fun fact: BufReader<BufReader<R>> doesn't double-buffer if both are constructed with the default buffer size. The outer reader will issue 8 KB reads, and the inner reader will skip the buffer for any reads >= 8 KB. The only waste is an 8 KB heap allocation that's never used.

fintelia avatar May 28 '25 20:05 fintelia

One problem with creating our own BufReader is that all MaybeUninit<u8>-based reads are unstable. Hence, such a reader will necessarily have to initialize its bytes before passing them to the underlying R: Read. This is a performance difference we can hardly make up when the buffer actually needs to grow. Just keep that in mind for benchmark results.

197g avatar May 29 '25 11:05 197g

I actually think it just takes a bit more boilerplate to rewrite everything bound to MaybeUninit<u8> in safe Rust. The buffer is allocated only once, and afaik we don't intend to allocate more than one buffer on top. Therefore, one additional zero-initialization of the buffer shouldn't really make a difference. However, as @fintelia noted, we could wrap BufReader<BufReader<R>> for BMP / TGA / HDR.
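
A minimal sketch of that safe-Rust version, assuming a fixed-capacity buffer allocated once up front:

use std::io::{self, Read};

struct OwnedBuf {
    buf: Vec<u8>, // one zeroing memset at construction, then reused forever
    filled: usize,
}

impl OwnedBuf {
    fn new() -> Self {
        // Safe Rust: pay one zero-initialization instead of MaybeUninit reads.
        OwnedBuf { buf: vec![0u8; 8 * 1024], filled: 0 }
    }

    /// Top the buffer up from `reader`; returns the number of new bytes.
    fn refill(&mut self, reader: &mut dyn Read) -> io::Result<usize> {
        let n = reader.read(&mut self.buf[self.filled..])?;
        self.filled += n;
        Ok(n)
    }
}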

awxkee avatar May 29 '25 14:05 awxkee

BufReader doesn't have any specialization, so BufReader<&[u8]> still naively buffers; it wouldn't be great to take io::Read and always wrap it in a BufReader.

Maybe the codecs could have separate methods for taking io::Read (and wrapping it), and io::BufRead?

Or how about taking io::Read + Any and trying to implement a DIY dynamic specialization? (without Any on the type's bound).
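
A sketch of what that DIY dynamic specialization could look like via downcasting (all names hypothetical; Any's 'static requirement is the main constraint, so the probe below uses Cursor<Vec<u8>> rather than Cursor<&[u8]>):

use std::any::Any;
use std::io::{Cursor, Read};

/// Hypothetical object-safe trait combining Read with runtime type access.
trait ReadAny: Read {
    fn as_any_mut(&mut self) -> &mut dyn Any;
}
impl<R: Read + Any> ReadAny for R {
    fn as_any_mut(&mut self) -> &mut dyn Any { self }
}

fn decode(reader: &mut dyn ReadAny) {
    // Poor man's specialization: probe for the common in-memory type first.
    if let Some(cursor) = reader.as_any_mut().downcast_mut::<Cursor<Vec<u8>>>() {
        let data = cursor.get_ref().clone(); // clone just to keep the sketch short
        decode_slice(&data);
        return;
    }
    decode_stream(reader); // fallback: generic dyn streaming path
}

fn decode_slice(_data: &[u8]) { /* slice fast path, monomorphized once */ }
fn decode_stream(_reader: &mut dyn ReadAny) { /* dyn streaming path */ }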


There's a proposal that would make 1-byte fill_buf less bad:

https://github.com/rust-lang/libs-team/issues/569 https://github.com/rust-lang/rust/issues/128405

kornelski avatar May 29 '25 15:05 kornelski

BufReader doesn't have any specialization, so BufReader<&[u8]> still naively buffers; it wouldn't be great to take io::Read and always wrap it in a BufReader.

Maybe the codecs could have separate methods for taking io::Read (and wrapping it), and io::BufRead?

That's kind of part of the idea.

The issue with having different implementations for Read and BufRead is that it requires compiling all the codecs 3 times instead of 2.

awxkee avatar May 29 '25 15:05 awxkee

But is there a way we can get the performance we want for our more widely used decoders, avoid code size/compile time bloat, and still avoid the effort and complexity of a complete rewrite for simple decoders like this one?

I've experimented with this a bit; mixed results. So, without further context, here is the RLE code just parsing data, not returning it in any way:

Method          Mean        Samples
-----------------------------------------
sans-io-16b     173.40 ns   97,542/100,000
with-io-16b      82.04 ns   97,775/100,000
sans-io-8kb     782.50 ns    9,732/10,000
with-io-8kb       2.22 μs    9,877/10,000

As an interesting data point, actually having callbacks leads to worse codegen.

Method          Mean        Samples
-----------------------------------------
sans-io         194.06 ns   97,010/100,000
with-io          90.43 ns   96,995/100,000
sans-io-long      7.88 μs    9,699/10,000
with-io-long      5.03 μs    9,729/10,000

In the RLE code we assign data into an outer vector through a callback (|pos, val| block[pos] = val). LLVM no longer eliminates any of these bounds checks when I try to write this in the async style, and consequently the code spends 60% of its time in that loop. Which is curious, since the slice is of course still taken by mutable reference and all of this could be inlined, but that gets lost somewhere: either it is no longer local to the execution of the loop, or some of the additional potential writes in the coroutine prevent the optimization. I may investigate further.

However, as we also see, in some cases the style leads to better codegen. Note that the tiny example has setup overhead, as it must initialize the task::Context; this is constant overhead in a bigger system.

So, it definitely isn't a simple rewrite but a very interesting style.

197g avatar May 30 '25 17:05 197g

I would be keen to see all std::io types/traits eliminated from (or made optional in) image's API, just so that it is usable in no_std environments. (A flexible sans-io API that we build other APIs on top of would be awesome if we could get it while still retaining performance, but even an API like the current one with custom types/traits could work.)

As a concrete use case, vello_cpu is a canvas-like renderer which is entirely no_std compatible except for its dependency on the png crate (which I believe is managed by the image-rs project and in scope for this discussion), which it uses to decode emoji embedded in fonts.

nicoburns avatar Jun 08 '25 23:06 nicoburns

The ideal design is probably having a decompressor implemented as a state machine that is called with a slice of input data and can return to the caller when it needs more input. But that's pretty complicated and error-prone to implement. When I did it for fdeflate, it took significant testing and fuzzing to gain confidence in the implementation.

I've tried to implement a slice-based state machine for the TGA decoder (see test branch https://github.com/mstoeckl/image/commits/slice-tga/), and it doesn't seem that bad. But I am heavily relying on the fact that the TGA file contains just a single long RLE-encoded segment and that its RLE format has single-byte opcodes (which allows identifying run properties even with BufRead's single-byte minimum, and ensures all runs are short enough to copy onto the stack when a reader implementing BufRead doesn't have a large enough buffer).

  • slice decoder: https://github.com/mstoeckl/image/blob/7d389bfdd6922d675c7519dc815480813bb61c88/src/codecs/tga/decoder.rs#L80-L173
  • buffering logic: https://github.com/mstoeckl/image/blob/7d389bfdd6922d675c7519dc815480813bb61c88/src/codecs/tga/decoder.rs#L436-L526

I haven't tried it, but I think BMP could be handled with a similar slice-decoding function, with the only state being the current (x, y) painter position in the image. HDR, on the other hand, has a scanline structure and an old RLE method with 4-byte opcodes, which looks like it requires either a somewhat complicated state machine or somewhat complicated code to efficiently run a slice-based decoder with input from BufRead.

mstoeckl avatar Sep 14 '25 13:09 mstoeckl

Now that we can make breaking changes, it's a good idea to revisit #2468 and make other changes before freezing the API again for a while.

Shnatsel avatar Nov 18 '25 19:11 Shnatsel