c2rust Append to `Metadata` and merge upon deserialization

Fixes #586.

Atomically append to the Metadata file (error if not written all at once) and then merge upon deserialization.

See https://github.com/immunant/c2rust/issues/586#issuecomment-1210068124 for a discussion of solutions, of which this is one.

For each rustc wrapper call, we open in append mode, serialize to a Vec<u8>, and then do a single write (not write_all). If the write doesn't write all of its data, error, as only single writes are atomic*. This will leave consecutive Metadatas in the file, so when reading, we need to keep deserializing Metadatas until we reach the end of the file, and then merge the Metadatas into one.

* This should be fine unless there is a signal that interrupts the write (or are there other cases the full write wouldn't go through?), or if the write is > 2 GB: from write(2):

On Linux, write() (and similar system calls) will transfer at most 0x7ffff000 (2,147,479,552) bytes, returning the number of bytes actually transferred. (This is true on both 32-bit and 64-bit systems.)

This was easier to implement than the other option (write to separate files and then read them in, merge, and write the final one), but it may be more brittle if writes don't write all at once (write is atomic on almost all platforms and filesystems (not old NFS versions)). It currently works to instrument and run lighttpd.

Aug 10 '22 22:08 kkysen

I'm not the right person to review this - I haven't worked on any of this code. I also still don't like this design, since it's unnecessarily reliant on implementation details of the platform.

Aug 10 '22 23:08 spernsteiner

I'm not the right person to review this - I haven't worked on any of this code. I also still don't like this design, since it's unnecessarily reliant on implementation details of the platform.

I added you since I thought you have opinions on the implementation strategy. My thinking is that it's simpler to implement, most of this needs to be implemented for the other strategy anyways, we want a working solution sooner, and it's not likely to break at all for any of our current use cases (and if does so, it'll panic, not silently corrupt things). By the time we need something more robust, we may already have moved on from the current implementation. That being said, I can work on the other implementation next, on top of this.

Aug 11 '22 21:08 kkysen

My thinking is that it's simpler to implement, most of this needs to be implemented for the other strategy anyways, we want a working solution sooner, and it's not likely to break at all for any of our current use cases (and if does so, it'll panic, not silently corrupt things).

Why is this urgent; is this blocking someone? Generally, we want to avoid temporary solutions because there is a risk that they become permanent. How much time do you think you'll need to address Stuart's concerns?

This was easier to implement than the other option (write to separate files and then read them in, merge, and write the final one), but it may be more brittle

I don't like that we're prioritizing ease of implementation over robustness. Even if the overall lifting feature is not supported on macOS, being platform agnostic is preferable IMHO.

Aug 11 '22 21:08 thedataking

Why is this urgent; is this blocking someone? Generally, we want to avoid temporary solutions because there is a risk that they become permanent. How much time do you think you'll need to address Stuart's concerns?

About 90% of this solution is also necessary for the separate files one, so to me it makes sense to merge this one first, and then work on the other solution as an improvement. This way we keep the PRs smaller and more self-contained, and also get a working solution merged so that lighttpd can be regression tested (we often want to test lighttpd while changing the instrumentation code to make sure things still work).

There are also tons of other places where we prioritize a simpler, more narrow solution first to get things in working order before working on a better, more general solution.

Aug 11 '22 22:08 kkysen

Yes, this adds new portability questions that we previously didn't have to worry about. The Linux atomic-write limit is high enough for our purposes, but the fact that there's a limit at all suggests that arbitrary limits are allowed by the spec. Now we have to wonder: what does the limit look like on macOS, Windows, or FreeBSD? Are there certain kernel configurations that would reduce the limit? Can different filesystems have different limits? And so on.

The approach using separate metadata files avoids all these questions, and it doesn't sound like it will be too much more complex (though I admit I don't understand all the requirements here).

Aug 11 '22 22:08 spernsteiner

I'm going to implement that more robust solution; I just want to merge this first. Almost all of this code is necessary for the other solution, too, and it fixes things in the meantime well-enough that we can test lighttpd on our machines.

Aug 11 '22 22:08 kkysen

Could we use file locking to synchronize the updates to the metadata file? I found the fs2 crate that supports this cross-platform (see the trait at https://docs.rs/fs2/latest/fs2/trait.FileExt.html).

Aug 11 '22 22:08 ahomescu

Could we use file locking to synchronize the updates to the metadata file? I found the fs2 crate that supports this cross-platform (see the trait at https://docs.rs/fs2/latest/fs2/trait.FileExt.html).

We could try that, too. A file-locking solution would also re-use 90% of the changes in this PR, so I still think we should merge this one first (and a good reason to, too, as it's a common base from which we can implement either separate files that are merged or file locking).

Aug 11 '22 23:08 kkysen

I'm going to extract the majority of this PR that deals with reading multiple Metadatas from a file, as we'll need that for any of the solutions.

Aug 12 '22 00:08 kkysen

Closing in favor of #604 (subset of this) and #605 (append with file locking).

Aug 12 '22 01:08 kkysen

c2rust c2rust copied to clipboard

Append to `Metadata` and merge upon deserialization

c2rust
c2rust copied to clipboard