rust-msi icon indicating copy to clipboard operation
rust-msi copied to clipboard

Fix the problem with long strings

Open Landay7 opened this issue 1 year ago • 11 comments

Hi! Thank you for your work, I tried to play with it and figured out that there is an issue with the long string. Basically, it doesn't read fully long string and after that, all strings are messed up.

You can try it by yourself with the next example: Download SourcetreeEnterpriseSetup_3.4.15.msi and run:

cargo run --example msiquery "/path/to/SourcetreeEnterpriseSetup_3.4.15.msi"  "Select * from Property"

And you will see the difference.

If you have any comments or better name suggestions, please let me know, and I will fix it :)

Landay7 avatar Oct 29 '23 13:10 Landay7

Thanks for tracking this down and sending the PR! I tried out the example MSI and query you provided, and I can see that it gives garbled results without this change, and reasonable results with the change.

However...now I'm very confused, because this contradicts my understanding of how the string pool works. This PR seems to imply that the 32-bit length for a long string is stored after the initial (zero length, refcount) entry (and the last 16 bits are both the high word of the length and the refcount? That can't be right, can it?). But I had thought that the "refcount" field of the initial entry formed the high 16 bits of the length, and the "length" field of the subsequent entry formed the low 16 bits (and the "refcount" field of that second entry is the real refcount). And indeed, that's what the msitools implementation seems to do: https://gitlab.gnome.org/GNOME/msitools/-/blob/master/libmsi/string.c#L641. So is that implementation wrong, or is this MSI file incorrectly formatted?

Unfortunately I'm having trouble at the moment finding any citation for how the string pool is supposed to be formatted (beyond the msitools implementation). Do you know of any documentation for this?

mdsteele avatar Nov 02 '23 18:11 mdsteele

Hello! No, I found this result in a few examples, which I tried and which have the same issue. I will try to check msitools. But their code is harder to read :) Maybe you know some examples, of where that condition works?

Landay7 avatar Nov 02 '23 18:11 Landay7

I'm still very confused what's going on here.

I found another string pool reader implementation in this post; here's the relevant snippet:

    for (iSrc=1, iStringId=1; iSrc<nStringPoolLength; iSrc++) {
        DWORD dwLen = pStringPool[iSrc].wLength;
        if (pStringPool[iSrc].wLength == 0) {
            // A string is lagrer as 64K. In the case one create one dummy entry with pStringPool[iStringId].wLength
            // and high word of string length saved in the next entry will be saved in pStringPool[iStringId].wRefcnt
            if (pStringPool[iSrc].wRefcnt == 0) // empty entry
                iStringId++;
            continue;
        }
        if (iSrc != 1 && pStringPool[iSrc-1].wLength == 0 && pStringPool[iSrc-1].wRefcnt != 0)
            // current string have length over 64K
            dwLen += pStringPool[iSrc-1].wRefcnt << 16; //* 0x10000;
        ...

It seems to be doing the same thing as the msitools implementation I linked to before, namely: a long string is represented by two consecutive entries, the first with length=0 and refcount=(hi 16 bits of length), and the second with length=(lo 16 bits of length) and refcount=(the refcount). And that's exactly what the msi crate does too (unless there's a bug there that my eyes keep skipping over?).

mdsteele avatar Mar 15 '24 22:03 mdsteele

Probably, I need to build msitools and check if it works there :)

Landay7 avatar Mar 16 '24 01:03 Landay7

I think whatever the reason is, the lint error should probably be fixed by removing a space. Not sure why the testing workflow (arguably the more important one) is locked behind linting.

aaronliu0130 avatar Mar 16 '24 02:03 aaronliu0130

msitools works correct on SourcetreeEnterpriseSetup_3.4.15.msi, maybe there is some condition at their code, that I miss

Landay7 avatar Mar 24 '24 18:03 Landay7

Oh, interesting. Sounds like definitely a bug in this crate, then; we just need to figure out what msitools is doing differently that I keep not seeing. Probably something silly, but I haven't spotted it yet.

mdsteele avatar Mar 24 '24 20:03 mdsteele

Is the example working because the refcount and high word are both 1 by accident? Looks like lines 106&107 need swapping.

jan-g avatar Apr 19 '24 07:04 jan-g

It seems to be doing the same thing as the msitools implementation

That's the encoding function used by msitools. The decoding function seems to disagree with it though, as it expects the refcount of long strings to be in the same position as short strings (currently ignored by this PR, and used for the length low 16 bits by msitools' encoding function) and the next entry to contain the low 16 bits of the string followed by the high 16 bits (essentially the 32 bit length in little endian). I also found ReactOS source code and it also seems to agree. So I guess there is actually a bug in msitools' encoding function (edit: opened https://gitlab.gnome.org/GNOME/msitools/-/issues/60).

In the end this should be the proper fix:

--- a/src/internal/stringpool.rs
+++ b/src/internal/stringpool.rs
@@ -100,11 +100,9 @@ impl StringPoolBuilder {
         let mut lengths_and_refcounts = Vec::<(u32, u16)>::new();
         while let Ok(length) = reader.read_u16::<LittleEndian>() {
             let mut length = length as u32;
-            let mut refcount = reader.read_u16::<LittleEndian>()?;
+            let refcount = reader.read_u16::<LittleEndian>()?;
             if length == 0 && refcount > 0 {
-                length = ((refcount as u32) << 16)
-                    | (reader.read_u16::<LittleEndian>()? as u32);
-                refcount = reader.read_u16::<LittleEndian>()?;
+                length = reader.read_u32::<LittleEndian>()?;
             }
             lengths_and_refcounts.push((length, refcount));
         }

I tested it with the file provided by OP and it works.

SkiFire13 avatar Apr 19 '24 08:04 SkiFire13

@mdsteele I think that @SkiFire13 right. I adjusted due to his suggestion and tested with a few msi files and it looks okay. Please check also, I will run more tests on the weekend. Now it looks similar to decoding function

Landay7 avatar Apr 19 '24 10:04 Landay7

Ahhh, thanks so much for figuring that out. That does seem like a more reasonable encoding.

I think this PR still needs a corresponding change to StringPool::write_pool(). And ideally, a test showing that a long string can round-trip through encode/decode (i.e. the test would create a new package and put a long string somewhere in it, serialize the package to bytes, then re-parse the file and extract the long string).

mdsteele avatar May 04 '24 17:05 mdsteele