calamine icon indicating copy to clipboard operation
calamine copied to clipboard

Byte buffers holding XLUnicodeString prematurely truncated

Open sftse opened this issue 5 months ago • 9 comments

This PR is another batch of commits from #463.

The first commit b5b948a6caa6c23708e2bc26d8fc1e322b7c4126 introduces a test that fails. 1c58893cff3066d10b00afd8caeeaede098420a0 fixes one of the mistakes, ~~so that a different error message appears. 599eae8ce442a922b3ccf8982471f91d143fac3f fixes the next failure due to a spec nonconformity of the test file, which seems like a logical extension of the standard.~~ Already fixed in upstreamed PR.

Edit: updated commit hashes after push

sftse avatar Jul 02 '25 09:07 sftse

Looks like a good patchset with tests and a sensible fix. I vote +1 for merge and if there is no dissent I will merge in a few days.

jmcnamara avatar Jul 03 '25 10:07 jmcnamara

Before merge I want to add the particular, spec-nonconformant XLUnicodeString as a unit test.

sftse avatar Jul 04 '25 14:07 sftse

Before merge I want to add the particular, spec-nonconformant XLUnicodeString as a unit test.

Okay. I'll wait for that. Let me know.

jmcnamara avatar Jul 04 '25 14:07 jmcnamara

While revisiting this to add a unit test, I noticed that the formula that is triggering this high byte string panic is also the one from row 43 col 9 that was the point of discussion here.

This makes me think there is more to this test than initially thought. The other thing I'm not able to piece together is that the XlsEncoding is windows-1252, so the high-byte being set for two byte characters doesn't make a lot of sense? The root cause for this error might be something else, and the patch does fixup an error, just a different one, while accidentally fixing the crash.

If there are any doubts about how the MS-XLS 2.5.296 says one should read an XLUnicodeStringNoCch, then we should hold off merging, if not I think we can merge? A testcase would be nice though, since I'm no longer certain the .xls file added even contains any.

sftse avatar Jul 06 '25 12:07 sftse

Actually, let's hold off on merge until we can investigate this further.

sftse avatar Jul 06 '25 12:07 sftse

While revisiting this to add a unit test, I noticed that the formula that is triggering this high byte string panic is also the one from row 43 col 9 that was the point of discussion here.

I created a test xls file called j44.xls that contains the formula =REPT("O",I44) in cell J44 like in the failing test file high_byte_string.xls.

j44.xls

This has the following binary data for the formula:

    17 01 00 4F 44 2B 00 08 C0 41 1E 00

I intrepreted this as:

17    ptgStr
0001  cch = 1 and fHighByte not set.
4F    string "O"

44    ptgRe = 0x04 +  PtgDataType = 0x2 = Value
002B  Row 43
08    Col  8 (combined to make I44)
C0    = 1100_0000b. Bit 8 indicates row is relative (not absolute). Bit 7 is the same for columns.

41    ptgFunc Value
001E  REPT

(As an aside, I said in another comment that C0 was the record "TOOLBAREND" but after a bit of detective work I re-reminded myself that the second byte of the column word is used to store 2 bits that indicate if the row or column are relative or absolute. I.e., A1 vs $A$1)

The formula in high_byte_string.xls looks like this:

   17 01 4F 44 2B C0 08 41 1E 00 

Comparing the 2 binary data sets:

high_byte: 17 | 01    4F | 44 2B C0 08    | 41 1E 00

j44:       17 | 01 00 4F | 44 2B 00 08 C0 | 41 1E 00

There are 2 odd things here.

  1. There is no fHighByte field in the the ShortXLUnicodeString. ~~This may be because, as you point out, the CODEPAGE is 1252 which might mean that all strings are simple single byte encodings so the encoding/fHighByte field can be omitted.~~ It is probably a XLUnicodeStringNoCch string as you point out.
  2. The row in the cell reference appears to be a single byte only. And the C0 token is in an odd place.

All in all, it looks like a junk formula except for the fact that Excel (including a couple of older versions that I have) reads it. This is one of the reasons I stopped working with the xls format once xlsx came along. :-)

We could maybe just fix the cch = 1 issue and not fix whatever this is.

Update 2: I think I understand the difference between the 2 ptgRef structures above.

The first in high_byte_string.xlsx probably uses an older 14 bit encoding for rows up to 16,383 in Excel 95 and earlier and uses bits 15-16 to encode the row/col relative/absolute. I couldn't find a 3-byte RgceLoc (2.5.198.109) variant but it is in the old paper copy I have. On the other hand both test files report Biff5 in the BOF so I don't know how a parser should distinguish between a 3 byte and 4 byte RgceLoc.

The second is the 4 byte RgceLoc (2.5.198.109) with the row/col relative/absolute bit in the column entry.

jmcnamara avatar Jul 06 '25 16:07 jmcnamara

@sftse I undated the previous comment to clarify what C0 is.

jmcnamara avatar Jul 06 '25 22:07 jmcnamara

@sftse let me know when this patchset ready for merge?

jmcnamara avatar Jul 07 '25 11:07 jmcnamara

Haven't forgotten about this, thanks for your review. Rebased, will revisit.

sftse avatar Oct 30 '25 16:10 sftse