jackson-dataformats-binary icon indicating copy to clipboard operation
jackson-dataformats-binary copied to clipboard

Issue in Smile shared string values due to avoiding 0xFE and 0xFF

Open vooft opened this issue 1 month ago • 5 comments

Hello!

I'm working on my own smile serializer/deserializer and, I think I found an issue in the Jackson implementation that does not match the specification.

https://github.com/FasterXML/jackson-dataformats-binary/blob/1e4017a99143f0c6eb1ce86af4e144a2eebd1c0c/smile/src/main/java/com/fasterxml/jackson/dataformat/smile/SmileGenerator.java#L2732-L2743

According to the spec, if a (potentially) shared string is seen for the second time, only the first index must be used in a lookup table:

Note: when a shared String is written or parsed, no entry is added to the shared value buffer (since one must already be in it)

This code would skip particular indexes (for example, 254), but when the same string appears for a second time, a new (different) index will be assigned, which may not be the same as on the decoder side, unless the same skipping logic is replicated.

Not sure what is the best solution here, perhaps explicitly forbid specific indexes in the backreference lookup in a spec?

vooft avatar May 15 '24 14:05 vooft

Hmmh. Not sure I understood the exact description but I think that:

  1. Decoder must use same index as encoder (obviously, otherwise result is wrong)
  2. There should be no stipulation that everything that could be shared must be shared -- it is legal (if not optimal) to keep on adding Strings instead of using back-reference (otherwise lookup structures must be able to keep all N referencable Strings)
  3. Yes, if not mentioned, preventing use of specific indexes should be documented.

This should probably be tackled at:

https://github.com/FasterXML/smile-format-specification/

cowtowncoder avatar May 15 '24 17:05 cowtowncoder

Sorry for not being clear.

The problem here in particular is due to skipping the index and not remembering which value was skipped.

For example:

  1. Value "test" happened to be 254th, the encoder wrote it as tiny ascii, but it was just not added to the back-referencing table.
  2. On the decoder side it was recorded in the back-referencing table as normal.
  3. There is no issue yet at the moment, because the next index for both encoder and decoder is still 255.
  4. If we never see "test" again, everything is fine, because the only occurrence was a single tiny ascii token and we never try to resolve the shared value number 254.
  5. However if we do see a "test" again, encoder and decoder may do different things:
    1. Encoder will write it again as a tiny ascii, but it will also assign a new index (in my case it was 307).
    2. Decoder sees tiny ascii "test" again and assumes this is just a non-shared repeated token and will not create a new index (as in point 2 in your answer).
  6. From now on, any new indexes will differ by 1 in encoder vs decoder.

IMO the best option would be to document the current Jackson behavior in the spec (i.e. skip first occurrence, remember the second one), since everyone is using Jackson as a reference anyways.

Another option would be implementing a logic to remember which strings were skipped on the encoder side and never try to share them again. In my example, since "test" was skipped the first time, from now on it will only be written as a tiny ascii and never as a shared value. From compatibility perspective, this is probably a better option.

Shall I create another issue in https://github.com/FasterXML/smile-format-specification/ and close this one?

vooft avatar May 16 '24 05:05 vooft

@vooft Ok. Right: so the logic should indeed be that EVERY String written out occupies "slot", but some of these (254) are NOT allowed to be back-referenced by encoder. For decoder it should not really matter, it simply should not see any such references. It may keep it in decoder table, or (if it wants to super-optimize :) ) not store. But it has to consider that index taken.

And in fact this behavior is how it has to be: otherwise value 254 would not be avoided.

However: I don't think "test" value per-se could not be referenced, only slot #254 must be avoided. If and when it is repeated, following occurrence could be referenced. I also do not think specification should strictly prohibit duplicates in encoding table -- ideally no duplicates should be kept, but some encoders could allow that by necessity (hash tables with collisions and no overflow chains -- i.e. imperfect set of references kept).

So yes, I think I agree: current behavior should be clarified.

Does above make sense?

cowtowncoder avatar May 17 '24 01:05 cowtowncoder

Yes, absolutely, thank you for the clarification!

Just a thought: wouldn't it be simpler just to skip those invalid indexes and pick a next valid one? Like, in this case 254 and 255 must be skipped, but "test" could be remembered as 256 on the first occurrence.

vooft avatar May 17 '24 05:05 vooft

If I was to specify behavior from the scratch, yes, but for backwards compatibility reasons (existing Smile encoded data; earlier versions), that would be majorly backwards-incompatible change -- if I understand the suggestion correctly.

So I don't think that can be done, even tho it is a good idea on its own, if we didn't have to worry about existing usage and codec versions.

cowtowncoder avatar May 17 '24 22:05 cowtowncoder