redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

serde: require explicit compat_version

Open jcsp opened this issue 2 years ago • 6 comments

Cover letter

This addresses an underlying cause of the bug in https://github.com/redpanda-data/redpanda/pull/7312

Previously, serde defaulted to assuming an envelope was only compatible with its own latest version, which caused bugs when the encoding of a structure was updated, and the developer forgot to add a compat_version indicating that the struct should still be able to decode the previous version.

Backport Required

  • [x] not a bug fix
  • [ ] issue does not exist in previous branches
  • [ ] papercut/not impactful enough to backport
  • [ ] v22.3.x
  • [ ] v22.2.x
  • [ ] v22.1.x

UX changes

None

Release notes

  • none

jcsp avatar Nov 16 '22 15:11 jcsp

Looks good to me

graphcareful avatar Nov 16 '22 16:11 graphcareful

Rebased on the fixes. When they merge, this gets rebased again, then need to eyeball the locations we fixed to make sure this PR doesn't accidentally revert any fixes.

jcsp avatar Nov 16 '22 17:11 jcsp

do i understand correctly that the issue here is that compat_version defaults to version? if so, it would be sufficient to change the default behavior? i'm not arguing against this patch which forces it to be explicit, just checking my understanding.

True that we could default to zero (i.e. by default struct can read any older version) -- I went with making it explicit because otherwise developers just won't have the concept of a compat version in their head until something goes wrong.

jcsp avatar Nov 16 '22 21:11 jcsp

True that we could default to zero (i.e. by default struct can read any older version) -- I went with making it explicit because otherwise developers just won't have the concept of a compat version in their head until something goes wrong.

+1

dotnwat avatar Nov 16 '22 21:11 dotnwat

this is a major oversight. where did we have a unit testing gap @felixguendling? (presumably we would have caught this in our corpus-based testing).

@graphcareful does this mean that there is a gap in testing for the rpc evolution test you wrote a while back?

dotnwat avatar Nov 16 '22 21:11 dotnwat

This is good to go

jcsp avatar Nov 22 '22 10:11 jcsp

this is a major oversight. where did we have a unit testing gap @felixguendling? (presumably we would have caught this in our corpus-based testing).

I think this was nothing we would find via unit testing. It was a bad library design decision to enable the user to default the compat_version. This was not according to the principle "make it easy to use correctly and hard to use incorrectly". Corpus-based testing should've caught this.

felixguendling avatar Nov 24 '22 09:11 felixguendling

bad library design decision to enable the user to default the compat_version.

perhaps, but i think it was more that the default wasn't 0 that was the problem. so kinda curious about that. in any case, seems we are back in business now :)

dotnwat avatar Nov 28 '22 23:11 dotnwat