bitcode icon indicating copy to clipboard operation
bitcode copied to clipboard

Is it possible to encode recursive types?

Open artegoser opened this issue 1 year ago • 12 comments

image

image

image

In serde version there is error: type changed

artegoser avatar Aug 31 '24 17:08 artegoser

Recursive types are not implemented (edit: for Encode) and would be tricky to implement based on how [email protected] encodes nested values. You can use [email protected] instead, which works very differently and has a #[bitcode(recursive)] attribute for Encode.

Edit: for serde, see https://github.com/SoftbearStudios/bitcode/issues/35#issuecomment-2323156232

finnbear avatar Aug 31 '24 17:08 finnbear

Thx. So, is this feature not planned? Or will it be possible again someday?

artegoser avatar Aug 31 '24 18:08 artegoser

Not currently planned, but the documentation could be more clear about this so I'll leave the issue open.

finnbear avatar Aug 31 '24 18:08 finnbear

Can use 0.6 with serde if you remove serde untagged from your enums. Serde untagged breaks most binary formats even if some don't complain right away.

caibear avatar Sep 01 '24 04:09 caibear

Yes, it works, only the purpose in reducing the amount of data is not fulfilled. In this case enum identifier is written, which significantly increases the file size. I compared with dlhn, however it cannot decode this. The size becomes larger than my daletpack format, which does not compress well. Is it possible to implement serde(untagged) support in your case?

artegoser avatar Sep 01 '24 06:09 artegoser

Is it possible to implement serde(untagged) support in your case?

No, because #[serde(untagged)] works by trying all possible variants until one deserializes. No version of bitcode is self-describing, so it's possible that the wrong variant would deserialize without an error but the data would be corrupt.

Consider the case of #[serde(untagged)] enum { Big(u16), Small(u8) }. If you serialized Small(5) with serde_json, it would deserialize as Big(5). In bitcode, #[serde(untagged)] enum { Int(usize), Str(String) } could have the worse problem that bytes from a string encoding could be interpreted as an integer. Data corruption is not something we tolerate in bitcode, although most likely the entire message would fail to deserialize.

Making bitcode self-describing (per-instance, to support #[serde(untagged)]) would probably cost much more data than the enum variants that #[serde(untagged)] tries to omit.

The #[serde(untagged)] feature is better for textual formats like JSON, where different types have disjoint prefixes like [ or {.

The size becomes larger than my daletpack format

If you can provide specific details (schema + encoding) of how you can outperform bitcode on size, that would be interesting :eyes:

finnbear avatar Sep 01 '24 07:09 finnbear

If you can provide specific details (schema + encoding) of how you can outperform bitcode on size, that would be interesting 👀

This format is specifically designed for my schema, but it doesn't compress well (I'm not very knowledgeable about how to optimize this), so I'm looking for solutions that are already out there. It's not for all data types, so it won't help you.

For my format, I had the idea of exposing each type (if it's an enum) and writing a separate identifier for each one.

For example

enum Enum {
  First(Other),
  Second(Other)
}

enum Other {
  String(String),
  Num(u8)
}

Enum in this case has 4 binary identifiers for each type

Enum::FirstString Enum::FirstNum

Enum::SecondString Enum::SecondNum

I don't know how procedurally realizable this is, but it should work. I don't know if this structure compresses well either.

I partially implemented this in the daletpack scheme (but still eventually would like to do it for all types and procedurally)

artegoser avatar Sep 01 '24 07:09 artegoser

After running your benchmark it looks like various formats take ~300 bytes. I would recommend testing on a larger dataset if you want to accurately benchmark compression. Compression algorithms are generally inefficient in terms of size and speed on input smaller than tens of kilobytes.

Also zlib is just a wrapper of deflate.

caibear avatar Sep 01 '24 08:09 caibear

Hi ! I have a few types that contain serde_json::Valuess, and I also get a panic with type changed. I checked and it seems that serde_json is not using #[serde(untagged)]. What's going on ?

DontBreakAlex avatar Sep 13 '24 16:09 DontBreakAlex

I recently added a warning to the README about this; serde_json::Value has a custom Serialize implementation that is very similar to an untagged enum.

finnbear avatar Sep 13 '24 16:09 finnbear

Not only #[serde(untagged)], but also #[serde(tag = "type")] and #[serde(tag = "t", content = "c")] cause panic: type changed

artegoser avatar Mar 22 '25 19:03 artegoser

Thanks, I noted this on https://github.com/SoftbearStudios/bitcode/wiki/Serde

finnbear avatar Mar 22 '25 20:03 finnbear