
Instance for Double/Float is absolutely batty

Open ezyang opened this issue 10 years ago • 15 comments

Apparently, binary represents a Double as a tuple of (Integer, Int)? This means that doubles suffer a 3x or greater size explosion, when really you could just record the IEEE floating-point bits with a fixed endianness. This would also fix #64.

Backwards compatibility might be a concern for fixing this, however.
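For illustration, a minimal sketch of what a fixed-width IEEE encoding could look like (not binary's API; the function names are made up, and it assumes base >= 4.10 for GHC.Float's bit casts):

```haskell
import Data.Binary.Put (Put, putWord64be)
import Data.Binary.Get (Get, getWord64be)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- Hypothetical encoding: a Double as its raw IEEE-754 bit pattern,
-- big-endian, always exactly 8 bytes. The current instance instead
-- serialises decodeFloat's (Integer, Int) pair, which is much larger.
putDoubleIeee :: Double -> Put
putDoubleIeee = putWord64be . castDoubleToWord64

getDoubleIeee :: Get Double
getDoubleIeee = castWord64ToDouble <$> getWord64be
```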

ezyang avatar Feb 17 '15 19:02 ezyang

Unfortunately, when we picked the serialisation format 8 years ago we were not so careful, and the format has not changed since then. Backwards compatibility is the main reason for not fixing silly things like this issue.

If we do a revision of the API in the future that requires code changes in user code, then we could take the opportunity to also fix things like this.

kolmodin avatar Feb 17 '15 20:02 kolmodin

This is indeed an annoying issue for code that needs to serialize a lot of Floats/Doubles (scientific code, for example). There are other suboptimal instances too: serializing a list forces its spine, and an empty list is serialized to 8(!) bytes.
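For example, a quick size check (assuming the current instances; exact counts may vary by binary version):

```haskell
import Data.Binary (encode)
import qualified Data.ByteString.Lazy as BL

main :: IO ()
main = do
  -- The list instance writes the length as a 64-bit Int first,
  -- so even an empty list costs 8 bytes.
  print (BL.length (encode ([] :: [Double])))
  -- A single Double goes through decodeFloat and is encoded as an
  -- (Integer, Int) pair, considerably more than 8 bytes.
  print (BL.length (encode (pi :: Double)))
```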

Another option is to introduce another type class with the same API, with a well-documented serialization format that fixes the problems with the current format, and encourage people to switch to it.

Shimuuar avatar Jun 04 '15 17:06 Shimuuar

If you have a lot of Floats/Doubles, you should use an array of some kind, wrap it in a newtype and write your own more efficient Binary instance.
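For example, a minimal sketch along those lines (the DoubleVec wrapper and its encoding are made up here; it assumes the vector package plus base >= 4.10 for the bit-cast functions):

```haskell
import Control.Monad (replicateM)
import Data.Binary (Binary (..))
import Data.Binary.Get (getWord64be)
import Data.Binary.Put (putWord64be)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)
import qualified Data.Vector.Unboxed as U

-- Hypothetical wrapper: a length prefix plus 8 bytes per element.
newtype DoubleVec = DoubleVec (U.Vector Double)

instance Binary DoubleVec where
  put (DoubleVec v) = do
    put (U.length v)
    U.forM_ v (putWord64be . castDoubleToWord64)
  get = do
    n  <- get
    xs <- replicateM n (castWord64ToDouble <$> getWord64be)
    return (DoubleVec (U.fromList xs))
```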

If your goal is just to serialize some data, you should be using @dcoutts's upcoming CBOR work. The wire format is significantly better, and the implementation is way faster. That work may be included in the binary package, and it'll be exposed as a new type class plus functions to serialize/deserialize.

kolmodin avatar Jun 04 '15 18:06 kolmodin

A vector of doubles is not so bad. It's data structures containing doubles that are problematic: automatically derived instances will use the instance for Double. In the end I gave up and just gzipped everything after encoding.
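Something along these lines (a sketch using Codec.Compression.GZip from the zlib package):

```haskell
import Data.Binary (Binary, encode)
import qualified Codec.Compression.GZip as GZip   -- from the zlib package
import qualified Data.ByteString.Lazy as BL

-- The redundant (Integer, Int) Double encoding compresses well,
-- so gzipping the encoded stream recovers most of the wasted space.
encodeCompressed :: Binary a => a -> BL.ByteString
encodeCompressed = GZip.compress . encode
```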

Glad to hear about the CBOR work.

Shimuuar avatar Jun 04 '15 18:06 Shimuuar

Is there an estimate of when this will be fixed? If the CBOR approach will not be ready in the foreseeable future, then we should begin a deprecation cycle. First, introduce a better alternative encoding as an opt-in, and deprecate the current encoding by recommending that everyone use the new one. Then, after a time, make the new encoding the default and leave the old format supported as opt-in.

See also #64 - that should also be fixed in the new encoding.

Note that people are publicly advocating abandoning this library for these reasons.

ygale avatar Feb 09 '16 13:02 ygale

My understanding is that the CBOR work is rather close to being ready although @dcoutts may be able to say more.

bgamari avatar Feb 09 '16 13:02 bgamari

> My understanding is that the CBOR work is rather close to being ready although @dcoutts may be able to say more.

IMO, it's 'done' in the sense that it compiles and you can use it. It's not done in the sense that it's nowhere near feature parity with binary: it still needs Generic support, documentation (it has zero), more tests, more instances, handling of edge cases like this one, and further performance tuning (though it's already factors faster than binary). Given the nastiness of this issue, it would probably be better to err on the side of caution and get all of this right first, before it bites us again.

In other words: it's 80% done. So we only have to do the remaining 80% of the work.

(Edit: I say this since I've been working on improving this, but it's been slow and obviously I've been away the past few weeks)

thoughtpolice avatar Feb 10 '16 05:02 thoughtpolice

Then I think it would be a good idea to create tickets for the missing features. It would make clear what's missing for the library to be release-ready, and maybe someone would even submit a patch.

Shimuuar avatar Feb 10 '16 09:02 Shimuuar

As in #79, is there a roadmap for a new version of binary? The breaking change is indeed a valid concern, but this is exactly what the PVP was designed for, isn't it?

winterland1989 avatar Sep 12 '16 07:09 winterland1989

Data.Conduit.Serialization.Binary also makes use of this. I had a buffer full of 8-byte reals (double-precision floats) and it kept complaining about "not enough bytes" when I was sure the length was a multiple of 8!

flip111 avatar May 31 '17 20:05 flip111

> If your goal is just to serialize some data, you should be using @dcoutts's upcoming CBOR work.

Are you talking about this repository? https://github.com/well-typed/binary-serialise-cbor I couldn't find anything on dcoutts' personal repositories.

flip111 avatar May 31 '17 20:05 flip111

Yes, that's the correct one.

thoughtpolice avatar Jun 03 '17 21:06 thoughtpolice

Is somebody deciding whether or how to fix this problem? Not round-tripping properly is a bug, not a feature, for a serialization library, and if this issue is not fixed it should at least come with a proper warning in the documentation so users don't get unpleasant surprises.

Also, since binary is packaged with GHC, it is not simple to just put a fix in a GitHub fork and use it in one's project. At least it is not simple to do in a Stack project.

The fix for this bug is very simple, but is there inertia/analysis-paralysis about backward-compatibility? That could be the main issue but there is no discussion about this.

jchia avatar Jun 17 '22 02:06 jchia

> The fix for this bug is very simple, but is there inertia/analysis-paralysis about backward-compatibility? That could be the main issue but there is no discussion about this.

Indeed, the backwards compatibility concerns do not help matters. Specifically, the documentation is not at all clear on what compatibility contract the provided instances follow, and the nature of the serialisation format makes it essentially impossible to change the existing instances without breaking existing serialised data.

I think our two options here are either:

a. decide that compatibility isn't important, perform a major version bump, and fix this and any other known format issues, or
b. expose newtype Ieee{Float,Double} wrappers with sensible instances (roughly sketched below)
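Roughly, option (b) could look like this (a sketch only: IeeeDouble and its encoding are illustrative, not an existing binary API; assumes base >= 4.10):

```haskell
import Data.Binary (Binary (..))
import Data.Binary.Get (getWord64be)
import Data.Binary.Put (putWord64be)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- Hypothetical opt-in wrapper: always 8 bytes, raw IEEE-754 bits.
newtype IeeeDouble = IeeeDouble Double

instance Binary IeeeDouble where
  put (IeeeDouble d) = putWord64be (castDoubleToWord64 d)
  get = IeeeDouble . castWord64ToDouble <$> getWord64be
```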

bgamari avatar Jun 17 '22 18:06 bgamari

I personally am fine with Option a and prefer it over both Option b and doing nothing. However, if the concern is that existing user code with too-relaxed binary upper bounds could wrongly deserialise old data using the corrected Binary instances, here is an Option c that can partially address that concern, albeit in a more inconvenient way:

  1. Move old Binary instances of Float & Double out of Data.Binary into a new module (e.g. Data.Binary.Instances.Moribund).
  2. Define corrected Binary instances of Float & Double in another new module (e.g. Data.Binary.Instances.Incoming).
  3. Make a major-version-bump release. (Now all existing code that uses the instances will fail to build and will need to be fixed by importing the desired instance. This way, existing, unmodified user code will not accidentally deserialise old data using the new encoding.)
  4. Retire the old instances: After some time, remove the module with the old instances, make the new encoding the default requiring no additional import in addition to Data.Binary, and then make a major-version-bump release.

Meanwhile, document a policy for how the encoding may (or may not) change with respect to version numbers, document every encoding change in the changelog of each release, announce the stages of the plan, and include a clear warning in the Haddock not to mix the old and corrected instances.

Option b means that people have to use the newtype Ieee wrapper everywhere, because no unwrapped Float or Double is safe from the old encoding: Binary instances for records containing Float or Double can be generically derived and will end up using the old encoding, as illustrated below.
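For example (the record and field names here are made up), a generically derived instance silently reuses the library's Double encoding:

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Binary (Binary)
import GHC.Generics (Generic)

-- The derived instance delegates to the Binary Double instance,
-- so every such record inherits the old (Integer, Int) encoding
-- unless each field is wrapped in the Ieee newtype.
data Reading = Reading { temperature :: Double, pressure :: Double }
  deriving (Generic)

instance Binary Reading
```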

To me, both Option b and doing nothing look a bit like the opposite of avoiding success at all costs: they put backward compatibility for an unclear number of existing use cases ahead of correctness in perpetuity.

jchia avatar Jun 18 '22 02:06 jchia