jackson-databind icon indicating copy to clipboard operation
jackson-databind copied to clipboard

Add serialization feature: strict numbers

Open michaelhixson opened this issue 8 years ago • 22 comments

This is a proposal for a new serialization feature, SerializationFeature.STRICT_NUMBERS.

This feature would be disabled by default. With this feature enabled, the following actions would throw JsonProcessingException:

  1. Attempting to serialize +Infinity, -Infinity, or NaN
  2. Attempting to serialize a long that cannot be represented as a numerically-equivalent double

+Infinity, -Infinity, NaN

Non-finite numbers cannot be represented in JSON. Jackson's current behavior is to serialize them as strings. This "works" if Jackson is on both ends (serializing and deserializing) but not necessarily when a different actor is deserializing. In particular, if client-side JavaScript is deserializing, the current behavior masks a class of bugs that might be difficult to detect -- putting strings where the client expects numbers.

Checking whether a number is finite is extremely low-cost (Double.isFinite(x)) so this part of the feature should not have a noticeable performance burden.

Personally, I'm not intentionally serializing non-finite values anywhere and I'd prefer it if Jackson would throw an exception if I did so by mistake.

Longs that don't fit

Edit: I mistakenly called this a limitation of numbers in JSON, when it's actually about numbers in JavaScript.

Not all long values can be represented as 64-bit floating point numbers (double), which is the only number type in ~~JSON~~JavaScript. For example, casting these numbers to double in Java will result in loss of information:

  • 2^63 - 1
  • 2^53 + 1

All long values whose absolute value is <= 2^53 can be represented exactly as double, so there is a fast path for validation of most common long values. This will likely include all values that come from epoch millisecond timestamps or SQL auto-incrementing IDs. Randomly generated long values, however, would likely require the slow path (if there is a slow path) and be invalid.

All short and int values are smaller than 2^53, so this feature should add no performance burden for them.

Personally, I am accidentally serializing some long values that cannot be represented accurately as numbers in ~~JSON~~JavaScript. Woops! I didn't notice until I implemented the strict numbers feature for myself.

michaelhixson avatar Aug 29 '15 08:08 michaelhixson

Makes sense to me. The only question I have is whether the two should be bundled together; checks for non-numbers (inf, NaN) are quick to make. If the quick check for longs is all there is to it, that's not an issue I guess. But would there be need to check for rare (?) cases of larger long values that are, nonetheless, legal? If my memory of IEEE numbers serves me, as long as lowest bits are zeroes, scaling would actually allow accurate rendition of some of long numbers. But then again, perhaps quick check should suffice for the main purpose; if someone really wants more sophisticated checks, they could implement it themselves.

Now: beyond requirements, one open question would be where this would be implemented: in databind (by serializers), or at JsonGenerator. This is a trade-off, with some pros and cons for either:

If done at databind, all long/Long/AtomicLong/long[]serializers would need to do the check; andjackson-module-afterburner` would need to honor same checks if and when overriding handling. This is not a huge thing to do, so it may be the way to go.

Conversely, if done at streaming JsonGenerator, serializers would not need to care, and compatibility with Afterburner would work more reliably. But the major downside now is that every single dataformat module would need to apply the same checks, and this would lead to major copying of code (or coordination with helper methods).

Having written above, my first guess is that this should be handled at databind level.

cowtowncoder avatar Sep 08 '15 23:09 cowtowncoder

Re: performance of checking longs, I did some more digging and came up with a couple of implementations that are very fast. They're in the nanosecond range.

I actually have an open StackOverflow question about them: http://stackoverflow.com/questions/32411050/this-jmh-benchmark-is-inconsistent-across-machines-why

The two implementations:

public static boolean cast(long x) {
  return x == (long) (double) x
      && x != Long.MAX_VALUE;
}

public static boolean zeros(long x) {
  long a = Math.abs(x);
  int z = Long.numberOfLeadingZeros(a);
  return z > 10 || Long.numberOfTrailingZeros(a) > 10 - z;
}

Re: where this would go, I left a comment in Issue #912 that JsonGenerator seemed more natural. I came to the same conclusion here, but you have a much better understanding than I do of the potential "blast radius" of this code.

michaelhixson avatar Sep 09 '15 02:09 michaelhixson

NaN and ±Infinity should just be serialised as literal “null” instead. I’ve been searching for about half an hour for how to get Jackson to do that, i.e. behave as per the ECMAscript/JSON standard, already, with no success so far…

mirabilos avatar Aug 11 '16 12:08 mirabilos

@mirabilos What should be output is really debatable; null is one possibility but there are many others (throw exception; just output non-standard variant). Challenge with coercion to null is that this can hide legitimate problems.

On short term I would recommend you implement and register custom serializers for types you need (double, float, Double, Float) and add expected handling.

cowtowncoder avatar Aug 11 '16 18:08 cowtowncoder

@michaelhixson Re-reading the issue again, I believe there should be multiple separate features, possibly:

  • One for allowing (or not) of full 64-bit longs that do not with in IEEE double: this to support Javasacript use case and compatibility.
    • If attempt to write illegal number made, would throw generation exception indicating problem
    • Feature has been requested by others as well.
  • One for allowing (or not) writing of NaN numbers
    • If attempt to write illegal number made, would throw generation exception indicating problem
  • One for coercion of illegal numbers into null (as per @mirabilos suggestion): if enabled AND "can write NaN" enabled, would write JSON null instead of serialization as number (quoted or not).

Also: I think these should be implemented as JsonGenerator.Features, mostly to keep serializers simple. There will be some complications -- for example, coerced null will not be filtered out with JsonInclude.Include.NON_NULL, since Java value is not null, only serialization -- but overall I think that would be more reliable way to ensure that constraints are enforced.

cowtowncoder avatar Aug 11 '16 18:08 cowtowncoder

Tatu Saloranta dixit:

@mirabilos What should be output is really debatable; null is one @possibility but there are many others (throw exception; just output @non-standard variant). Challenge with coercion to null is that this @can hide legitimate problems.

Yes, but the ECMAscript JSON standard requires it, and as such, it ought to be the default, or at the absolutely very least, an easily configurable setting.

On short term I would recommend you implement and register custom serializers for types you need (double, float, Double, Float) and add expected handling.

That’s somewhat beyond me at the moment… Jackson is only used as part of jaxrs.

We ended up changing “double” to “Double” and doing something like:

dto.setFoo(Double.isFinite(foo) ? foo : null);

bye,

//mirabilos

“It is inappropriate to require that a time represented as seconds since the Epoch precisely represent the number of seconds between the referenced time and the Epoch.” -- IEEE Std 1003.1b-1993 (POSIX) Section B.2.2.2

mirabilos avatar Aug 11 '16 18:08 mirabilos

@mirabilos That works too if you can control POJOs.

Anyway, I agree with the notion that there should be a way to avoid producing non-standard output. In other respects Jackson does this and requires enabling specific features to force non-standard output.

cowtowncoder avatar Aug 11 '16 18:08 cowtowncoder

@cowtowncoder Multiple features sounds good to me.

Also: I think these should be implemented as JsonGenerator.Feature

So for the NaN = exception one, that would involve changes in this file, right? https://github.com/FasterXML/jackson-core/blob/12cea0336d6a71c5b8f843a3a500c0d72bc5cf2a/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java#L858-L866

In other words, I filed this issue in the wrong Jackson sub-project? It's really a jackson-core issue?

Regarding the longs fitting in doubles... I'm not as sure as I used to be about the solution. I actually wrote a blog post about this exact problem and ways to address it. Having an option to encode all longs as strings might be a better solution (not all numbers, just longs specifically). In any case, I think I'd like to withdraw that feature request.

michaelhixson avatar Aug 12 '16 05:08 michaelhixson

Any progress here? I'm quite interested in the "support the JSON specification" model of casting NaN into null.

ECMAScript specification for JSON.stringify states

NOTE4: Finite numbers are stringified as if by calling ToString(number). NaN and Infiinity regardless of sign are represented as the String null.

drekbour avatar Dec 18 '17 14:12 drekbour

@drekbour The usual practice is that if there is work being done, notes are added to the issue. Lack of notes suggests (correctly) that nothing has been done. I am not aware of anyone working on this; but if it was to be added, it would need to go in Jackson 3.0 as it is API addition.

As to handling of NaN: I would not consider Javascript spec to be canonical JSON Specification in any sense of the word, but it sounds like useful background information on considering possible ways to handling thigs. However, I would think that serialization of a number as String value would be quite confusing and creating perhaps more problems than it solves, at least in Java context. This is different from Javascript (and many other scripting langs) where coercions often exist to allow more seamless change across types..

cowtowncoder avatar Dec 19 '17 01:12 cowtowncoder

We came across this because the sender is serialising a double and a receiver deserialising the same as a BigDecimal. Is you point out, the string "NaN" instead of a primitive double is not helpful

drekbour avatar Dec 23 '17 10:12 drekbour

One other note, regarding requirements: what ECMAscript does is not definitive in any sense of the word: Javascript is simply one of many, many languages/platforms that use JSON, with no special stature amongst them, except for its wide usage. I do not, essentially, care deeply whether it has trouble with 64-bit longa (as in: it does -- that's platform limitation, not JSON limitation).

That said, I am open for additional features to help sub-standard platforms, such as ECMAScript, that are unfortunately "number-challenged". But they will not and should not define general limits since majority of languages/platforms are able to deal with larger sets of floating-point AND integer numbers; some up to unlimited (Java's BigInteger / BigDecimal, as just one example).

So: if some limits to handling of 64-bit longs are proposed, a separate request for Features to add should be filed.

One other implementation idea: instead of modifying jackson-databind (or streaming API), perhaps better way would be to go with a full Module (jackson-lang-javascript project?). Module would then register custom (de)serializers for relevant types; and even if it might require smaller changes to databind (or not, hard to say in advance), it would be more flexible in its dealings. It would also not affect other usage but just apply to use cases where Javascript client / server compatibility is very important.

cowtowncoder avatar Jan 15 '18 21:01 cowtowncoder

“what ECMAscript does is not definitive in any sense of the word”

This is not quite right. JSON is a proper subset of ECMAscript except that it allows two more chars in string literals (which all modern encoders just emit as Unicode escapes, for JS compatibility), but JSON is defined in the ECMAscript standard, in a normative annex, and this is the entire and exact JSON specification. All other implementations must honour what is written there; only for things not specified there, they may diverge.

mirabilos avatar Apr 05 '18 13:04 mirabilos

No, JSON has no real dependency to Javascript. While ECMAScript folks try to define their understanding of JSON, that is no more definitive than the original RFC; and limitations of ECMAScript certainly have no relevancy to its use outside of that platform.

So while it is true that ECMAScript specs also define JSON, that does not somehow make JSON subset or subservient to it. To me it means more that they specify handling details in that context. And on Java platform the only concern would be wrt inter-operability.

I mean Oracle could of course similarly specify their understanding of JSON, but it would be foolish to expect others to start considering that somehow general definitive view of JSON and limitations. I would not be pushing that definition on issues for, say, node.js, except if there are some aspects of interoperability that could be improved.

cowtowncoder avatar Apr 09 '18 04:04 cowtowncoder

On Sun, 8 Apr 2018, Tatu Saloranta wrote:

So while it is true that ECMAScript specs also define JSON, that does not somehow make JSON subset

Agreed, except for the part in which the ECMAscript spec defines JSON in terms of ECMAscript (some parts of it; others, such as the string definition, aren’t).

mirabilos avatar Apr 09 '18 10:04 mirabilos

@mirabilos Ok yes, agreed. Its definition is not limited to ECMAScript in that sense, or include concepts only found there.

cowtowncoder avatar Apr 09 '18 14:04 cowtowncoder

NaN and ±Infinity should just be serialised as literal “null” instead. I’ve been searching for about half an hour for how to get Jackson to do that, i.e. behave as per the ECMAscript/JSON standard, already, with no success so far…

ended up using Annotation JsonComponent to do the customization

zinking avatar Apr 15 '20 15:04 zinking

Custom serializer is the way to go for different representations of NaNs of various kinds. Just remember to register for both primitive double type (Double.TYPE) and wrapper Double type (Double.class).

cowtowncoder avatar Apr 15 '20 16:04 cowtowncoder

About,

Not all long values can be represented as 64-bit floating point numbers (double), which is the only number type in JavaScript.

Say like this, it sounds this is a language specific issue and I could understand that this is maybe not the Job of a JSON library to add specific feature for every language.

But reading The JavaScript Object Notation (JSON) Data Interchange Format - RFC8259 § 6. Numbers

This specification allows implementations to set limits on the range and precision of numbers accepted. Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision. A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

So maybe we can read it into "To maximize interoperability we advice to limit range and precision of numbers to IEEE 754 binary64 (double precision)"

Not directly linked, but note that also SenML-CBOR has this kind of advice :

For JSON Numbers, the CBOR representation can use integers, floating-point numbers, or decimal fractions (CBOR Tag 4); however, a representation SHOULD be chosen such that when the CBOR value is converted to an IEEE double-precision, floating-point value, it has exactly the same value as the original JSON Number converted to that form. ... .... (source : https://www.rfc-editor.org/rfc/rfc8428#section-6)

Just to say that maybe we could see that more like "a common way to maximize interoperability" rather than "a specific language limitation" ?

sbernard31 avatar Sep 30 '22 09:09 sbernard31

I don't think the main issue here is wordsmithing. What help is changing wording here?

cowtowncoder avatar Oct 01 '22 00:10 cowtowncoder

What help is changing wording here?

I'm sorry if my comment was not helpful enough. I vaguely understand there was questions about where to put code about those features in jackson. Some jackson-lang-javascript module idea was proposed. I don't know jackson code organization enough to give concrete help about that ... but I was just thinking that knowing if a feature is "pure JSON" or a "specific language" feature could help to know where the code should be put. And my point was the strict IEEE double-precision limitation for numbers is maybe more a pure JSON feature.

sbernard31 avatar Oct 03 '22 08:10 sbernard31

Ah ok @sbernard31 that makes sense. I agree. I think I misread your comment there, thank you for clarification.

cowtowncoder avatar Oct 03 '22 23:10 cowtowncoder