jackson-databind

Suggestion: Support value deduplication for enumeration like values

Open CodingFabian opened this issue 4 years ago • 5 comments

There are many use cases where the JSON to be deserialized contains pseudo-enumeration values as strings. Take a look at these examples:

{ "type": "cat", "name": "Toby"} 

or

{ "categories": ["food", "vegetables", "green"], "name": "beans" }

If you build anything that keeps the deserialized Java objects in memory, you end up holding multiple copies of the same string. What you can do manually is deduplicate the values after you have deserialized the object:

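  import java.util.concurrent.ConcurrentHashMap;

  // Pool of canonical instances: every equal string maps to one shared object.
  private static final ConcurrentHashMap<String, String> values = new ConcurrentHashMap<>();

  public static String deduplicateValue(String value) {
    return values.computeIfAbsent(value, v -> v);
  }

  // Requires a setter, so the field cannot be final.
  public static void deduplicateAnimal(Animal a) {
    a.setType(deduplicateValue(a.getType()));
  }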

It would be really sweet if Jackson could be told that specific fields contain a limited number of enumeration-like values, so that it deduplicates those strings automatically.

We have processes where this saves gigabytes of memory and thus a lot of money too.

An alternative solution would be to use the string deduplication feature of G1 GC (-XX:+UseStringDeduplication), but that might not be available. It also comes with some extra cost.

CodingFabian Jun 22 '21 08:06

I think something like this would be best implemented as a general-purpose post-processing extension; perhaps with one or two out-of-the-box implementations?

cowtowncoder Jun 22 '21 16:06

The downside of post-processing is that the field needs to be mutable. Ideally, when my Animal is constructed, the type String argument is already deduplicated, so the field can be final.
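To illustrate what I mean, here is a minimal plain-Java sketch with no Jackson involved. `StringCanonicalizer` is a hypothetical helper name, and the `Animal` shape is assumed from the example above. The point is only that the value is canonicalized before it is assigned, so the field can stay final:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper (not part of Jackson): maps every equal string
// to one shared instance.
final class StringCanonicalizer {
    private static final ConcurrentMap<String, String> POOL = new ConcurrentHashMap<>();

    private StringCanonicalizer() {}

    static String canonicalize(String value) {
        if (value == null) {
            return null;
        }
        // putIfAbsent returns the previously stored instance, or null if
        // this call stored `value` as the canonical copy.
        String existing = POOL.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }
}

// Assumed shape of the Animal type from the issue, made immutable.
final class Animal {
    private final String type; // enumeration-like: worth deduplicating
    private final String name; // mostly unique: left alone

    Animal(String type, String name) {
        this.type = StringCanonicalizer.canonicalize(type);
        this.name = name;
    }

    String getType() { return type; }
    String getName() { return name; }
}
```

With this, two animals built from separately allocated "cat" strings share a single `type` instance. Note that the pool never evicts entries, so it only suits fields with a limited set of values.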

CodingFabian Jun 22 '21 16:06

@CodingFabian Yes, if the post-processing applies to the containing POJO type, although I was thinking of post-processing the actual String value. In that case the String would be allocated but then canonicalized before it is assigned. So no mutability would be required.

But at any rate, I can see benefits of some support for canonicalization; it may be useful for Long (which has a small JDK-level reuse range already) and perhaps Double and BigDecimal.
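As a rough sketch of what canonicalization beyond String could look like (a hypothetical `Canonicalizer` class for illustration only, not an existing or planned Jackson API), a small generic pool keyed on `equals`/`hashCode` would cover those types:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical generic canonicalizer: returns one shared instance per
// distinct value, where "distinct" is defined by equals()/hashCode().
final class Canonicalizer<T> {
    private final ConcurrentMap<T, T> pool = new ConcurrentHashMap<>();

    T canonicalize(T value) {
        if (value == null) {
            return null;
        }
        // First caller stores its instance; later callers get that one back.
        T existing = pool.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    // Number of distinct values currently held.
    int size() {
        return pool.size();
    }
}
```

One caveat with this approach: `BigDecimal.equals` takes scale into account, so `1.0` and `1.00` would be kept as two separate canonical instances unless values are normalized first.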

cowtowncoder Jun 24 '21 01:06

Makes sense to me. I am wondering, however, whether it could be a generic canonicalization, or whether it depends on the use case and would be better implemented as part of type-specific deserialization. I lean toward the latter, because the developer usually knows which fields contain values that make sense to deduplicate and which are mostly random / unique values where this would be a waste of time.

CodingFabian Jun 24 '21 06:06

I think it should be possible to define this both for types (to apply to all properties of a given declared type) and for specific properties -- especially for something like Strings. Most new configurability options are designed in a way to allow that.

cowtowncoder Jun 25 '21 05:06