jackson-databind
Suggestion: Support value deduplication for enumeration-like values
There are many use cases where the JSON to be deserialized contains pseudo-enumeration values as strings. Take a look at these examples:
{ "type": "cat", "name": "Toby"}
or
{ "categories": ["food", "vegetables", "green"], "name": "beans" }
If you build anything that keeps the deserialized Java objects in memory, you end up holding multiple copies of the same string. What you can then do manually is deduplicate the values after you have deserialized the object:
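import java.util.concurrent.ConcurrentHashMap;

// Pool of canonical String instances, keyed by their own value.
private static final ConcurrentHashMap<String, String> values = new ConcurrentHashMap<>();

// Returns the pooled instance equal to value, adding it on first sight.
public static String deduplicateValue(String value) {
    return values.computeIfAbsent(value, v -> v);
}

public static void deduplicateAnimal(Animal a) {
    a.setType(deduplicateValue(a.getType()));
}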
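To make the effect of the pooling visible, here is a minimal, stdlib-only sketch. The class name `StringPoolDemo` and the null handling are assumptions added for illustration, and `new String(...)` merely stands in for the fresh copies a parser allocates for each document.

```java
import java.util.concurrent.ConcurrentHashMap;

final class StringPoolDemo {
    // Pool of canonical instances (illustrative name, not part of Jackson).
    private static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

    static String deduplicate(String value) {
        // ConcurrentHashMap rejects null keys, so pass null through untouched.
        return value == null ? null : POOL.computeIfAbsent(value, v -> v);
    }

    public static void main(String[] args) {
        // Two equal strings that are distinct objects, as after parsing two documents.
        String first = new String("cat");
        String second = new String("cat");

        System.out.println(first == second);                           // false: two copies
        System.out.println(deduplicate(first) == deduplicate(second)); // true: one shared copy
    }
}
```

Once both references point at the pooled instance, the second copy becomes garbage-collectable, which is where the memory saving comes from.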
It would be really sweet if Jackson could be told that specific fields contain a limited number of enumeration-like values, and would then deduplicate those strings automatically.
We have processes where this saves gigabytes of memory and thus a lot of money too.
An alternative solution would be to use the string deduplication feature of the G1 garbage collector (-XX:+UseStringDeduplication), but that might not be available. It also comes with some extra cost.
I think something like this would be best implemented as a general-purpose post-processing extension; perhaps with one or two out-of-the-box implementations?
The downside of post-processing is that the field needs to be mutable. Ideally, when my Animal is constructed, the type String argument is already deduplicated, so the field can be final.
@CodingFabian Yes, if post-processing is for the containing POJO type, although I was thinking of post-processing the actual String value. In that case the String would be allocated but then canonicalized before it was assigned, so no mutability would be required.
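The value-level approach can be sketched without any Jackson wiring. In the sketch below, `ImmutableAnimal` and its `of` factory are hypothetical names; the point is only that the string is canonicalized before the constructor runs, so the field can stay final. With Jackson, a static factory like this could presumably serve as the creator (for example via `@JsonCreator`), but that wiring is left out here.

```java
import java.util.concurrent.ConcurrentHashMap;

final class ImmutableAnimal {
    // Pool for the enumeration-like "type" values only (illustrative).
    private static final ConcurrentHashMap<String, String> TYPES = new ConcurrentHashMap<>();

    private final String type;
    private final String name;

    private ImmutableAnimal(String type, String name) {
        this.type = type;
        this.name = name;
    }

    // Hypothetical factory: canonicalizes the value before assignment,
    // so no setter and no mutable field are needed.
    static ImmutableAnimal of(String type, String name) {
        String canonical = type == null ? null : TYPES.computeIfAbsent(type, t -> t);
        return new ImmutableAnimal(canonical, name);
    }

    String getType() { return type; }
    String getName() { return name; }
}
```

Only `type` goes through the pool. `name` is left alone, since pooling mostly unique values would grow the map without saving anything.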
But at any rate, I can see benefits of some support for canonicalization; it may be useful for Long (which has a small JDK-level reuse range already) and perhaps Double and BigDecimal.
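As a rough illustration of how the same idea extends beyond String, here is a small generic sketch. `Canonicalizer` is a hypothetical name, not an existing Jackson type. It relies on `equals`/`hashCode`, so for `BigDecimal` values with different scales (1.5 vs. 1.50) remain separate entries. The `Long.valueOf` cache for -128 to 127 is the JDK-level reuse range mentioned above.

```java
import java.math.BigDecimal;
import java.util.concurrent.ConcurrentHashMap;

final class Canonicalizer<T> {
    // One pool per canonicalizer; entries are matched by equals/hashCode.
    private final ConcurrentHashMap<T, T> pool = new ConcurrentHashMap<>();

    T canonicalize(T value) {
        // Null keys are not allowed in ConcurrentHashMap, so pass null through.
        return value == null ? null : pool.computeIfAbsent(value, v -> v);
    }

    public static void main(String[] args) {
        // Long.valueOf is specified to cache -128..127, so these are already shared.
        System.out.println(Long.valueOf(100L) == Long.valueOf(100L)); // true

        Canonicalizer<BigDecimal> decimals = new Canonicalizer<>();
        BigDecimal a = new BigDecimal("1.50");
        BigDecimal b = new BigDecimal("1.50");
        System.out.println(a == b);                                               // false
        System.out.println(decimals.canonicalize(a) == decimals.canonicalize(b)); // true

        // BigDecimal.equals compares scale too, so 1.5 is pooled separately from 1.50.
        System.out.println(decimals.canonicalize(new BigDecimal("1.5")).scale()); // 1
    }
}
```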
Makes sense to me. I am wondering, however, whether it could be a generic canonicalization, or whether it depends on the use case and would be better implemented as part of type-specific deserialization. I lean toward the latter, because the developer usually knows which fields contain values that are worth deduplicating and which are mostly random or unique values, where this would be a waste of time.
I think it should be possible to define this both for types (to apply to all properties of a given declared type) and for specific properties -- especially for something like Strings. Most new configurability options are designed in a way to allow that.