Mechanism to filter/transform JSON primitives during decoding

Open pschichtel opened this issue 7 months ago • 11 comments

What is your use-case and why do you need this feature?

I have a web application and I would like to reject strings that contain control characters (\u0000 to \u001F excluding whitespace) while decoding. A solution would be to first parse the JSON into a JsonElement, analyse that and then decode the JsonElement into the target data structure, but that incurs a lot more time and memory overhead.
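For reference, the two-pass workaround mentioned above could look roughly like the following sketch. The names `requireNoControlChars` and `decodeChecked` are hypothetical helpers, not part of kotlinx.serialization, and the check reuses the predicate from the snippet posted later in this thread. The whole `JsonElement` tree is held in memory before the real decoding starts, which is the overhead in question.

```kotlin
import kotlinx.serialization.DeserializationStrategy
import kotlinx.serialization.SerializationException
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonArray
import kotlinx.serialization.json.JsonElement
import kotlinx.serialization.json.JsonObject
import kotlinx.serialization.json.JsonPrimitive

// Hypothetical helper: true if the string holds a non-whitespace control character.
fun hasIllegalControlChar(s: String): Boolean =
    s.any { it.isISOControl() && !it.isWhitespace() }

// Hypothetical helper: walks the parsed tree and rejects offending keys and string values.
fun JsonElement.requireNoControlChars() {
    when (this) {
        is JsonObject -> for ((key, value) in this) {
            if (hasIllegalControlChar(key)) {
                throw SerializationException("Illegal control character in key!")
            }
            value.requireNoControlChars()
        }
        is JsonArray -> for (item in this) item.requireNoControlChars()
        // JsonNull is a JsonPrimitive with isString == false, so it passes.
        is JsonPrimitive -> if (isString && hasIllegalControlChar(content)) {
            throw SerializationException("Illegal control character in string!")
        }
    }
}

// Hypothetical helper: pass 1 builds the tree, pass 2 decodes from it.
fun <T> Json.decodeChecked(strategy: DeserializationStrategy<T>, text: String): T {
    val tree = parseToJsonElement(text)
    tree.requireNoControlChars()
    return decodeFromJsonElement(strategy, tree)
}
```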

Describe the solution you'd like

I had three ideas on how this could be approached:

  1. Introduce some form of "primitive listener" interface that can be configured on the Json instance. It would be invoked for each primitive and could return the same primitive, return a new one, or throw an exception.
  2. Allow registering custom KSerializers for the primitive types, which would be called when, e.g., other KSerializers invoke Decoder.decodeString().
  3. Allow wrapping the Decoder passed into KSerializers.
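
To make the first idea concrete, here is a purely illustrative sketch of what such a listener could look like if it were limited to strings. `StringPrimitiveListener` is a hypothetical interface; nothing like it exists in kotlinx.serialization, and how it would be registered on a Json instance is left open.

```kotlin
// Hypothetical interface: NOT part of kotlinx.serialization.
fun interface StringPrimitiveListener {
    // Return the value unchanged, return a replacement, or throw to reject the document.
    fun onString(value: String): String
}

// A filtering listener: rejects strings with non-whitespace control characters.
val rejectControlChars = StringPrimitiveListener { value ->
    require(value.none { it.isISOControl() && !it.isWhitespace() }) {
        "Illegal control character in string!"
    }
    value
}

// A transforming listener: returns a new primitive instead of the original one.
val trimStrings = StringPrimitiveListener { it.trim() }
```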

pschichtel avatar May 16 '25 12:05 pschichtel

I do not think this is feasible with the current architecture, because Decoder.decodeString() was designed to get a primitive with minimal overhead, so it doesn't use KSerializer or any listeners. The only alternative I can imagine is to write your own implementation of Json with all the required extension points and delegate to the kotlinx.serialization one where possible. Or, make an alternative KSerializer for String and use it globally via a typealias.

sandwwraith avatar May 19 '25 10:05 sandwwraith

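import kotlinx.serialization.DeserializationStrategy
import kotlinx.serialization.SealedSerializationApi
import kotlinx.serialization.Serializable
import kotlinx.serialization.SerializationException
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.json.Json
import kotlinx.serialization.json.JsonDecoder
import kotlinx.serialization.serializer
import org.intellij.lang.annotations.Language

class StringValidatingDeserializationStrategy<T>(private val realDeserializationStrategy: DeserializationStrategy<T>) : DeserializationStrategy<T> {
    override val descriptor: SerialDescriptor = realDeserializationStrategy.descriptor

    override fun deserialize(decoder: Decoder): T {
        // Keep the JsonDecoder type where possible so downstream type checks still pass.
        val wrappedDecoder = if (decoder is JsonDecoder) {
            StringValidatingJsonDecoder(decoder)
        } else {
            StringValidatingDecoder(decoder)
        }
        return realDeserializationStrategy.deserialize(wrappedDecoder)
    }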

    class StringValidatingDecoder(private val realDecoder: Decoder) : Decoder by realDecoder {
        override fun decodeString() = realDecoder.safeDecodeString()
    }

    @OptIn(SealedSerializationApi::class)
    class StringValidatingJsonDecoder(private val realDecoder: JsonDecoder) : JsonDecoder by realDecoder {
        override fun decodeString() = realDecoder.safeDecodeString()
    }

    private companion object {
        private fun Decoder.safeDecodeString(): String {
            return decodeString().also { s ->
                if (s.any { it.isISOControl() && !it.isWhitespace() }) {
                    throw SerializationException("Illegal control character in string!")
                }
            }
        }
    }
}

inline fun <reified T> Json.safeDecodeFromString(@Language("json") s: String): T {
    return decodeFromString(StringValidatingDeserializationStrategy(serializer()), s)
}

@Serializable
data class TestClass(val test: String)

val a = run {
    val j = Json { }

    j.safeDecodeFromString<TestClass>("""{"test": "\u0000"}""")
}

Would something like that be viable, or do deserializers perform type checks on the decoder?

pschichtel avatar May 19 '25 10:05 pschichtel

@pschichtel You seem to confuse decoders and deserialization strategies. A decoder is independent of the data types/serializers and is responsible for parsing a document and using the deserialization strategy (serializers) to create object instances. (De)serialization strategies are specific to a type and are responsible for (de)serializing those through appropriate invocations of the decoder (or encoder) when called.

As to validation, the easiest way is to have a custom "ValidatingStringSerializer" that adds a validation step to deserialization. The alternative (that doesn't require a separate serializer) is to have a format that does this for you, even for the standard serializer. But there you need to keep in mind that various intermediate Decoders and CompositeDecoders might be created (and need wrapping) - you didn't do this in StringValidatingDecoder.

pdvrieze avatar May 19 '25 12:05 pdvrieze

You seem to confuse decoders and deserialization strategies. A decoder is independent of the data types/serializers and is responsible for parsing a document and using the deserialization strategy (serializers) to create object instances. (De)serialization strategies are specific to a type and are responsible for (de)serializing those through appropriate invocations of the decoder (or encoder) when called.

Yeah, I'm not confusing that. I have to intercept the decodeString() call on the decoder here, but in order to do that I have to intercept the DeserializationStrategies so that I can wrap the Decoder with my extra logic.

As to validation, the easiest way is to have a custom "ValidatingStringSerializer" that adds a validation step to deserialization.

What would that look like? Do you mean having a value class ValidatingString(val value: String) that has a custom KSerializer that does this? I'd prefer not having to replace all my strings with a custom type. I also don't want to litter my codebase with @Serializable(with = MyStringSerializer::class) on every string, only to forget it in some places.

But there you need to keep in mind that various intermediate Decoders and CompositeDecoders might be created (and need wrapping) - you didn't do this in StringValidatingDecoder.

Can you elaborate on this? Or do you think this is infeasible?

pschichtel avatar May 19 '25 12:05 pschichtel

I've updated the snippet to wrap the decoder as a JsonDecoder if it was one, to support potential type checks downstream.

I don't currently see any other extension points I could use to support this. I can't just reimplement the Json.decodeFromString method (and friends), because all of the things it requires are marked internal. If intermediate decoders are being created during deserialization, then I could also recursively wrap all deserialization strategies by also intercepting decodeSerializableValue (and friends).

pschichtel avatar May 19 '25 13:05 pschichtel

@pschichtel If you have a StringValidatingDeserializationStrategy, it would basically be a serialization strategy that almost mirrors the StringSerializer. To decode, that strategy would just call decoder.decodeString(), and the format gets you the string. Then you can add some validation before you return the string as the result of parsing. Encoding could skip the validation and just call encoder.encodeString(value) directly.

What this does not do is intercept Strings that are not annotated to use this serializer. If you want to support those, you want to change the decoder instead (but don't need to bother with a DeserializationStrategy - you may want to have a "ValidatingJson" format though that wraps Json but then does validation on all strings).
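
A minimal sketch of the annotated-serializer option described above, under a few assumptions: the serial name, the `ValidatedString` typealias, and the `Comment` class are hypothetical names chosen for illustration, and the check is the one from the snippet earlier in this thread. The typealias follows the "use it globally via typealias" suggestion made earlier, and the `Comment` class needs the serialization compiler plugin, as any `@Serializable` class does.

```kotlin
import kotlinx.serialization.KSerializer
import kotlinx.serialization.Serializable
import kotlinx.serialization.SerializationException
import kotlinx.serialization.descriptors.PrimitiveKind
import kotlinx.serialization.descriptors.PrimitiveSerialDescriptor
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.encoding.Encoder
import kotlinx.serialization.json.Json

object ValidatingStringSerializer : KSerializer<String> {
    // Hypothetical serial name; still a plain STRING primitive on the wire.
    override val descriptor: SerialDescriptor =
        PrimitiveSerialDescriptor("example.ValidatedString", PrimitiveKind.STRING)

    // Encoding skips validation and writes the string directly.
    override fun serialize(encoder: Encoder, value: String) = encoder.encodeString(value)

    // Decoding lets the format produce the string, then validates it.
    override fun deserialize(decoder: Decoder): String {
        val s = decoder.decodeString()
        if (s.any { it.isISOControl() && !it.isWhitespace() }) {
            throw SerializationException("Illegal control character in string!")
        }
        return s
    }
}

// Hypothetical alias: only properties declared with this type are validated.
typealias ValidatedString = @Serializable(with = ValidatingStringSerializer::class) String

@Serializable
data class Comment(val author: String, val body: ValidatedString)
```

In this sketch `Comment.author` is a plain String and is therefore not checked, which is exactly the limitation described above.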

pdvrieze avatar May 19 '25 18:05 pdvrieze

Yeah, the latter is what I want and I would be perfectly fine with that, but I'm not clear on how I would set up such a ValidatingJson without reimplementing significant parts of Json. Are there examples of something similar? Or could you provide some pseudocode?

pschichtel avatar May 19 '25 18:05 pschichtel

Just to clarify https://github.com/Kotlin/kotlinx.serialization/issues/3004#issuecomment-2890520135, because I think there is some misunderstanding:

The intention there was always to customize the Decoder, not the DeserializationStrategy. I'm just wrapping/intercepting the "root" DeserializationStrategy to get access to the Decoder in order to have that wrapped/intercepted for all downstream DeserializationStrategies.

pschichtel avatar May 19 '25 18:05 pschichtel

@pschichtel Basically you will need a wrapping class for both Decoder and CompositeDecoder (the two could perhaps be combined). Then you must override Decoder.beginStructure to wrap the delegated CompositeDecoder. Then you must implement the decodeSerializableElement function with a wrapping serializer that can call the delegate with a wrapped Decoder. Of course you also need to adjust decodeString and decodeStringElement.
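
The steps above can be sketched roughly as follows. This is an untested-in-production illustration, not an official recipe: `WrappingStrategy`, `ValidatingDecoder`, `ValidatingCompositeDecoder`, and `decodeValidated` are hypothetical names. The wrapper is a plain Decoder, so serializers that expect a JsonDecoder (such as the JsonElement serializers) would likely reject it, and Json's class-discriminator polymorphism may not recognise a wrapped polymorphic serializer; neither case is handled here.

```kotlin
import kotlinx.serialization.DeserializationStrategy
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.SerializationException
import kotlinx.serialization.builtins.ListSerializer
import kotlinx.serialization.builtins.serializer
import kotlinx.serialization.descriptors.SerialDescriptor
import kotlinx.serialization.encoding.CompositeDecoder
import kotlinx.serialization.encoding.Decoder
import kotlinx.serialization.json.Json

// Same check as in the snippet earlier in this thread.
private fun validated(s: String): String {
    if (s.any { it.isISOControl() && !it.isWhitespace() }) {
        throw SerializationException("Illegal control character in string!")
    }
    return s
}

// Wrapping serializer: captures the Decoder the format hands over and wraps it.
class WrappingStrategy<T>(private val delegate: DeserializationStrategy<T>) : DeserializationStrategy<T> {
    override val descriptor: SerialDescriptor get() = delegate.descriptor
    override fun deserialize(decoder: Decoder): T =
        delegate.deserialize(decoder as? ValidatingDecoder ?: ValidatingDecoder(decoder))
}

@OptIn(ExperimentalSerializationApi::class)
class ValidatingDecoder(private val delegate: Decoder) : Decoder by delegate {
    override fun decodeString(): String = validated(delegate.decodeString())
    override fun decodeInline(descriptor: SerialDescriptor): Decoder =
        ValidatingDecoder(delegate.decodeInline(descriptor))
    // Wrap the CompositeDecoder the real decoder creates.
    override fun beginStructure(descriptor: SerialDescriptor): CompositeDecoder =
        ValidatingCompositeDecoder(delegate.beginStructure(descriptor))
    override fun <T> decodeSerializableValue(deserializer: DeserializationStrategy<T>): T =
        delegate.decodeSerializableValue(WrappingStrategy(deserializer))
    override fun <T : Any> decodeNullableSerializableValue(deserializer: DeserializationStrategy<T?>): T? =
        delegate.decodeNullableSerializableValue(WrappingStrategy(deserializer))
}

@OptIn(ExperimentalSerializationApi::class)
class ValidatingCompositeDecoder(private val delegate: CompositeDecoder) : CompositeDecoder by delegate {
    override fun decodeStringElement(descriptor: SerialDescriptor, index: Int): String =
        validated(delegate.decodeStringElement(descriptor, index))
    override fun decodeInlineElement(descriptor: SerialDescriptor, index: Int): Decoder =
        ValidatingDecoder(delegate.decodeInlineElement(descriptor, index))
    // Nested values: pass a wrapping serializer so the nested Decoder gets wrapped too.
    override fun <T> decodeSerializableElement(
        descriptor: SerialDescriptor, index: Int,
        deserializer: DeserializationStrategy<T>, previousValue: T?
    ): T = delegate.decodeSerializableElement(descriptor, index, WrappingStrategy(deserializer), previousValue)
    override fun <T : Any> decodeNullableSerializableElement(
        descriptor: SerialDescriptor, index: Int,
        deserializer: DeserializationStrategy<T?>, previousValue: T?
    ): T? = delegate.decodeNullableSerializableElement(descriptor, index, WrappingStrategy(deserializer), previousValue)
}

// Hypothetical entry point: wrap only the root strategy; nested ones are wrapped on the way down.
fun <T> Json.decodeValidated(strategy: DeserializationStrategy<T>, text: String): T =
    decodeFromString(WrappingStrategy(strategy), text)

// Usage sketch: a list of strings needs no generated serializer.
val stringList = ListSerializer(String.serializer())
```

With these definitions, `Json.decodeValidated(stringList, text)` is expected to throw a SerializationException as soon as any element contains a non-whitespace control character, without building an intermediate JsonElement tree.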

pdvrieze avatar May 19 '25 19:05 pdvrieze

@pdvrieze wouldn't the CompositeDecoder eventually call into my wrapped Decoder anyway?

pschichtel avatar May 24 '25 20:05 pschichtel

@pdvrieze wouldn't the CompositeDecoder eventually call into my wrapped Decoder anyway?

What happens is that in decodeSerializableElement a decoder is created/retrieved (in some cases it can be the same unwrapped composite decoder). The thing is that you need to capture and wrap that decoder - using the wrapping serializer does that.

pdvrieze avatar May 24 '25 20:05 pdvrieze