apischema icon indicating copy to clipboard operation
apischema copied to clipboard

Should serialization check type by default?

Open wyfo opened this issue 2 years ago • 6 comments

With Cython, performance penality caused by serialization type checking is less an issue. apischema is already fast enough to afford this feature as default behavior, as it will stay a lot faster than competitors.

With serialization type checking by default, disabling it could then become an option to improve performance.

wyfo avatar Oct 31 '21 15:10 wyfo

Hi @wyfo,

Data classes direct serialization is done by orjson like a charm. Type checking is important at some earlier stages.

  • In most cases, API responses don't have validation of values but should check typical mistakes, like an assigned value type, because this is handy for the development process.
  • In most cases, response payload size is much larger than request, so looks like this should be a focus.

Please check how just Pydantic's type-checking impacts performance in real-like scenarios. In this benchmark, we use data classes decorated by Pydantic. If the validation decorator is changed to built-ins we have serialization speed equal to primitives (raw benchmarks).

https://github.com/mtag-dev/py-rest-stress-testing/tree/squall_integration

stas-dubrovskyi avatar Nov 24 '21 22:11 stas-dubrovskyi

Data classes direct serialization is done by orjson like a charm.

Actually, not really, because orjson only works with raw dataclasses and doesn't cover apischema features like aliasing (for example camelCase), (conditional) field skipping, serialized methods, field ordering, etc. There are some issues opened in orjson repository, for example https://github.com/ijl/orjson/issues/45 about aliasing, but I don't know if they will ever be implemented (and I personally think they should not, it's kind of a slippery slope).

For simple dataclasses however, orjson is indeed fine. By the way, apischema has a passthrough feature that allow arbitrary types like dataclasses to be handled directly by the JSON serializer. Moreover, in the original design of this feature, there was an option for dataclass that allows to automatically pass through dataclasses that use no apischema specific feature. This option has been removed before the release to simplify the code in anticipation of the cythonisation. That option should come back in the next release, even if the performance of the new compiled code makes it less useful: apischema (without dataclass passthrough) + orjson is almost as fast than orjson only for simple dataclasses.

In most cases, API responses don't have validation of values but should check typical mistakes, like an assigned value type, because this is handy for the development process.

For my part, I would more rely on static type checking, but I admit it's quite limited on Python.

Please check how just Pydantic's type-checking impacts performance in real-like scenarios.

Actually, pydantic and apischema are no longer really comparable in terms of performance since the cythonization (coming with the next release). apischema is indeed about 20x faster than pydantic in serialization (using its own benchmark), without type checking; with type checking activated, it still more than 17x. Apischema is very optimized (even pure Python code is still 3x faster than cythonized pydantic code), and so is type checking in serialization.

In the end, I didn't understand your opinion on this issue. Do you think type checking should be the default?

wyfo avatar Nov 24 '21 23:11 wyfo

I did not think about it when I was writing the issue, but the best solution could be to have simply apischema.settings.serialization.type_checking = __debug__ by default. It would disable type checking in optimized/production mode (and we can see type checking as runtime assertion, so they would be disabled with other assertions of the code) and enable it in debug/development mode. Best of two worlds.

wyfo avatar Nov 25 '21 00:11 wyfo

@wyfo all depends on the stage. Options:

  • If it can highlight problems during instance creation, like for example does the strongtyping. It will be handier to present by default, because of useful information from exception.
  • If it is in the serialization stage, probably there is no difference.

Switching off in production is not an option. Because some wrong values can appear from a data source for many reasons. In this case, issue presence will be opaque and investigation will take much longer. This only works for configs/widgets.

Are these improvements already in the master?

stas-dubrovskyi avatar Nov 25 '21 01:11 stas-dubrovskyi

I'll try to integrate apischema in squall ASAP and will provide you feedback based on benchmarks.

stas-dubrovskyi avatar Nov 25 '21 02:11 stas-dubrovskyi

Because some wrong values can appear from a data source for many reasons.

I think you misunderstood this issue. There is no question about deactivating type checking at deserialization. The issue only concerns serialization — that's why the settings key is apischema.settings.serialization.type_checking.

Are these improvements already in the master?

Yes, they are, however, installing from Github with git+https://github.com/wyfo/apischema.git@master will not compile anything. I think you should rather clone the repository, compile the files in place by executing scripts/cythonize.py (which will generate .c files) and then python setup.py build_ext --inplace, and install the local package with pip install -e path/to/cloned/apischema. You must use the same version of Python in the command python setup.py build_ext --inplace than the one you use in your benchmark.

wyfo avatar Nov 25 '21 07:11 wyfo