Should we support querystring / `x-www-form-urlencoded` messages?
URL querystrings/`x-www-form-urlencoded` forms are structured but untyped messages. The Python standard library has a few tools for encoding/decoding these:
```python
In [1]: import urllib.parse

In [2]: urllib.parse.parse_qs("x=1&y=true&z=a&z=b")
Out[2]: {'x': ['1'], 'y': ['true'], 'z': ['a', 'b']}
```
This is annoying to work with manually because the output is always of type `dict[str, list[str]]`. This means that:
- The string values have to be manually cast to the expected types
- Fields where you expect a single value have to be validated (or only the last value used)
- Missing required fields and default values have to be manually handled (a short sketch of this boilerplate follows below)
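For example (a sketch of my own, not part of the proposal), handling even the three parameters above by hand ends up looking something like:

```python
import urllib.parse

raw = urllib.parse.parse_qs("x=1&y=true&z=a&z=b")

# every field needs its own cast / single-value check / default handling
try:
    x = int(raw["x"][-1])                     # required int; take the last value
except (KeyError, ValueError):
    raise ValueError("missing or invalid 'x'")
y = raw.get("y", ["false"])[-1] == "true"     # optional bool with a default
z = raw.get("z", [])                          # multi-valued str field
```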
A library like Pydantic may be used to ease some of the ergonomic issues here, but adds extra overhead.
Since msgspec is already useful for parsing JSON payloads into typed & structured objects, we might support a new querystring encoding/decoding that makes use of msgspec's existing type system to handle the decoding and validation. A lot of the code needed to handle this parsing already exists in msgspec; it's mostly just plumbing to hook things together. For performance, I'd expect this to be ~as fast as our existing JSON encoder/decoder.
Proposed interface:
```python
# msgspec/querystring.py
from typing import Any, Type, TypeVar

T = TypeVar("T")


def encode(obj: Any) -> bytes:
    """Encode an object as a querystring.

    This returns `bytes`, not `str`, since that's what `msgspec` returns for
    other encodings.
    """
    ...


def decode(buf: bytes | str, type: Type[T] = dict[str, list[str]]) -> T:
    """Decode a querystring.

    If `type` is passed, a value of that type is returned (or an error is
    raised). If `type` is not passed, a `dict[str, list[str]]` is returned
    containing all passed query parameters. This matches the behavior of
    `urllib.parse.parse_qs`.
    """
    ...
```
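To make that concrete, here's a hypothetical usage sketch; the `Params` struct and the example values are my own, and `msgspec.querystring` of course doesn't exist yet:

```python
from msgspec import Struct
from msgspec import querystring  # hypothetical module proposed above

class Params(Struct):
    x: int
    y: bool = False
    z: list[str] = []

# without a type, behave like urllib.parse.parse_qs
querystring.decode(b"x=1&y=true&z=a&z=b")
# {'x': ['1'], 'y': ['true'], 'z': ['a', 'b']}

# with a type, cast, validate, and apply defaults in one step
querystring.decode(b"x=1&y=true&z=a&z=b", type=Params)
# Params(x=1, y=True, z=['a', 'b'])
```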
Proposed encoding/decoding scheme:
- Nested objects are not supported due to querystring restrictions. We don't try to do anything complicated like Rails or Sinatra do (i.e. no `foo[][key]=bar` stuff).
- A valid `type` must be a top-level object-like (struct, dataclass, ...) type, mapping fields to value types.
- The following value types are supported:
  - `int`, `float`, `str`, and str-like types (datetimes, ...) map to/from their str representations, quoting as needed
  - `bool` serializes to `"true"`/`"false"`. When deserializing, `""`, `"1"` and `"0"` are also accepted (are there other common values?)
  - `None` serializes as `""`. When decoding, `"null"` is also accepted.
  - Sequences of the above (e.g. `list`/`tuple`/...) map to/from multiple values set for a field. So a field `a` with value `("x", None, True, 3)` would be `"a=x&a=&a=true&a=3"`
- All builtin constraints are also supported.
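To make the mappings concrete, here's a small worked example (my own; `querystring.encode` is still hypothetical):

```python
from msgspec import Struct

class Example(Struct):
    name: str
    count: int
    active: bool
    note: str | None
    tags: list[str]

obj = Example(name="a b", count=3, active=True, note=None, tags=["x", "y"])

# querystring.encode(obj) would be expected to produce:
#
#   b"name=a+b&count=3&active=true&note=&tags=x&tags=y"
#
# str/int/float -> their str representations, quoted as needed ("a b" -> "a+b")
# bool          -> "true"/"false"
# None          -> ""
# sequences     -> the field repeated once per element (tags=x&tags=y)
```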
Questions:
- Do the above encodings make sense?
- Do the restrictions on supported types make sense? In particular, note the no-nested-objects/sequences restriction.
- Are there other options we'd want to expose on `encode`/`decode`? The stdlib also exposes a few options that I've never needed to change:
  - `max_num_fields` to limit the number of fields when decoding
  - `separator` to change the character used for separating fields (defaults to `&`).
- Is `msgspec.querystring` the best namespace for this format, or is there a better name we could use?
- Does this seem like something that would be useful to have in msgspec? The intent here is for `msgspec` to handle much of the parsing/validation that a typical web server would need to handle in a performant and useful way.
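For reference, those two stdlib options look like this with `urllib.parse.parse_qs` (standard library behavior, not part of the proposal):

```python
import urllib.parse

# cap the number of fields parsed (a guard against abusive inputs)
urllib.parse.parse_qs("x=1&y=2", max_num_fields=10)
# {'x': ['1'], 'y': ['2']}

# use ';' instead of '&' to separate fields
urllib.parse.parse_qs("x=1;y=2", separator=";")
# {'x': ['1'], 'y': ['2']}
```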
Some prior art:
- https://github.com/nox/serde_urlencoded
- https://github.com/gorilla/schema
> Do the above encodings make sense?
They sound very reasonable. I specifically like the aspect of optionally coercing values into common expected types (int/bools and such)
> Do the restrictions on supported types make sense?
I'd say so. If users need more complex parsing, they could always implement it on top of msgspec's output.
> Is `msgspec.querystring` the best namespace for this format?
To me this depends on the direction you want to take the library. If it's, as you said, to provide general parsing/validation utilities commonly needed for webservers, then this would be a good namespace, since it follows the established scheme (`msgspec.json`, `msgspec.msgpack`) and could easily be extended (e.g. `msgspec.multipart`).
This brings me to my question regarding msgspec's mission statement:
As already expressed on Discord, I would welcome msgspec seeing itself as a "one-stop shop offering fast, correct, and type-safe parsing/validation for webserver needs", since this is currently somewhat missing in Python. There are many projects that do parts of it, but installing 4 different libraries, each with a very narrow scope, often isn't desirable, so it would definitely fill a niche there.
The question, however, is what you think the scope of this would be. If query strings and form-urlencoded data are supported, I'd say multipart would make sense as well. How about other things like URLs? If msgspec already parses query strings, that would kind of fit in.
This scheme makes a lot of sense for URL queries. I have been using a similar scheme with msgspec to serialize parsed database query tokens into "kwarg-like" collections that I can (evilly) `eval()`, obviously with similar caveats. I like your idea better.
I’m for it for the simple fact that the validation of POST JSON data and query string data should (in my mind) be handled by the same system. This makes producing error messages on bad requests consistent, as well as the interface for passing input data from the request to database operation handlers.
I am currently using Falcon in an experimental system, and it has support for plucking items out of a query string one at a time and coercing them into Python objects (e.g. `request.get_param_as_datetime`), but it doesn't have the same kind of constraints that msgspec has (for instance, enforcing that a value is time zone aware).
I would appreciate the ability to define a msgspec Struct and decode a query string into that object, or throw errors in the same way as a JSON payload.
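For what it's worth, something close to this is already possible today by combining `urllib.parse.parse_qs` with `msgspec.convert`. The sketch below is my own workaround (the `Search` struct and `decode_query` helper are made up), and it assumes I'm reading msgspec's lax-mode (`strict=False`) coercion rules correctly:

```python
import urllib.parse
import msgspec

class Search(msgspec.Struct):
    q: str
    limit: int = 10
    exact: bool = False

def decode_query(raw, type):
    parsed = urllib.parse.parse_qs(raw.decode() if isinstance(raw, bytes) else raw)
    # collapse single-element lists so scalar fields line up with their types
    flat = {k: v[0] if len(v) == 1 else v for k, v in parsed.items()}
    # strict=False lets msgspec coerce the strings into the annotated types,
    # raising its usual validation errors on bad or missing fields
    return msgspec.convert(flat, type, strict=False)

decode_query("q=hello&limit=5&exact=true", Search)
# Search(q='hello', limit=5, exact=True)
```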