Fetch rows in JSON format
Summary
This is a draft PR which adds the support for fetching rows in JSON format.
The current Query::fetch fetches the results using FORMAT RowBinary only which requires a strict 1:1 mapping between the query schema and the deserialized type.
And that has a lot of limitations:
- #10
- #101
- #53
- etc
This PR allows the row deserialization into a T: Deserialize which eliminates the limitations of Query::fetch:
- when the table schema is not known:
SELECT * from ? - when the table schema is not specified:
DESCRIBE TABLE ? - when we read less columns than we select
Unresolved questions
- am aware (well, now) of https://github.com/suharev7/clickhouse-rs but while that library lifts some of the limitations, it's very different from this one and I'd rather stick to this in hope that #10 will eventually be supported
- the naming
Query::json,Query::json_one,Query::json_allis very unfortunate- maybe,
Query::fetch_json,Query::fetch_json_onewould be better albeit verbose
- maybe,
Query::jsonrequires thewatchfeature which is a bit misleading - we can improve this
Checklist
Delete items not relevant to your PR:
- [x] Unit and integration tests covering the common scenarios were added
- [ ] A human-readable description of the changes was provided so that we can include it in CHANGELOG later
- [ ] For significant changes, documentation in README and https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials
Could it be a more flexible approach to provide a "raw" method such as
pub fn query_raw(format: DataFormat) -> Result<RawCursor>
So that then it is possible to extract either raw bytes or a JSON repr from that RawCursor? This way, it will be possible to support all JSON types and streaming into files with CSV/TSV/Parquet, etc.
WDYT?
CC @serprex, @loyd, @mshustov
Let's extract Debug impl for Query as a separate PR as it addresses https://github.com/ClickHouse/clickhouse-rs/issues/146 (thank you!), so that it can be merged quickly, while we are discussing the rest.
+1 on naming json fetch_json etc
Could it be a more flexible approach to provide a "raw" method such as
@slvrtrn That would be useful on its own as well - see #154, #60, etc.
First of all, the only reason why this crate already contains (conditionally!) serde-json is only WA for WATCH, which will go away with moving to the Native format.
Secondly, the initial problem is the absence of dynamic schemas, not the "I want exactly JSON" problem.
Thus, we should avoid any additional formats that should be covered by basic cursor-like API (Apache Arrow can have another column-based API, for instance). I want to remember that the semantics of the current query() API is not "give me RowBinary". It's about providing Rust structures. RowJsonCursor<T> seems very unobvious to me. JSON is a string in some predefined format and it isn't related to any T.
By providing such specific interfaces, we're opening a black hole: the next request will be "I want CSV format", "I want Vertical (lol) format", and so on. All these formats will be difficult to support later, because Bytes -> T deserialization can be very differently implemented (think about Native or any other column-based formats). So, I think that Cursor<T> should have only one possible implementation (or several ones if it's required for some sort of WA), but only as implementation details. Yep, it can even use JSON internally if the user wants schemas not supported by RowBinary, but it's not "I want JSON"; it's "I want dynamic schemas" flag/inference.
@slvrtrn's suggestion about query_raw seems much more flexible and doesn't lead to new dependencies. But, again, it doesn't provide any -> T deserialization, only async iterator over chunks. And it's great: if a user wants to parse it on his own, it's okay.
Secondly, the initial problem is the absence of dynamic schemas
Correct. I stated the original problem: it's impossible right now to get the result of DESCRIBE TABLE using clickhouse-rs. And https://github.com/ClickHouse/clickhouse-rs/issues/10 is 3 years old already.
Of course, query_raw could also cover this issue (even if everybody will be forced to write their own deserializer) - but when? If another 3 years, then it's better to have at least something that unblocks use cases similar to #10.
If only we had a working PostgreSQL interface, we'd stick with sqlx. Or a native sqlx-clickhouse driver would be even more perfect.
query_raw could also cover this issue (even if everybody will be forced to write their own deserializer) - but when?
Actually, I think we can implement query_raw pretty fast. At least, I don't see any issues with it now.
If another 3 years
I totally understand your displeasure. There are several points why it wasn't resolved in time:
- This crate was a personal initiative during my free time so that I couldn't solve any more-than-trivial issues unrelated to my needs.
- Moving to
RowBinaryWithNamesAndTypesseems to result in a performance penalty or unmaintainable code.Nativecould potentially solve these problems, but it's undocumented and, formally, unstable.
Although, we can use NamesAndTypes only for deserialize_any (dynamic schemas) and probably it won't affect anything.
I hope both reasons become irrelevant now that ClickHouse's team members have begun working on this crate.
Or a native sqlx-clickhouse driver would be even more perfect.
Maybe yes, maybe not. It's not a simple statement. I liked the SQLx-like approach much more until we (I mean, in my team) started actively using it (for PostgreSQL), and.. ok, let's say that it's full of pain for multiple nontrivial tasks. But I don't want to delve into this topic here, it's worth considering after the sqlx support by detailed comparison and benchmarking. I'll be totally fine if CH env team decides to focus on sqlx instead as the official approach.
So, query_raw()? Anyway it's useful to have in this crate.
So, query_raw()?
@loyd All right. Thanks for the very detailed explanation about the background of this!
@pravic, do you think something like this could work instead - https://github.com/ClickHouse/clickhouse-rs/pull/182? That draft also contains potential usage examples.
Closing since we have Query::fetch_bytes now, and a few related examples: