clickhouse.rs icon indicating copy to clipboard operation
clickhouse.rs copied to clipboard

Fetch rows in JSON format

Open pravic opened this issue 1 year ago • 10 comments

Summary

This is a draft PR which adds the support for fetching rows in JSON format.

The current Query::fetch fetches the results using FORMAT RowBinary only which requires a strict 1:1 mapping between the query schema and the deserialized type. And that has a lot of limitations:

  • #10
  • #101
  • #53
  • etc

This PR allows the row deserialization into a T: Deserialize which eliminates the limitations of Query::fetch:

  • when the table schema is not known: SELECT * from ?
  • when the table schema is not specified: DESCRIBE TABLE ?
  • when we read less columns than we select

Unresolved questions

  • am aware (well, now) of https://github.com/suharev7/clickhouse-rs but while that library lifts some of the limitations, it's very different from this one and I'd rather stick to this in hope that #10 will eventually be supported
  • the naming Query::json, Query::json_one, Query::json_all is very unfortunate
    • maybe, Query::fetch_json, Query::fetch_json_one would be better albeit verbose
  • Query::json requires the watch feature which is a bit misleading - we can improve this

Checklist

Delete items not relevant to your PR:

  • [x] Unit and integration tests covering the common scenarios were added
  • [ ] A human-readable description of the changes was provided so that we can include it in CHANGELOG later
  • [ ] For significant changes, documentation in README and https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

pravic avatar Sep 17 '24 23:09 pravic

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Sep 17 '24 23:09 CLAassistant

Could it be a more flexible approach to provide a "raw" method such as

pub fn query_raw(format: DataFormat) -> Result<RawCursor>

So that then it is possible to extract either raw bytes or a JSON repr from that RawCursor? This way, it will be possible to support all JSON types and streaming into files with CSV/TSV/Parquet, etc.

WDYT?

CC @serprex, @loyd, @mshustov

slvrtrn avatar Sep 18 '24 16:09 slvrtrn

Let's extract Debug impl for Query as a separate PR as it addresses https://github.com/ClickHouse/clickhouse-rs/issues/146 (thank you!), so that it can be merged quickly, while we are discussing the rest.

slvrtrn avatar Sep 18 '24 16:09 slvrtrn

+1 on naming json fetch_json etc

serprex avatar Sep 18 '24 18:09 serprex

Could it be a more flexible approach to provide a "raw" method such as

@slvrtrn That would be useful on its own as well - see #154, #60, etc.

pravic avatar Sep 18 '24 21:09 pravic

First of all, the only reason why this crate already contains (conditionally!) serde-json is only WA for WATCH, which will go away with moving to the Native format.

Secondly, the initial problem is the absence of dynamic schemas, not the "I want exactly JSON" problem.

Thus, we should avoid any additional formats that should be covered by basic cursor-like API (Apache Arrow can have another column-based API, for instance). I want to remember that the semantics of the current query() API is not "give me RowBinary". It's about providing Rust structures. RowJsonCursor<T> seems very unobvious to me. JSON is a string in some predefined format and it isn't related to any T.

By providing such specific interfaces, we're opening a black hole: the next request will be "I want CSV format", "I want Vertical (lol) format", and so on. All these formats will be difficult to support later, because Bytes -> T deserialization can be very differently implemented (think about Native or any other column-based formats). So, I think that Cursor<T> should have only one possible implementation (or several ones if it's required for some sort of WA), but only as implementation details. Yep, it can even use JSON internally if the user wants schemas not supported by RowBinary, but it's not "I want JSON"; it's "I want dynamic schemas" flag/inference.

@slvrtrn's suggestion about query_raw seems much more flexible and doesn't lead to new dependencies. But, again, it doesn't provide any -> T deserialization, only async iterator over chunks. And it's great: if a user wants to parse it on his own, it's okay.

loyd avatar Sep 20 '24 17:09 loyd

Secondly, the initial problem is the absence of dynamic schemas

Correct. I stated the original problem: it's impossible right now to get the result of DESCRIBE TABLE using clickhouse-rs. And https://github.com/ClickHouse/clickhouse-rs/issues/10 is 3 years old already.

Of course, query_raw could also cover this issue (even if everybody will be forced to write their own deserializer) - but when? If another 3 years, then it's better to have at least something that unblocks use cases similar to #10.

If only we had a working PostgreSQL interface, we'd stick with sqlx. Or a native sqlx-clickhouse driver would be even more perfect.

pravic avatar Sep 20 '24 19:09 pravic

query_raw could also cover this issue (even if everybody will be forced to write their own deserializer) - but when?

Actually, I think we can implement query_raw pretty fast. At least, I don't see any issues with it now.

If another 3 years

I totally understand your displeasure. There are several points why it wasn't resolved in time:

  • This crate was a personal initiative during my free time so that I couldn't solve any more-than-trivial issues unrelated to my needs.
  • Moving to RowBinaryWithNamesAndTypes seems to result in a performance penalty or unmaintainable code. Native could potentially solve these problems, but it's undocumented and, formally, unstable.

Although, we can use NamesAndTypes only for deserialize_any (dynamic schemas) and probably it won't affect anything.

I hope both reasons become irrelevant now that ClickHouse's team members have begun working on this crate.

Or a native sqlx-clickhouse driver would be even more perfect.

Maybe yes, maybe not. It's not a simple statement. I liked the SQLx-like approach much more until we (I mean, in my team) started actively using it (for PostgreSQL), and.. ok, let's say that it's full of pain for multiple nontrivial tasks. But I don't want to delve into this topic here, it's worth considering after the sqlx support by detailed comparison and benchmarking. I'll be totally fine if CH env team decides to focus on sqlx instead as the official approach.


So, query_raw()? Anyway it's useful to have in this crate.

loyd avatar Sep 21 '24 21:09 loyd

So, query_raw()?

@loyd All right. Thanks for the very detailed explanation about the background of this!

pravic avatar Sep 22 '24 16:09 pravic

@pravic, do you think something like this could work instead - https://github.com/ClickHouse/clickhouse-rs/pull/182? That draft also contains potential usage examples.

slvrtrn avatar Nov 28 '24 12:11 slvrtrn

Closing since we have Query::fetch_bytes now, and a few related examples:

slvrtrn avatar Jun 09 '25 11:06 slvrtrn