clickhouse-connect icon indicating copy to clipboard operation
clickhouse-connect copied to clipboard

Support Native Interface

Open genzgd opened this issue 1 year ago • 20 comments

The native interface still retains certain advantages over HTTP, including better compression and progress packets. It would be nice to support both options. At this point this would mean:

  • creating a nativeclient subclass and implementing the required methods
  • building a transport layer that converted incoming native protocol payloads to byte blocks and sent the same generated payload we now use over HTTP

However this looks like a lot less work now than it did last March

genzgd avatar Jan 09 '23 00:01 genzgd

From related issue above:

Finally I agree that using the ClickHouse native interface can make things even better, so that is now a planned feature. https://github.com/ClickHouse/clickhouse-connect/issues/91

What's wrong with clickhouse-driver? Except official support from ClickHouse Inc.

It's well documented and has excellent (96%) code coverage. Many people use it in production for years. It is also in top 1% downloads on pypi: https://pypistats.org/packages/clickhouse-driver.

It’s good to see you chasing it in features and performance. But what’s the profit for the community in another one package?

xzkostyan avatar Jan 16 '23 09:01 xzkostyan

clickhouse-driver is a mature, solid project and we continue to recommend it the ClickHouse users in appropriate circumstances. However, there are many ClickHouse users who can't take advantage of clickhouse-driver because of its lack of HTTP support; in particular for deployments that depend on ChProxy or other HTTP caches/load balancers. HTTP support also makes ClickHouse Connect more compatible for customers who also use JDBC based tools like Airbyte or DBeaver. Finally ClickHouse Connect also has better compatibility with Superset out of the box.

At this point ClickHouse Connect performance is nearly equal to that of clickhouse-driver, and we continue to make performance improvements on a regular basis. Native support is just one of those planned improvements, and as I noted in the issue, at this point it's not a lot of work to replace the HTTP transport layer with a native protocol implementation. (It's also not our highest priority.)

I guess I would reverse the question -- what's the harm to the community in having two options for drivers over the native protocol?

genzgd avatar Jan 16 '23 14:01 genzgd

It good to see clickhouse-connect rapid development. You did a good job. As I can see current key points of in making new package are:

• proxing: load balancing/caching • pluggable transports • suite of all packages in one place • Superset integration

Please correct me if I wrong.

I guess I would reverse the question -- what's the harm to the community in having two options for drivers over the native protocol?

Well, the difficulty of choice at the first.

Let's think I'm newbie to ClickHouse and I need to get data from it in Python in most effective way. I need to learn about different internals: FORMATs (Native/HTTP/etc.), transport layers (TCP, HTTP), cache/load balancers. At the beginning I have one problem: how to read data from ClickHouse. At the end I have at least two problems: how to read data from ClickHouse and how to choose right driver for it. Of course I can spend some time to learn about internals. But should I do in the very beginning?

There was another point: what if I picked "wrong" package and already write code in it? Then I need to pick another one library and rewrite code because another library have another interface. It’s natural for different libraries of course, but it's sad for me as developer that just wants to work with ClickHouse at the first time. There are a many such examples in Russian telegram chat.

How you would continue to recommend clickhouse-driver the ClickHouse users in appropriate circumstances? As far as I can see there is only a little note about not tested libraries in 3-rd party interfaces: https://clickhouse.com/docs/search?q=clickhouse-driver. This is not a recommendation. However clickhouse-driver is the most popular Python driver for ClickHouse by daily downloads. It seems that ClickHouse Inc are literally forcing users to use clickhouse-connect without alternative.

There are TCP load balancers such as nginx/haproxy: https://kb.altinity.com/altinity-kb-setup-and-maintenance/load-balancers. Of course they are not so flexible as HTTP. But you can even balance queries on client-side with simple round-robin. There is an option for it: https://clickhouse-driver.readthedocs.io/en/latest/features.html#multiple-hosts.

Both clickhouse-driver and clickhouse-connect process data in Native format but use different transports: TCP and HTTP. You make a good project but you ended with pluggable TCP transport in clickhouse-connect after a year of development. The same thing (pluggable HTTP transport with data in Native) format could be added to clickhouse-driver as optional feature without year of development. Why do community need brand new package with the same features instead of making few PRs to existing projects: clickhouse-driver and clickhouse-sqlalchemy? Those packages been widely used by community year ago and still used now.

I don't have anything against suite package but I don't understand why you have to embed "core" packages in the suite. If you want to build "suite" package just specify "core" packages as requirements. This will make suite easier to maintain: minor changes in core packages will not require new version of the suite. As I can see the suite still relies on the SQLAlchemy. For some good reason this package is not embedded.

Please don't get me wrong, I don't want to teach you how to make packages or write software. I have no doubt in it. All my points are about making better packages for the community and focusing on better integration with the Clickhouse server by reusing existing software not rewriting it without good reason for it. And focusing on the new integrations like Superset, etc.

xzkostyan avatar Jan 19 '23 08:01 xzkostyan

I've recommended clickhouse-driver to customers with high volume queries or inserts because of the performance advantage. For example, https://github.com/ClickHouse/clickhouse-connect/issues/56#issuecomment-1284844497. As clickhouse-connect has improved in performance that difference has narrowed significantly and in many cases clickhouse-connect now has the advantage even though it uses HTTP.

And while we certainly do not force anyone to use it, for the best up to date support and compatibility with other tools, we do think most users in the long run will be better off with an officially supported ClickHouse Inc. driver. (More than 10 months after it was released as an experimental feature, clickhouse-driver does not yet support JSON column inserts).

My first interest in a Python HTTP driver started when I was at Comcast building an analytics platform based on ClickHouse. We wanted to use Superset but had only HTTP proxies and load balancers at the time to access a cluster of nearly 100 ClickHouse nodes.

To be clear this has not been an actual year of development. ClickHouse Connect started as a side project using RowBinary while I was still a ClickHouse support engineer. I came across this issue early on:

Feel free to fork the project if you really need it. But I'll recommend you to make HTTP driver from scratch or collaborate with authors of existing HTTP drivers written in Python. -- xzkostyan

We continued to have demand for better Superset compatibility, so I kept working on the project. We also took over dbt-clickhouse and HTTP support was one of the most requested features, so that was another incentive to build a solid driver. I wasn't satisfied with my original RowBinary performance so I started investigating the Native format. The design and performance of the project have grown organically as I have had time to work on it.

(As for the suite aspect, that was originally for simplicity to provide Superset support with a dynamic EngineSpec. I don't know that we'll ever expand SQLAlchemy support beyond what's necessary for Superset (I think SQLAlchemy is an outdated technology, with a very cumbersome design), so making that into a stand alone project doesn't make a lot of sense, and we actually do plan on migrating the small Superset pieces to Superset itself. I don't claim the suite itself is a particularly good thing.)

I assume you've known about clickhouse-connect for quite some time (it's been public since May or June I think) so it's unfortunate that you waited until this issue targeting native protocol support was opened to express your concerns. It's also unclear what you'd like us to do at this point. I think the time has passed to add HTTP support to clickhouse-driver.

Finally, as I've said, clickhouse-driver is a mature and stable product and there's no doubt that it's been by far the best option for Python/ClickHouse connectivity for years. It's therefore not surprising that it's widely used in ClickHouse projects and therefore popular on pypi. But I obviously believe that ClickHouse Connect is a strong alternative, with a solid design, very few bugs (that are quickly addressed), good performance, and official ClickHouse Inc. support.

genzgd avatar Jan 19 '23 10:01 genzgd

Can we merge these two drivers together and make the transport protocol choice available under a common interface?

alexey-milovidov avatar Feb 03 '23 05:02 alexey-milovidov

I suppose we can. But how should we do it?

xzkostyan avatar Feb 07 '23 21:02 xzkostyan

I assume you've known about clickhouse-connect for quite some time (it's been public since May or June I think) so it's unfortunate that you waited until this issue targeting native protocol support was opened to express your concerns. It's also unclear what you'd like us to do at this point.

Ha-ha :) That reminds me "Hitchhiker's guide to the galaxy":

“But Mr Dent, the plans have been available in the local planning office for the last nine months.” “Oh yes, well as soon as I heard I went straight round to see them, yesterday afternoon. You hadn’t exactly gone out of your way to call attention to them, had you? I mean, like actually telling anybody or anything.” “But the plans were on display …” “On display? I eventually had to go down to the cellar to find them.” “That’s the display department.” “With a flashlight.” “Ah, well the lights had probably gone.” “So had the stairs.” “But look, you found the notice didn’t you?” “Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard’.”

can't take advantage of clickhouse-driver because of its lack of HTTP support

In my opinion the only advantage of HTTP is simplicity for driver creators. So if you want to write driver in 20 minutes - go with HTTP. The rest are DISadvantages (see below),

However, there are many ClickHouse users who can't take advantage of clickhouse-driver because of its lack of HTTP support; in particular for deployments that depend on ChProxy or other HTTP caches/load balancers.

Instead of creating new driver on the top of HTTP it would be better to add support for native protocol to chproxy.

The HTTP-protocol in clickhouse is bad for 'universal' drivers (and always was). In general, HTTP is not designed for RDBMS usage: the main problem is lack of multiplexing there, and several other problems.

  1. exceptions inside the message stream:

Simple test which all drivers build on the top of HTTP fail:

echo "SELECT throwIf(number=8, 'Ups, it is the error inside the data stream') as x FROM numbers(10)" | curl -D- 'http://localhost:8123/?max_threads=1&max_block_size=1&default_format=Native' --data-binary @- | xxd 
## Those settings (max_threads & max_block_size are tuned just to simplify the test, the same can happen with normal settings too

Yes, you can use wait_end_of_query=1 as a workaround, but it will affect the performance and diskspace on the server.

  1. no ability to cancel the query when client is enough of waiting:
echo "SELECT max(toString(number)) as let_me_burn_your_cpu FROM numbers_mt(100000000000)" | timeout -v 2s curl -D- 'http://localhost:8123/?max_threads=32' --data-binary @-
# so the client was killed, but what is happening on the server? Psst, check SHOW PROCESSLIST!

You can (try) to use cancel_http_readonly_queries_on_client_close=1 (of course if you know about that problem upfront, but usually you find it after some silly clients kill your server)

  1. try to use send_logs_level / get query statistics / Profile events / and other clickhouse native protocol features.

  2. HTTP is stateless; you can start some session if needed (but then you're responsible for it, and you can not close it earlier).

  3. there are some problems in HTTP implementation in ClickHouse (caused by Poco) - for example you don't know how many clients are connected to HTTP port (if they don't execute query currently).

  4. 'multiquery is not supported'

There was some old proposal how to make HTTP itself better: https://github.com/ClickHouse/ClickHouse/issues/6294 w/o introducing new ports (like for WebSocket) or new protocols (like for grpc), simple and portable, but it was never implemented.

Can we merge these two drivers together and make the transport protocol choice available under a common interface?

Since it's impossible to close that Pandora box again - it would be best to merge them.

filimonov avatar Feb 08 '23 14:02 filimonov

I think the Hitchhiker's Guide comparison is a bit unfair :) -- it's been on our website for quite some time and mentioned in Alexey's webinars. Again no one seemed bothered by it until I opened this issue targeting native protocol support.

I generally agree with the limitations of the HTTP protocol and the ClickHouse implementation in particular. A fair amount of the work in clickhouse-connect has been to try to overcome and/or mitigate those problems. And of course I would welcome ClickHouse improvements in that area (HTTP2/web sockets would go a long way to solving those issues).

That said, the most popular driver for ClickHouse is clickhouse-jdbc, which has been HTTP based for years, and I can assure you that when dealing with the security and network teams of large organizations, having an HTTP based protocol vastly simplifies things. (In the security context, NOT supporting multiquery is actually an advantage, as it makes SQL injection attacks significantly more difficult).

I'm open to ideas about how to merge the projects, but like @xzkostyan I'm not sure what that would look like.

genzgd avatar Feb 08 '23 16:02 genzgd

@xzkostyan I will send you an invitation to a meeting to discuss our options

mshustov avatar Feb 08 '23 17:02 mshustov

I think I've repeatedly answered the question of "why creating new". Ideal or not, HTTP support is in high demand in the community (see dbt, Grafana -- using the Altinity Grafana plugin until last year, various jdbc based products like DbBeaver and Airflow, CHProxy, my own experience at Comcast with a 650TB/day cluster). At this point it's all well and good to say "the [still undocumented] native protocol is 'better'" but that doesn't help a community that (for whatever reason) prefers HTTP and the existing HTTP based ecosystem. Finally, the author of clickhouse-driver explicitly rejected HTTP support https://github.com/mymarilyn/clickhouse-driver/issues/275, https://github.com/mymarilyn/clickhouse-driver/issues/176 and said "build it from scratch" (or work with one of the existing HTTP drivers, which are really basic and slow).

So after building something "from scratch", you might understand why I'm somewhat frustrated with being attacked for looking to improve it by implementing the "better" native protocol as an option. ClickHouse Connect already performs on par with clickhouse-driver even though it uses HTTP (and HTTP query performance will improve further when https://github.com/ClickHouse/ClickHouse/pull/45593 is released, presumably this month) and significantly better in some circumstances (like Pandas dataframes). And yes, I'm obviously biased, but I think the rapid improvements in and responsive support for this package are valuable to the ClickHouse community even if the project doesn't have the long pedigree of clickhouse-driver.

That said, in retrospect I should have coordinated better with @xzkostyan when it became clear my side project was gaining traction, and we should have made more of an effort to bring that project under the ClickHouse Inc. umbrella. I'm probably guilty of having just a bit too much fun making something that I think is a pretty solid driver.

genzgd avatar Feb 08 '23 22:02 genzgd

Hi @genzgd, @xzkostyan, @alexey-milovidov, and @filimonov, it seems to me there are a couple of levels of issues here.

First practical. I'm skeptical that merging drivers is better for users, unless it's done by extending the Python clickhouse-driver. It's been around for years and there are many thousands of applications based on it. Their owners are not going to port to a new driver. Making backwards-incompatible changes through some sort of "split-the-difference" approach would also hurt them. Think Python2 -> Python3 and you know what I mean. It would likely result in 3 drivers, not two.

At the same time it's not bad to have an open market for drivers, and in fact any software related to ClickHouse. That's the quickest way to make improvements.

So if we don't merge and nobody wants to give up their work, what do we do? Have two drivers and acknowledge each other's work. Support them both. Let users decide.

Next the strategic issues. Having unnecessary fights over basic infrastructure components is not good for users and not good for ClickHouse as a whole, including anyone who makes their living from it. We're wasting resources and fragmenting effort. It leads to a poor user experience. It also takes away focus from getting features like async insert or TTLs to work easily and reliably. Databases win on core functionality, not on drivers.

I suggest that we back things up a bit and take this as an opportunity learn how to collaborate better as a community. In this particular case it would perhaps have been better for a new driver to start as a fork of clickhouse-driver. That's a common way to improve things and would have given users a natural path to a better driver if that's how it turns out. Or everybody could have agreed up front and attacked it as a community using code from a single project.

That's water under the bridge. It's not a big deal if it happens occasionally. What would be bad is to repeat this pattern on every piece of software related to ClickHouse. That's a point that deserves some collective thought.

hodgesrm avatar Feb 09 '23 00:02 hodgesrm

A quick codicil: databases have traditionally prospered with both ODBC and native clients, which have a lot of overlap. It's not winner take all by any means.

hodgesrm avatar Feb 09 '23 00:02 hodgesrm

@hodgesrm Not surprisingly I share your skepticism about merging drivers, as they are fairly opinionated about how to handle datatypes and have different approaches for low level streaming/buffer/data conversion. Also for the record when I started this experiment I could not have done a fork of clickhouse-driver that used HTTP, as I had no experience whatsoever with the native protocol or Native format (but a lot with HTTP and RowBinary), and the whole journey has been much more complex than I first imagined.

I also agree that basic competition in software is good. ClickHouse would not exist if someone had just said "oh, well fork MySQL to do analytic workloads, it's got a lot of stars on Github" (did Github exist in 2008?) I also think the current highly competitive environment, where ClickHouse has become a popular target, has pushed us to make it a better product.

However, I don't think it's quite so simple as "let the users decide", since that doesn't tell us where to direct resources or what to emphasize with our customers. I think a significant reason for this "fight" is that ClickHouse Inc is officially supporting clickhouse-connect and giving somewhat short shrift to clickhouse-driver. Now we can and probably should redress that balance somewhat, but as a company we are naturally going to gravitate toward one or the other (since it's not going to be easy to merge them like we did with ch-go and clickhouse-go), and we are also naturally going to offer better support for an option that we control.

I further believe that Python support is a pretty core ClickHouse requirement. Snowflake and BigQuery offer "first class citizen" status for their Python integrations (not to mention DuckDB), and the data science/ML world is one of our biggest target markets. I've been quite surprised about the number of questions and issues I see about Pandas which so far have exceeded anything related to native Python data types. For that reason and others I don't consider working on an improved driver "a waste of resources", and I don't think the effort expended is taking away anything from other undertakings like async inserts (I doubt Alexey is going to draft me into the core C++ work any time soon).

Anyway, the upshot is I don't have an answer, and the strategic choice is above my pay grade. Again we should have touched based with the clickhouse-driver project earlier, and I should have pushed harder against the "I don't want to do HTTP" stance I saw in those issues, and those are indeed lessons going forward. But I also have to say that the responsibility for our current state is not entirely on my shoulders.

genzgd avatar Feb 09 '23 01:02 genzgd

Hi @genzgd, thanks for the thoughtful response.

It seems like a missed opportunity not to work on drivers together. It still leaves open the question of what you think would be the right response to users who have committed to clickhouse-driver. I would expect them to continue to use @xzkostyan's work. We collectively have an obligation to support them.

hodgesrm avatar Feb 09 '23 06:02 hodgesrm

clickhouse-driver is a mature, solid project and we continue to recommend it the ClickHouse users in appropriate circumstances. However, there are many ClickHouse users who can't take advantage of clickhouse-driver because of its lack of HTTP support; in particular for deployments that depend on ChProxy or other HTTP caches/load balancers. HTTP support also makes ClickHouse Connect more compatible for customers who also use JDBC based tools like Airbyte or DBeaver. Finally ClickHouse Connect also has better compatibility with Superset out of the box.

@genzgd, should we add a note with these recommendations to the new python driver documentation? Because currently the documentation is confusing and it suggests the new driver by default. And it's really hard to find the old one using a search in the documentation. A note will explain that an HTTP-based clickhouse-connect should be used when you have ChProxy/other HTTP balancers or use JDBC-based tools, and Native-based clickhouse-drivers should be used in all the remaining cases because it's better and HTTP interface has issues by design.

tavplubix avatar Feb 10 '23 13:02 tavplubix

FYI, as one of chproxy maintainers, we're working to make chproxy compatible with the native protocol. Apart from performances, one of the reasons for this feature was that some successful drivers (like clickhouse-driver) were not compatible with HTTP.

mga-chka avatar Feb 10 '23 18:02 mga-chka

As someone with no hard requirements on HTTP, the current state of the JSON issue leaves me with a very different impression @tavplubix and I think it'd be a disservice to new users to recommend it as a defacto option.

I'm new to Clickhouse and unfortunately spent a non-trivial amount of time getting clickhouse-driver to work on AWS Lambda, only to stumble upon the JSON issue linked above.

I don't want to downplay what's a really impressive effort from an outside contributor, but it definitely seems less risky to go with something driven by Clickhouse, and it's not surprising to see the official docs reflect that.

selalipop avatar Feb 28 '23 10:02 selalipop

@genzgd What about gRPC support? Looks pretty "HTTP" to me. :)

pkit avatar Mar 02 '23 16:03 pkit

gRPC would actually be fairly easy, but you're the first person I've seen mention it. :) Also my limited review of the ClickHouse gRPC interface suggests that the advantages are fairly limited, since it doesn't really provide low level object types (just a Result.output object that contains a binary blob of the actual response in the selected ClickHouse output format). Accordingly it feels like in some ways it would be harder to implement streaming, for example.

In any case, feel free to open a feature request and see if you can generate additional interest, and we'll take that in account in our priorities.

genzgd avatar Mar 02 '23 16:03 genzgd

would be harder to implement streaming

streamable Result is available :) https://github.com/ClickHouse/ClickHouse/blob/bd5d7f3f81c33ae9d76f15af1888041161f51269/src/Server/grpc_protos/clickhouse_grpc.proto#L222

pkit avatar Mar 02 '23 16:03 pkit