Same Connection executes queries on different nodes under certain DNS settings
This is the setup that we have in our company:
- CH cluster (multiple nodes with different IPs)
- DNS server that resolved an internal hostname to the list of IPs of CH nodes
- CH clients that have low setting for
networkaddress.cache.ttlandsun.net.inetaddr.ttlto enable DNS-based service discovery and load balancing - CH client creates
Connectionand executed bunch of dependent requests:- create "temporary" † tables
- fill these tables with data
- query these tables
- drop tables
† "temporary" tables are not really temporary due to the restrictions of the CH HTTP API, so these tables have to be dropped manually.
And here is what happened:
Sometimes during the long "session" java's DNS cache was flushed, and since certain point of time CH JDBC requests started to query another CH node, despite the fact that all queries were made within the same CH Connection.
We found a workaround: currently we're finding all IPs for the hostname ourselves and creating CH Connection with one specific IP.
So, here is the question: what is the semantics of CH Connection? Shouldn't it provide any "visibility" guarantees (like if you successfully created temporary table and then try to query it within the same Connection, you are guaranteed to see that table)? If it should, then probably jdbc driver has to somehow guarantee that all queries within the same Connection go to the same CH node regardless of the client DNS configuration.
This is not a bug report, it's more of a question/suggestion/discussion.
Thanks.
Hello, thanks for sharing our experience. Currently this is expected because current ClickHouseConnection implementation is not a connection itself, it is just an object holding a separate instance of http client and some settings including host name provided in jdbc url or other way. This means that host name is resolved on every statement execution having cache in mind. This is just the simplest solution possible to handle clickhouse queries and I think the most reasonable way to connect via http interface wrapped into JDBC interface. In current setting I see the most usable configuration is to have long-lived "connections" (in our applications they live for the whole life of app) to avoid unnecessary creation and deletion of http clients with their own pools, having in mind that the host is resolved on each query execution.
I agree that with no knowledge of the implementation I would also think that statements executed in one connection are executed at the same host. I've checked the javadocs for Connection, but haven't found notion of this semantics. If you have found it somewhere, please attach a link.
As I see your workaround is the only reasonable way to achieve the guarantees you want with current implementation. As I understand this would be more natural if the driver used native protocol instead of http, but this is quite a lot of work.
Hi @serebrserg,
Thanks for your response.
If you check the specs for JDBC, on page 51 (chapter "Connections", second paragraph) you'll find the following:
From the JDBC driver perspective, a Connection object represents a client session. It has associated state information such as user ID, a set of SQL statements and result sets being used in that session, and what transaction semantics are in effect.
I think it's reasonable to expect that you can see changes from the previous operations within the same session (without the need to invent any workarounds).
I understand, that currently there is no easy implementation for this semantics, as CH http API doesn't support sessions and HTTP itself is stateless. I'm also not sure that making our workaround the default strategy will be beneficial for all CH JDBC users.
I just wanted to make you aware of this problem. I believe the "resolution" for now could be making this limitation explicit in CH documentation and prioritizing the implementation of the sessions in HTTP API.
Thanks.