
ClickHouse crashes under concurrent queries

Open ranjeetranjan opened this issue 1 year ago • 9 comments

Problem

I'm getting this error in the ClickHouse DB logs:

<Error> DynamicQueryHandler: Cannot send exception to client: Code: 24. DB::Exception: Cannot write to ostream at offset 1289. (CANNOT_WRITE_TO_OSTREAM), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xb3ac1da in /usr/bin/clickhouse
1. DB::WriteBufferFromOStream::nextImpl() @ 0xb4b9a32 in /usr/bin/clickhouse
2. DB::WriteBufferFromHTTPServerResponse::nextImpl() @ 0x167ab677 in /usr/bin/clickhouse
3. DB::HTTPHandler::trySendExceptionToClient(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, DB::HTTPServerRequest&, DB::HTTPServerResponse&, DB::HTTPHandler::Output&) @ 0x167300c7 in /usr/bin/clickhouse
4. DB::HTTPHandler::handleRequest(DB::HTTPServerRequest&, DB::HTTPServerResponse&) @ 0x16731995 in /usr/bin/clickhouse
5. DB::HTTPServerConnection::run() @ 0x167a44fb in /usr/bin/clickhouse
6. Poco::Net::TCPServerConnection::start() @ 0x19a6608f in /usr/bin/clickhouse
7. Poco::Net::TCPServerDispatcher::run() @ 0x19a684e1 in /usr/bin/clickhouse
8. Poco::PooledThread::run() @ 0x19c25e69 in /usr/bin/clickhouse
9. Poco::ThreadImpl::runnableEntry(void*) @ 0x19c231c0 in /usr/bin/clickhouse
10. ? @ 0x7fb0db337609 in ?
11. __clone @ 0x7fb0db25c133 in ?
 (version 22.3.20.29 (official build))

Error log in Cube.js:

Error: socket hang up
    at QueryQueue.parseResult (/usr/src/app/node_modules/@cubejs-backend/query-orchestrator/src/orchestrator/QueryQueue.js:434:13)
    at QueryQueue.executeQueryInQueue (/usr/src/app/node_modules/@cubejs-backend/query-orchestrator/src/orchestrator/QueryQueue.js:320:21)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at PreAggregationLoader.loadPreAggregationWithKeys (/usr/src/app/node_modules/@cubejs-backend/query-orchestrator/src/orchestrator/PreAggregations.ts:790:7)
    at PreAggregationLoader.loadPreAggregation (/usr/src/app/node_modules/@cubejs-backend/query-orchestrator/src/orchestrator/PreAggregations.ts:573:22)
    at preAggregationPromise (/usr/src/app/node_modules/@cubejs-backend/query-orchestrator/src/orchestrator/PreAggregations.ts:2149:30)
    at QueryOrchestrator.fetchQuery (/usr/src/app/node_modules/@cubejs-backend/query-orchestrator/src/orchestrator/QueryOrchestrator.ts:241:9)
    at OrchestratorApi.executeQuery (/usr/src/app/node_modules/@cubejs-backend/server-core/src/core/OrchestratorApi.ts:98:20)
    at /usr/src/app/node_modules/@cubejs-backend/server-core/src/core/RefreshScheduler.ts:602:13
    at async Promise.all (index 0)
    at RefreshScheduler.refreshPreAggregations (/usr/src/app/node_modules/@cubejs-backend/server-core/src/core/RefreshScheduler.ts:587:5)
    at async Promise.all (index 1)
    at RefreshScheduler.runScheduledRefresh (/usr/src/app/node_modules/@cubejs-backend/server-core/src/core/RefreshScheduler.ts:285:9)

**Increasing the timeout still did not work:** CUBEJS_DB_QUERY_TIMEOUT: 30s

ranjeetranjan avatar Apr 23 '23 12:04 ranjeetranjan

Hey @ranjeetranjan 👋 Is this still a standing issue? If so, does it reproduce?

igorlukanin avatar Dec 01 '23 12:12 igorlukanin

Yes @igorlukanin, I'm still facing the same issue.

Here are the details:

  1. Data size: approx. 35 GB.
  2. Query date range: 1 year.
  3. The dashboard has approx. 20 cards/charts in total.
  4. Some queries fail with the hang-up error.

ranjeetranjan avatar Jan 05 '24 16:01 ranjeetranjan

After another look at the error/data size/concurrency, I suspect that the issue is that ClickHouse doesn't handle the load well and the error is related to ClickHouse rather than Cube. It looks like your ClickHouse instance crashes and then Cube gets no response. Would you kindly try raising an issue with the ClickHouse team? https://github.com/ClickHouse/ClickHouse

igorlukanin avatar Jan 08 '24 13:01 igorlukanin

Hi @igorlukanin! Hope you are well. Based on the log message, ClickHouse thinks the client went away:

<Error> DynamicQueryHandler: Cannot send exception to client: Code: 24. DB::Exception: Cannot write to ostream at offset 1289. (CANNOT_WRITE_TO_OSTREAM), Stack trace (when copying this message, always include the lines below):

This appears when a client disconnects from ClickHouse and we cannot write results to the network. This can happen for a variety of reasons. Here are a few that come to mind.

  1. Keepalive is not enabled on the TCP connection.
  2. The connection goes through a load balancer and the load balancer dropped the connection, perhaps due to a timeout.
  3. The application received an error and disconnected.
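On the server side, one related knob worth checking is ClickHouse's keep_alive_timeout setting, which controls how many seconds an idle HTTP connection is kept open before the server closes it (older builds default to a small value). A minimal sketch of raising it, assuming the stock config layout under /etc/clickhouse-server/; the value 60 is illustrative only, not a recommendation:

```xml
<!-- /etc/clickhouse-server/config.xml (or a config.d override file) -->
<clickhouse>
    <!-- seconds an idle HTTP keep-alive connection stays open -->
    <keep_alive_timeout>60</keep_alive_timeout>
</clickhouse>
```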

Is it possible to capture the query that is failing and run it by hand using curl? It would be instructive to see whether another error is at work. Here's an example:

time curl http://localhost:8123?query=select+1
1

real    0m0.019s
user    0m0.009s
sys     0m0.004s

hodgesrm avatar Apr 04 '24 21:04 hodgesrm

@hodgesrm Thanks for your views.

The connection goes through a load balancer and the load balancer dropped the connection, perhaps due to a timeout.

We are not using any load balancer in our case. The worker node directly connects to the clickhouse pod within the Kubernetes.

Based on your suggestion, I ran the query on which the application times out:

time curl --user user:pass 'http://clickhouse-host:8123?query=SELECT%20max%28timestamp%29%20FROM%20db_name.example_table%20WHERE%20toDate%28timestamp%29%20BETWEEN%20%272023-03-01%27%20AND%20%272024-04-04%27'

real	0m4.912s
user	0m0.008s
sys	0m0.000s
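A single query completing in ~5 s doesn't rule out failures under concurrent load, which is what the dashboard generates. A sketch of replaying the same query in parallel (assumptions: python3 and curl are available; user:pass and the CLICKHOUSE_URL value are placeholders for the real credentials and host; the loop only hits the network when CLICKHOUSE_URL is set):

```shell
# Build the URL-encoded query once (same encoding as the curl test above).
sql="SELECT max(timestamp) FROM db_name.example_table WHERE toDate(timestamp) BETWEEN '2023-03-01' AND '2024-04-04'"
encoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' "$sql")
echo "$encoded"

# Fire ~20 parallel requests to mimic the dashboard's 20 cards and report
# each HTTP status. Only runs when a server URL is explicitly provided.
if [ -n "${CLICKHOUSE_URL:-}" ]; then
  for i in $(seq 1 20); do
    curl -sS --user user:pass "$CLICKHOUSE_URL/?query=$encoded" \
      -o /dev/null -w "req $i: HTTP %{http_code}\n" &
  done
  wait
fi
```

If some of the parallel requests come back with non-200 statuses (or curl errors) while the serial run succeeds, that would point at a load- or connection-handling problem rather than the query itself.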

ranjeetranjan avatar Apr 05 '24 07:04 ranjeetranjan

@igorlukanin Your response will be highly appreciated.

ranjeetranjan avatar Apr 19 '24 06:04 ranjeetranjan

@ranjeetranjan What is your CUBEJS_CONCURRENCY env var setting? Could you please check what happens if you increase or decrease it?
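For reference, a minimal sketch of where this knob lives, assuming a .env-based setup like the one shown later in this thread; the values are illustrative only, and the pool-size variable is included only because concurrency and pool size are typically tuned together:

```
# Illustrative values only — tune for your workload
CUBEJS_CONCURRENCY=5
# Database connection pool size; usually sized above the concurrency setting
CUBEJS_DB_MAX_POOL=20
```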

igorlukanin avatar Apr 20 '24 22:04 igorlukanin

@igorlukanin Thanks for your response. I did not set CUBEJS_CONCURRENCY at all.

ranjeetranjan avatar Apr 21 '24 06:04 ranjeetranjan

Could you please check what happens if you increase or decrease it?

igorlukanin avatar May 14 '24 10:05 igorlukanin

I tried increasing and decreasing it, but nothing worked. When I tried Cube Cloud, it worked perfectly with the same DB and schema; we did not find any issues there.

Here is the .env configuration of my local development setup:

##Timeout
CUBESTORE_QUERY_TIMEOUT=60s
CUBEJS_DB_QUERY_TIMEOUT=10m

#Other
NODE_OPTIONS="--max-old-space-size=12000"

I must be missing something minor.

ranjeetranjan avatar Jul 01 '24 19:07 ranjeetranjan