clickhouse-java

Spark JDBC: DB::Exception: Cannot convert string '2024-09-10 22:58:20.0' to type DateTime

maver1ck opened this issue 1 year ago · 1 comment

Describe the bug

I'm trying to query data from ClickHouse using the Spark JDBC connector, filtering on timestamp columns. The query fails with: Cannot convert string '2024-09-10 22:58:20.0' to type DateTime. (TYPE_MISMATCH)
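Note that the literal in the error message does not appear anywhere in my code: Spark's JDBC filter pushdown renders java.sql.Timestamp values via Timestamp.toString(), which always appends fractional seconds, so ClickHouse receives '2024-09-10 22:58:20.0' and its default DateTime parser rejects the trailing .0. A quick sketch of where the .0 comes from, via py4j against the SparkSession set up in the code example below (the pushdown detail is my reading of Spark's JDBCRDD source):

# Illustration only: java.sql.Timestamp prints fractional seconds even when
# they are zero; this is the exact string form used in the pushed-down filter.
ts = spark._jvm.java.sql.Timestamp.valueOf("2024-09-10 22:58:20")
print(str(ts))  # -> 2024-09-10 22:58:20.0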

Steps to reproduce

  1. Create the ClickHouse tables (DDL under "Configuration" below)
  2. Run the following Spark code

Expected behaviour

  • The query runs successfully

Code example

from pyspark.sql import SparkSession

# Set up the SparkSession to include ClickHouse as a custom catalog
spark = SparkSession.builder \
    .appName("ClickHouse Catalog Example") \
    .config("spark.jars.packages", "com.clickhouse:clickhouse-jdbc:0.6.5,org.apache.httpcomponents.client5:httpclient5:5.3.1") \
    .config("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog") \
    .config("spark.sql.catalog.clickhouse.host", "clickhouse-kim.clickhouse.svc") \
    .config("spark.sql.catalog.clickhouse.http_port", "8123") \
    .config("spark.sql.catalog.clickhouse.database", "telemetry") \
    .config("spark.sql.catalog.clickhouse.driver", "com.clickhouse.jdbc.ClickHouseDriver") \
    .config("spark.sql.catalog.clickhouse.user", "admin") \
    .config("spark.sql.catalog.clickhouse.password", "admin") \
    .getOrCreate()

jdbc_df = spark.read.jdbc(
    "jdbc:clickhouse://clickhouse-kim.clickhouse.svc:8123/telemetry",
    "telemetry.reference_peaks",
    properties={"user": "admin", "password": "admin",
                "driver": "com.clickhouse.jdbc.ClickHouseDriver"},
)
jdbc_df.createOrReplaceTempView("reference_peaks")

# Parameters ts_start and ts_end (example values)
ts_start = '2024-09-10 23:00:00'
ts_end = '2024-09-10 23:04:59'

# Spark SQL query
query = f"""
    SELECT *
    FROM reference_peaks
    WHERE tts > to_timestamp('{ts_start}') - INTERVAL 100 SECOND
    AND tts < to_timestamp('{ts_end}') + INTERVAL 100 SECOND
    AND ts >= to_timestamp('{ts_start}')
    AND ts < to_timestamp('{ts_end}')
"""

# Execute the query using Spark SQL
result_df = spark.sql(query)

# Trigger execution (count() forces the pushed-down JDBC query to run)
result_df.count()
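
A workaround that avoids the pushdown entirely is to pass the time filter to ClickHouse as a subquery in the JDBC table argument (standard Spark JDBC behaviour), so the literals stay in ClickHouse's own syntax and never go through Spark's timestamp rendering. A sketch reusing the names from the example above:

# Workaround sketch: the WHERE clause runs inside ClickHouse, so no
# java.sql.Timestamp literal is ever generated; toDateTime() is
# ClickHouse's own parser.
subquery = f"""
    (SELECT *
     FROM telemetry.reference_peaks
     WHERE tts > toDateTime('{ts_start}') - INTERVAL 100 SECOND
       AND tts < toDateTime('{ts_end}') + INTERVAL 100 SECOND
       AND ts >= toDateTime('{ts_start}')
       AND ts < toDateTime('{ts_end}')) AS filtered
"""

filtered_df = spark.read.jdbc(
    "jdbc:clickhouse://clickhouse-kim.clickhouse.svc:8123/telemetry",
    subquery,
    properties={"user": "admin", "password": "admin",
                "driver": "com.clickhouse.jdbc.ClickHouseDriver"},
)
filtered_df.count()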

Error log


24/10/04 10:48:51 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.sql.BatchUpdateException: Code: 53. DB::Exception: Cannot convert string '2024-09-10 22:58:20.0' to type DateTime. (TYPE_MISMATCH) (version 24.9.1.3278 (official build))

	at com.clickhouse.jdbc.SqlExceptionUtils.batchUpdateError(SqlExceptionUtils.java:107)
	at com.clickhouse.jdbc.internal.SqlBasedPreparedStatement.executeAny(SqlBasedPreparedStatement.java:223)
	at com.clickhouse.jdbc.internal.SqlBasedPreparedStatement.executeQuery(SqlBasedPreparedStatement.java:286)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:275)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
24/10/04 10:48:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.135 executor driver): java.sql.BatchUpdateException: Code: 53. DB::Exception: Cannot convert string '2024-09-10 22:58:20.0' to type DateTime. (TYPE_MISMATCH) (version 24.9.1.3278 (official build))
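
If the pushed-down filter itself is wanted, it may also work to make the server tolerant of the fractional literal. date_time_input_format = 'best_effort' is an existing ClickHouse setting that accepts strings like '2024-09-10 22:58:20.0'; whether it can be forwarded through the driver's custom_settings URL parameter as sketched here is an assumption about clickhouse-jdbc 0.6.x:

# Sketch: ask ClickHouse to parse DateTime strings leniently for this session.
# Passing custom_settings in the URL is an assumption about this driver version.
url = ("jdbc:clickhouse://clickhouse-kim.clickhouse.svc:8123/telemetry"
       "?custom_settings=date_time_input_format=best_effort")

lenient_df = spark.read.jdbc(
    url,
    "telemetry.reference_peaks",
    properties={"user": "admin", "password": "admin",
                "driver": "com.clickhouse.jdbc.ClickHouseDriver"},
)

Declaring the timestamp columns as DateTime64(3) instead of DateTime would also sidestep the parse failure, since DateTime64 accepts fractional seconds, at the cost of a schema change.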

Configuration

Environment

  • Client version: 0.6.5
  • OS: macOS

ClickHouse server

  • ClickHouse Server version: 24.9.1.3278
  • CREATE TABLE statements for tables involved:
CREATE TABLE telemetry.reference_peaks_shard ON CLUSTER '{cluster}'
(
    hash_1 String,
    hash_2 String,
    ts DateTime,
    offset Int32,
    channel Int32,
    station_id String,
    tts DateTime,
    batchid Int32,
    org_ts DateTime
)
ENGINE = MergeTree
ORDER BY ts;

CREATE TABLE telemetry.reference_peaks ON CLUSTER '{cluster}' AS telemetry.reference_peaks_shard
ENGINE = Distributed('{cluster}', 'telemetry', 'reference_peaks_shard', channel);

maver1ck · Oct 04 '24

I have the same issue

Environment

  • Client version: 0.8.2
  • OS: Linux (Kubernetes), ARM node

ClickHouse server

  • ClickHouse Server version: 24.8.0
  • Failing query: SELECT id ... FROM ... WHERE create_time >= '2025-03-18 16:18:43.727';

najibulloShapoatov · Mar 28 '25