
What's an efficient way to insert CSV file data into existing table in ClickHouse

Open • shabir1 opened this issue Jul 13 '23 • 11 comments

What's an efficient way to insert CSV file data into an existing table in ClickHouse? I need an optimal way to insert CSV data into existing tables, in terms of both memory and time. The data should be appended to the existing table each time, but the code below replaces the existing data.

ClickHouseClient.load(
        ClickHouseNode.of(uri),
        tableName,
        // the file name will be parsed later, so you don't have to specify compression and format
        ClickHouseFile.of(filePath, ClickHouseCompression.NONE, 0, ClickHouseFormat.CSV)).get();

shabir1 • Jul 13 '23 09:07

I tried this as well, but same issue: it overwrites the data.

ClickHousePassThruStream stream = ClickHousePassThruStream.of(
        new FileInputStream(csvFilePath),
        ClickHouseCompression.NONE,
        0,
        ClickHouseFormat.CSVWithNames);
CompletableFuture<ClickHouseResponseSummary> load = ClickHouseClient.load(server, tableName, stream);
ClickHouseResponseSummary clickHouseResponse = load.get();
System.out.println(clickHouseResponse.getWrittenRows());

shabir1 • Jul 13 '23 09:07


Hi @shabir1, you probably want to start with INSERT INTO ... FROM INFILE. Please check the example below and this test.

Properties props = new Properties();
props.setProperty("localFile", "true");
try (ClickHouseConnection conn = newConnection(props);
    PreparedStatement stmt = conn.prepareStatement(
        "insert into test_load_infile_with_params from infile ? format CSV")) {
    stmt.setString(1, "/tmp/my.csv");
    stmt.executeUpdate();
}
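
Note that newConnection(props) above is a test helper; with plain JDBC the equivalent is roughly as follows (a sketch, with an illustrative URL and database):

// Sketch: the same INSERT ... FROM INFILE via plain JDBC.
// Requires java.sql.Connection, java.sql.DriverManager, java.sql.PreparedStatement.
Properties props = new Properties();
props.setProperty("localFile", "true"); // read the file from the client's local disk
try (Connection conn = DriverManager.getConnection(
            "jdbc:clickhouse://localhost:8123/default", props);
     PreparedStatement stmt = conn.prepareStatement(
            "insert into test_load_infile_with_params from infile ? format CSV")) {
    stmt.setString(1, "/tmp/my.csv");
    stmt.executeUpdate();
}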

I tried this as well, but same issue: it overwrites the data.

Why are you certain the data was overwritten, given that the code only issues INSERT statements? Which table engine are you using? You can check system.query_log to review the executed queries.
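
For example, with a java.sql.Connection like the one in the snippet above (a sketch; the columns come from the standard system.query_log table):

// Sketch: review recent queries against the table via system.query_log.
// Requires java.sql.Statement and java.sql.ResultSet.
try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(
            "SELECT event_time, written_rows, query FROM system.query_log " +
            "WHERE type = 'QueryFinish' AND query ILIKE '%test_load_infile_with_params%' " +
            "ORDER BY event_time DESC LIMIT 10")) {
    while (rs.next()) {
        System.out.println(rs.getTimestamp(1) + " | " + rs.getLong(2) + " rows | " + rs.getString(3));
    }
}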

zhicwu • Jul 13 '23 11:07

@zhicwu Actually, I have very large files (more than 2 GB). I want an efficient way to upload the CSV data into the existing table without memory issues and in as little time as possible. Is there any way to load data from disk rather than memory, like streaming the data?

shabir1 • Jul 13 '23 14:07

@zhicwu And sometimes it throws an error when the file takes more than 30 seconds to upload.

java.util.concurrent.ExecutionException: com.clickhouse.client.ClickHouseException: Code: 159. Execution timed out, server ClickHouseNode [uri=http://localhost:8123/ShabirBhat_project]@497090432
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
	at com.datoin.ch.CHclientTest.insert(CHclientTest.java:54)
	at com.datoin.ch.CHclientTest.main(CHclientTest.java:36)
Caused by: com.clickhouse.client.ClickHouseException: Code: 159. Execution timed out, server ClickHouseNode [uri=http://localhost:8123/ShabirBhat_project]@497090432
	at com.clickhouse.client.ClickHouseException.of(ClickHouseException.java:164)
	at com.clickhouse.client.ClickHouseClient.executeAndWait(ClickHouseClient.java:1068)
	at com.clickhouse.client.ClickHouseRequest.executeAndWait(ClickHouseRequest.java:2154)
	at com.clickhouse.client.ClickHouseClient.lambda$load$8(ClickHouseClient.java:490)
	at com.clickhouse.client.ClickHouseClient.run(ClickHouseClient.java:232)
	at com.clickhouse.client.ClickHouseClient.lambda$submit$4(ClickHouseClient.java:284)
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
	at com.clickhouse.client.ClickHouseClient.executeAndWait(ClickHouseClient.java:1056)
	... 8 more

shabir1 • Jul 13 '23 14:07

Is there any way to load data from disk rather than memory, like streaming the data?

Using the INSERT INTO ... FROM INFILE statement is a streaming approach. While it's not zero-copy at this point, it is quite similar to using curl to load a file into ClickHouse.

And sometimes it throws an error when the file takes time to upload more than 30 secs.

Please consider increasing the socket_timeout value, which defaults to 30 seconds.
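
With the Java client the option can be set on the node (in milliseconds; the value below is illustrative), and the same name also works as a JDBC connection property:

// Sketch: raise socket_timeout from the 30s default, e.g. to 5 minutes.
ClickHouseNode server = ClickHouseNode.builder()
        .host("localhost")
        .port(ClickHouseProtocol.HTTP, 8123)
        .addOption("socket_timeout", String.valueOf(300 * 1000))
        .build();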

zhicwu • Jul 14 '23 08:07

@zhicwu Could you please share a code snippet of INSERT INTO ... FROM INFILE using the Java ClickHouseClient?

shabir1 • Jul 14 '23 09:07

Could you please share a code snippet of INSERT INTO ... FROM INFILE using the Java ClickHouseClient?

@shabir1 PTAL at this example https://github.com/ClickHouse/clickhouse-java/blob/d6eff62191c87c37867c51f7110bc4c943a77ec8/clickhouse-client/src/test/java/com/clickhouse/client/ClientIntegrationTest.java#L1650

The current implementation doesn't allow providing a schema for the destination table, which might help avoid problems like https://github.com/ClickHouse/clickhouse-java/issues/1413. WDYT @mzitnik @zhicwu?

mshustov • Aug 15 '23 09:08

@mshustov How do I set memory-related settings in the method below, e.g. max_memory_usage_for_user, max_memory_usage, etc.?

static long insert(String tableName, String csvFilePath) throws Exception {
    ClickHouseNode server = ClickHouseNode.builder()
            .host(System.getProperty("chHost", "localhost"))
            .port(ClickHouseProtocol.HTTP, Integer.getInteger("chPort", 8123))
            .database("ShabirBhat_project")
            .addOption("connect_timeout", String.valueOf(1350 * 1000))
            .addOption("socket_timeout", String.valueOf(1350 * 1000))
            .credentials(ClickHouseCredentials.fromUserAndPassword(
                    System.getProperty("chUser", "default"), System.getProperty("chPassword", "")))
            .build();
    try (BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream(csvFilePath))) {
        ClickHousePassThruStream stream = ClickHousePassThruStream.of(
                bufferedInputStream,
                ClickHouseCompression.NONE,
                0,
                ClickHouseFormat.CSVWithNames);
        CompletableFuture<ClickHouseResponseSummary> load = ClickHouseClient.load(server, tableName, stream);
        ClickHouseResponseSummary clickHouseResponse = load.get();
        return clickHouseResponse.getWrittenRows();
    } catch (Exception e) {
        System.out.println("Failed to load");
        e.printStackTrace();
        throw e; // rethrow the original exception instead of discarding its cause
    }
}

shabir1 • Aug 15 '23 14:08
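
For reference, one way to pass such server-side settings with the Java client is per request, rather than through the ClickHouseClient.load shortcut. A minimal sketch, reusing server, tableName, and csvFilePath from the method above; the setting name is a standard ClickHouse setting and the value is illustrative:

// Sketch: attach settings such as max_memory_usage to the insert request.
try (ClickHouseClient client = ClickHouseClient.newInstance(server.getProtocol())) {
    ClickHouseRequest.Mutation request = client.read(server)
            .write()
            .table(tableName)
            .format(ClickHouseFormat.CSVWithNames)
            .set("max_memory_usage", "3000000000") // ~3 GB per-query cap
            .data(new FileInputStream(csvFilePath));
    try (ClickHouseResponse response = request.executeAndWait()) {
        return response.getSummary().getWrittenRows();
    }
}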

@mshustov On huge files, the above method fails with memory issues; most failures occurred on files containing more than 300 columns.

ERROR [2023-07-27 06:20:36,451] com.datoin.library.dataset.sql.utility.commons.abstracts.sql.AbstractDBEngine: Failed to load dataset
java.util.concurrent.ExecutionException: com.clickhouse.client.ClickHouseException: Code: 241. DB::Exception: Memory limit (total) exceeded:
would use 3.36 GiB (attempt to allocate chunk of 4360384 bytes), maximum: 3.24 GiB. OvercommitTracker decision: Query was selected to stop by
 OvercommitTracker. (MEMORY_LIMIT_EXCEEDED) (version 23.6.2.18 (official build))
, server ClickHouseNode [uri=http://localhost:8123/cayleytest95814, options={connect_timeout=5000000,socket_timeout=5000000}]@75218225
        at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)

shabir1 • Aug 15 '23 14:08

If the file size is 5 GB but the number of columns is small (fewer than 10), the above method works fine. But if the file size is just 200 MB and the number of columns is huge (more than 800), it fails with memory issues.

shabir1 • Aug 15 '23 14:08
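
A possible mitigation to experiment with (not confirmed in this thread): very wide rows make each parsed block heavier, so disabling parallel parsing and forming smaller blocks during the insert may reduce peak server memory. A sketch, applied to the same request as in the sketch above; the setting names are standard ClickHouse settings and the values are illustrative:

// Sketch: settings that may lower server-side memory for wide-CSV inserts.
request.set("input_format_parallel_parsing", 0)        // parse the CSV sequentially
        .set("min_insert_block_size_rows", 10000)      // form smaller blocks by rows
        .set("min_insert_block_size_bytes", 67108864); // ...and by bytes (64 MiB)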