
Clickhouse

Open davidducos opened this issue 2 years ago • 7 comments

Is it possible to export from and import into Clickhouse?

This is just to perform an analysis and see whether it is possible.

davidducos avatar Nov 10 '22 18:11 davidducos

I was not able to review this deeply. I used a docker container, and after fixing user privileges following https://www.bytebase.com/blog/how-to-run-clickhouse-with-docker-and-connect-using-mysql-client, I was able to run some tests and found the following:

Imports:

  • Engine information needs to be replaced (for instance: ENGINE=MergeTree ORDER BY id); see the sketch below
  • Values should be surrounded by single quotes (very easy)

Exports:

  • we will need to review how we collect the queries that are running
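
A minimal sketch of that engine rewrite on a hypothetical table (assuming a plain MergeTree target; the names and types are illustrative, not actual mydumper output):

-- MySQL DDL as dumped:
CREATE TABLE `t` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Rough Clickhouse equivalent:
CREATE TABLE t (
  id Int32,
  name Nullable(String)
) ENGINE = MergeTree ORDER BY id;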

I was testing MySQL -> Clickhouse, but I think that we should also test Clickhouse -> Clickhouse.

davidducos avatar Mar 14 '23 13:03 davidducos

@davidducos I think I can provide some additional information here since I have been testing mydumper to load MySQL data into Clickhouse.

To overcome the need for single quotes around strings, I ran mydumper with the --fields-enclosed-by "'" argument, which, if I understood correctly, was devised for CSV dumping but, after https://github.com/mydumper/mydumper/commit/874fa7e473d7b4ecd596dcb32a2eef680b9be164, also applies to SQL dumps. I'm not sure whether this was deliberate or a happy coincidence that allowed me to proceed with my tests.
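
For reference, a dump invocation along those lines might look like this (host, user, database and output directory are placeholders):

# Dump mydb with string values enclosed in single quotes
mydumper -h mysql-host -u backup_user -B mydb --fields-enclosed-by "'" -o /backups/mydb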

Clickhouse supports the MySQLDump input format for INSERT queries, but it expects something closer to what mysqldump outputs. It implements a schema inference mechanism based on the CREATE TABLE statement of the dump, if one is present before the INSERT INTO ... VALUES ( ... ) statement. It even matches column names from the dump to the destination table, omitting those that are not present in the destination, and there is no need to manipulate the ENGINE parameter. So, if you prepare the data chunks from mydumper by prepending the "schema" file, you can easily import data into Clickhouse like this:

# Create a suitable Clickhouse table, even with fewer columns than the dump
clickhouse-client -q "
CREATE TABLE IF NOT EXISTS domains ON CLUSTER '{cluster}' (
      id_domain Int64,
      domain String
)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/{database}/{table}', '{replica}')
PRIMARY KEY (id_domain, domain)
"

chunks=$(ls domains.*.sql.gz)

for chunk_file in $chunks; do
    # Prepare the data chunk by prepending the schema
    zcat domains-schema.sql.gz "${chunk_file}" > domains.sql
    # Insert data into clickhouse
    clickhouse-client -q "INSERT INTO domains FROM INFILE 'domains.sql' FORMAT MySQLDump"
done

I guess you could even do it with a pipe, but I have not tested it:

zcat domains-schema.sql.gz "${chunk_file}" | clickhouse-client -q "INSERT INTO domains FORMAT MySQLDump"

The dump domains.sql that I tested this with looks like this:

/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;

/*!40103 SET TIME_ZONE='+00:00' */;
CREATE TABLE `world_domains` (
  `id_domain` int(11) NOT NULL AUTO_INCREMENT,
  `domain` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `updated` int(11) NOT NULL,
  PRIMARY KEY (`id_domain`),
  FULLTEXT KEY `domain_fulltext_index` (`domain`)
) ENGINE=InnoDB AUTO_INCREMENT=44 DEFAULT CHARSET=latin1 COMMENT='Domains';
/*!40101 SET NAMES binary*/;
/*!40014 SET FOREIGN_KEY_CHECKS=0*/;
/*!40103 SET TIME_ZONE='+00:00' */;
INSERT INTO `world_domains` VALUES
,(39,'artsource.ch',1650270901)
,(42,'nscenter.eu',1650270901)
;

edit: corrected s/clickhouse -q/clickhouse-client -q/

Feder1co5oave avatar Jun 09 '23 18:06 Feder1co5oave

> I'm not sure if this was deliberate, or a happy coincidence that allowed me to proceed in my tests.

It is not a happy coincidence. I want mydumper/myloader to be as flexible as possible.

> Clickhouse supports the MySQLDump input format for INSERT queries [...] So, if you prepare the data chunks from mydumper by prepending the "schema" file, you can easily import data into Clickhouse like this: [...]

What would be very useful is a script that could translate the CREATE TABLE from MySQL to Clickhouse. The user would run mydumper on the MySQL database, then run the script to rewrite all the -schema.sql files, and finally myloader would import the schema and data files.
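
A very rough sketch of such a script, assuming GNU sed and a tiny type map (it ignores key definitions, charset clauses and most types; a real implementation would need a proper SQL parser):

# Rewrite every mydumper -schema.sql file into a Clickhouse-flavoured copy
for f in data/*-schema.sql; do
    sed -E \
        -e 's/`//g' \
        -e 's/\b(big)?int\([0-9]+\)/Int64/g' \
        -e 's/\bvarchar\([0-9]+\)[^,)]*/String/g' \
        -e 's/ (NOT NULL|AUTO_INCREMENT|DEFAULT NULL)//g' \
        -e 's/ENGINE=[^;]*/ENGINE = MergeTree ORDER BY tuple()/' \
        "$f" > "${f%.sql}.clickhouse.sql"
    # PRIMARY KEY / FULLTEXT KEY lines (and the trailing comma before them)
    # would still need to be dropped or translated by hand
done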

> (I guess you could even do it with a pipe, but I have not tested it:) zcat domains-schema.sql.gz "${chunk_file}" | clickhouse -q "INSERT INTO domains FORMAT MySQLDump"

Using clickhouse -q breaks the idea of using myloader. We could implement something similar to LOAD DATA: execute the INSERT INTO <table_name> FORMAT MySQLDump and keep the INSERT statements in MySQL format in the .dat file. That is doable, and we can test whether it works as a pipe, since myloader is already doing that.
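
As a sketch of that pipe idea (table and file names are hypothetical, and it assumes clickhouse-client is available on the loading host):

zcat data/mydb.mytable.00000.dat.gz | clickhouse-client -q "INSERT INTO mytable FORMAT MySQLDump"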

davidducos avatar Jun 09 '23 18:06 davidducos

Hi @Feder1co5oave,

The initial steps to perform backups on MySQL/Percona/MariaDB in a format that Clickhouse will be able to import have started! Hopefully there will be a prerelease soon that you will be able to test.

davidducos avatar Jul 26 '24 16:07 davidducos

Oh shit... After a few tests and some reading, I realized that we are not going to be able to use myloader with this format.

davidducos avatar Jul 26 '24 19:07 davidducos

Hi @davidducos thank you for keeping me in the loop regarding this! Would you like to elaborate a bit on what you found that will make using myloader impossible?

Feder1co5oave avatar Jul 28 '24 13:07 Feder1co5oave

Hi @Feder1co5oave, the problem is that myloader connects to Clickhouse remotely using the MySQL libraries, and "FORMAT MySQLDump" seems to work only locally and/or through the Clickhouse client. Using myloader I get:

** Message: 12:20:04.658: Thread 1: restoring sakila.david part 1 of 1 from sakila.david.00000.sql | Progress 1 of 1. Tables 0 of 1 completed
** Message: 12:20:04.658: Thread 3: Data import ended
** Message: 12:20:04.658: Thread 4: Data import ended
** Message: 12:20:04.658: Thread 2: Data import ended
** (myloader:13546): WARNING **: 12:20:04.658: Connection 216 - ERROR 62: Code: 62. DB::Exception: Empty query. (SYNTAX_ERROR) (version 24.6.2.17 (official build))
** (myloader:13546): WARNING **: 12:20:04.659: Connection 216 - ERROR 62: Code: 62. DB::Exception: Empty query. (SYNTAX_ERROR) (version 24.6.2.17 (official build))
** (myloader:13546): WARNING **: 12:20:04.659: Connection 216 - ERROR 62: Code: 62. DB::Exception: Empty query. (SYNTAX_ERROR) (version 24.6.2.17 (official build))
** (myloader:13546): WARNING **: 12:20:04.659: Set session failed: /*!40101 SET NAMES binary*/
** (myloader:13546): WARNING **: 12:20:04.659: Set session failed: /*!40014 SET FOREIGN_KEY_CHECKS=0*/
** (myloader:13546): WARNING **: 12:20:04.659: Set session failed: /*!40103 SET TIME_ZONE='+00:00' */
(myloader:13546): GLib-ERROR **: 12:20:04.659: ../../../glib/gmem.c:108: failed to allocate 18446650248716671671 bytes

We can ignore the myloader errors, as I'm running it with this command:

./myloader -d data/ -o -h 172.17.0.6 -u root  --serialized-table-creation --defaults-extra-file=./mydumper.cnf -v 3 --port 9004 --ignore-errors 62 -o -q 0

The GLib-ERROR is weird. However, during the analysis I tried:

$ mysql -h 172.17.0.6 --port 9004 -u root sakila
mysql> INSERT INTO `david` FROM INFILE 'data/sakila.david.00000.dat' FORMAT MySQLDump;
ERROR 78 (00000): Code: 78. DB::Exception: Query has infile and was send directly to server. (UNKNOWN_TYPE_OF_QUERY) (version 24.6.2.17 (official build))

I had no issue when I executed the INSERT locally on the Clickhouse container. Please let me know if I'm doing something wrong or if any workaround exists.
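
For comparison, the local variant that did work on the container was presumably along these lines, since clickhouse-client handles FROM INFILE on the client side:

clickhouse-client -d sakila -q "INSERT INTO david FROM INFILE 'data/sakila.david.00000.dat' FORMAT MySQLDump"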

davidducos avatar Jul 29 '24 12:07 davidducos