
[Test] [Test Hive Connector V2] Test the Hive Source Connector V2 and record the problems encountered in the test

Status: Open · EricJoy2048 opened this issue 2 years ago · 5 comments

SeaTunnel version: dev
Hadoop version: 2.10.2
Flink version: 1.12.7
Spark version: 2.4.3 (Scala 2.11.12)

Problems found to be fixed

  • [x] https://github.com/apache/incubator-seatunnel/issues/2792
  • [x] https://github.com/apache/incubator-seatunnel/issues/2794
  • [x] https://github.com/apache/incubator-seatunnel/issues/2799
  • [x] https://github.com/apache/incubator-seatunnel/issues/2803
  • [x] https://github.com/apache/incubator-seatunnel/issues/2811
  • [x] https://github.com/apache/incubator-seatunnel/issues/2812
  • [x] https://github.com/apache/incubator-seatunnel/issues/2815
  • [x] https://github.com/apache/incubator-seatunnel/issues/2473
  • [x] https://github.com/apache/incubator-seatunnel/issues/2837
  • [x] https://github.com/apache/incubator-seatunnel/issues/2868
  • [x] https://github.com/apache/incubator-seatunnel/issues/2871
  • [x] https://github.com/apache/incubator-seatunnel/issues/2894

Hive Source Connector

1. Test text file format table

create table test_hive_source(
    test_tinyint    TINYINT,
    test_smallint   SMALLINT,
    test_int        INT,
    test_bigint     BIGINT,
    test_boolean    BOOLEAN,
    test_float      FLOAT,
    test_double     DOUBLE,
    test_string     STRING,
    test_binary     BINARY,
    test_timestamp  TIMESTAMP,
    test_decimal    DECIMAL(8,2),
    test_char       CHAR(64),
    test_varchar    VARCHAR(64),
    test_date       DATE,
    test_array      ARRAY<INT>,
    test_map        MAP<STRING, FLOAT>,
    test_struct     STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (test_par1 STRING, test_par2 STRING);

-- insert 10 rows into partition (test_par1='par1', test_par2='par2');
-- the single-row statement below was executed 10 times
insert into table test_hive_source partition(test_par1='par1', test_par2='par2') select 
1 as test_tinyint,
1 as test_smallint,
100 as test_int,
40000000000 as test_bigint,
true as test_boolean,
1.01 as test_float,
1.002 as test_double,
'gan' as test_string,
'DataDataData' as test_binary,
current_timestamp() as test_timestamp,
83.2 as test_decimal,
'char64' as test_char,
'varchar64' as test_varchar,
cast(substring(from_unixtime(unix_timestamp(cast('2016-01-01' as string), 'yyyy-MM-dd')),1,10) as date) as test_date,
array(1,2) as test_array,
map("name",cast('1.11' as float),"age",cast('1.11' as float)) as test_map,
NAMED_STRUCT('street', 'London', 'city','W1a9JF','state','Finished','zip', 123) as test_struct;
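
The statement above produces a single row per execution, so reaching 10 rows means it was run 10 times. The partition can then be checked with a simple count (a minimal verification sketch against the table defined above):

-- confirm the freshly loaded partition holds the expected 10 rows
select count(*) from test_hive_source
where test_par1 = 'par1' and test_par2 = 'par2';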

-- insert 10 rows into partition (test_par1='par1', test_par2='par2_1')
-- by copying the 10 rows already in the table
insert into table test_hive_source partition(test_par1='par1', test_par2='par2_1') select
test_tinyint,
test_smallint,
test_int,
test_bigint,
test_boolean,
test_float,
test_double,
test_string,
test_binary,
test_timestamp,
test_decimal,
test_char,
test_varchar,
test_date,
test_array,
test_map,
test_struct from test_hive_source;

-- insert 20 rows into partition (test_par1='par1_1', test_par2='par2_2');
-- the source table holds 20 rows at this point
insert into table test_hive_source partition(test_par1='par1_1', test_par2='par2_2') select
test_tinyint,
test_smallint,
test_int,
test_bigint,
test_boolean,
test_float,
test_double,
test_string,
test_binary,
test_timestamp,
test_decimal,
test_char,
test_varchar,
test_date,
test_array,
test_map,
test_struct from test_hive_source;

Total rows: 40

rows    test_par1    test_par2
10      par1         par2
10      par1         par2_1
20      par1_1       par2_2
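
The per-partition counts above can be reproduced with a simple aggregation over the partition columns (a verification sketch against the same table):

-- count rows per partition; should return the three rows shown above
select count(*) as cnt, test_par1, test_par2
from test_hive_source
group by test_par1, test_par2;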

1.1 Test in Flink Engine

1.1.1 Job Config File

env {
  # You can set flink configuration here
  execution.parallelism = 3
  job.name="test_hive_source_to_console"
}

source {
  # This is an example input plugin **only for test and demonstrate the feature input plugin**

  Hive {
    table_name = "test_hive.test_hive_source"
    metastore_uri = "thrift://ctyun7:9083"
  }

  # If you would like to get more information about how to configure seatunnel and see full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/source-plugins/Fake
}

transform {

}

sink {
  # choose stdout output plugin to output data to console
  Console {
  }

  # If you would like to get more information about how to configure seatunnel and see full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/sink-plugins/Console
}

1.1.2 Submit Job Command

sh start-seatunnel-flink-connector-v2.sh --config ../config/flink_hive_to_console.conf 

1.1.3 Check Result in Flink Job Log

Some fields in the data are lost; this can be fixed in the future, see #2473. The row count is correct.

[screenshot: Flink job log output]

1.2 Test in Spark Engine

1.2.1 Job Config File

env {
  # You can set spark configuration here
  source.parallelism = 3
  job.name="test_hive_source_to_console"
}

source {
  # This is an example input plugin **only for test and demonstrate the feature input plugin**

  Hive {
    table_name = "test_hive.test_hive_source"
    metastore_uri = "thrift://ctyun7:9083"
  }

  # If you would like to get more information about how to configure seatunnel and see full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/source-plugins/Fake
}

transform {

}

sink {
  # choose stdout output plugin to output data to console
  Console {
  }

  # If you would like to get more information about how to configure seatunnel and see full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/sink-plugins/Console
}

1.2.2 Submit Job Command (--deploy-mode client --master local)

sh start-seatunnel-spark-connector-v2.sh --config ../config/spark_hive_to_console.conf --deploy-mode client --master local

1.2.3 Check Data Result

Some fields in the data are lost; this can be fixed in the future. The row count is correct.

[screenshot: Spark console output, local mode]

1.2.4 Submit Job Command (--deploy-mode client --master yarn)

sh start-seatunnel-spark-connector-v2.sh --config ../config/spark_hive_to_console.conf --deploy-mode client --master yarn

1.2.5 Check Data Result

[screenshot: Spark console output, yarn client mode]

2. Test ORC file format table

create table test_hive_source_orc(
    test_tinyint    TINYINT,
    test_smallint   SMALLINT,
    test_int        INT,
    test_bigint     BIGINT,
    test_boolean    BOOLEAN,
    test_float      FLOAT,
    test_double     DOUBLE,
    test_string     STRING,
    test_binary     BINARY,
    test_timestamp  TIMESTAMP,
    test_decimal    DECIMAL(8,2),
    test_char       CHAR(64),
    test_varchar    VARCHAR(64),
    test_date       DATE,
    test_array      ARRAY<INT>,
    test_map        MAP<STRING, FLOAT>,
    test_struct     STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (test_par1 STRING, test_par2 STRING)
stored as orcfile;

The test data is the same as for the text file format table.
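
One way to load those 40 rows into the ORC table is to copy them from the text table with dynamic partitioning (a sketch, assuming the session allows dynamic partitions):

-- allow partition values to come from the select
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- select * on a partitioned table returns the partition columns last,
-- matching the partition spec (test_par1, test_par2)
insert into table test_hive_source_orc partition(test_par1, test_par2)
select * from test_hive_source;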

2.1 Test in Flink Engine

2.1.1 Job Config File

env {
  # You can set flink configuration here
  execution.parallelism = 3
  job.name="test_hiveorc_source_to_console"
}

source {
  # This is an example input plugin **only for test and demonstrate the feature input plugin**

  Hive {
    table_name = "test_hive.test_hive_source_orc"
    metastore_uri = "thrift://ctyun7:9083"
  }

  # If you would like to get more information about how to configure seatunnel and see full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/source-plugins/Fake
}

transform {

}

sink {
  # choose stdout output plugin to output data to console
  Console {
  }

  # If you would like to get more information about how to configure seatunnel and see full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/sink-plugins/Console
}

2.1.2 Submit Job Command

sh start-seatunnel-flink-connector-v2.sh --config ../config/flink_hiveorc_to_console.conf

2.1.3 Check Data Result

The fields in the data are correct. [screenshot: field values in the Flink job log]

The row count is correct. [screenshot: row count in the Flink job log]

2.2 Test in Spark Engine

2.2.1 Job Config File

env {
  # You can set spark configuration here
  source.parallelism = 3
  job.name="test_hiveorc_source_to_console"
}

source {
  # This is an example input plugin **only for test and demonstrate the feature input plugin**

  Hive {
    table_name = "test_hive.test_hive_source_orc"
    metastore_uri = "thrift://ctyun7:9083"
  }

  # If you would like to get more information about how to configure seatunnel and see full list of input plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/source-plugins/Fake
}

transform {

}

sink {
  # choose stdout output plugin to output data to console
  Console {
  }

  # If you would like to get more information about how to configure seatunnel and see full list of output plugins,
  # please go to https://seatunnel.apache.org/docs/flink/configuration/sink-plugins/Console
}

2.2.2 Submit Job Command

sh start-seatunnel-spark-connector-v2.sh --config ../config/spark_hiveorc_to_console.conf --deploy-mode client --master local

2.2.3 Check Data Result

3. Test Parquet file format table

create table test_hive_source_parquet(
    test_tinyint    TINYINT,
    test_smallint   SMALLINT,
    test_int        INT,
    test_bigint     BIGINT,
    test_boolean    BOOLEAN,
    test_float      FLOAT,
    test_double     DOUBLE,
    test_string     STRING,
    test_binary     BINARY,
    test_timestamp  TIMESTAMP,
    test_decimal    DECIMAL(8,2),
    test_char       CHAR(64),
    test_varchar    VARCHAR(64),
    test_date       DATE,
    test_array      ARRAY<INT>,
    test_map        MAP<STRING, FLOAT>,
    test_struct     STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (test_par1 STRING, test_par2 STRING)
stored as PARQUET;

EricJoy2048 · Sep 19 '22 07:09

Please assign it to me, I will try it.

john8628 · Sep 20 '22 01:09

> Please assign it to me, I will try it.

Which issue do you want to work on?

I am doing the test now and will record all the problems found in the test in the todo list here. You can pick the issue you want to handle.

EricJoy2048 · Sep 20 '22 01:09

> Please assign it to me, I will try it.
>
> Which issue do you want to work on?
>
> I am doing the test now and will record all the problems found in the test in the todo list here. You can pick the issue you want to handle.

I want to work on #2792, but it seems quite difficult for me.

john8628 · Sep 20 '22 05:09

> Please assign it to me, I will try it.
>
> Which issue do you want to work on? I am doing the test now and will record all the problems found in the test in the todo list here. You can pick the issue you want to handle.
>
> I want to work on #2792, but it seems quite difficult for me.

I have already fixed #2792. You can look at the other tasks; we have many issues marked as good first issue that need help.

EricJoy2048 · Sep 20 '22 06:09

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.

github-actions[bot] · Nov 19 '22 00:11

This issue has been closed because it has not received a response for a long time. You can reopen it if you encounter similar problems in the future.

github-actions[bot] · Nov 29 '22 00:11