
Exceeding Hive `avro.schema.literal` max size due to a very long Avro schema

Open daigorowhite opened this issue 7 years ago • 7 comments

Hi team, I have an issue with a long Avro schema when using the kafka-connect-hdfs Hive integration.

When I try to sink a table with a long schema into HDFS with kafka-connect-hdfs, the data is written to HDFS successfully, but I get this error when I run a query:

0: jdbc:hive2://localhost:10000/default> select * from db.table limit 1 ;
Error: Error while compiling statement: FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered exception determining schema. Returning signal schema to indicate problem: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing quote for a string value
 at [Source: java.io.StringReader@2c87f90b; line: 1, column: 6001]) (state=42000,code=40000)

I investigated the root cause of this, and it is the varchar size of the Hive metastore parameter columns: the connector stores the whole Avro schema in the avro.schema.literal table property, and a long schema gets truncated there (a query to check this is sketched after the table below). https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/avro/AvroHiveUtil.java#L69 https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/avro/AvroHiveUtil.java#L95

mysql> describe TABLE_PARAMS;
+-------------+---------------+------+-----+---------+-------+
| Field       | Type          | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| TBL_ID      | bigint(20)    | NO   | PRI | NULL    |       |
| PARAM_KEY   | varchar(256)  | NO   | PRI | NULL    |       |
| PARAM_VALUE | varchar(4000) | YES  |     | NULL    |       |
+-------------+---------------+------+-----+---------+-------+
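A quick way to confirm the truncation is to look at the stored property directly in the metastore. A minimal sketch, assuming a MySQL-backed metastore with the default table names (adjust for your Hive table; the property may live at table level, serde level, or both):

-- Length of the stored schema literal per table; values capped at 4000 suggest truncation
SELECT t.TBL_NAME, p.PARAM_KEY, CHAR_LENGTH(p.PARAM_VALUE) AS stored_length
FROM TABLE_PARAMS p
JOIN TBLS t ON t.TBL_ID = p.TBL_ID
WHERE p.PARAM_KEY = 'avro.schema.literal';

-- The serde-level copy of the property is subject to the same column limit
SELECT PARAM_KEY, CHAR_LENGTH(PARAM_VALUE) AS stored_length
FROM SERDE_PARAMS
WHERE PARAM_KEY = 'avro.schema.literal';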

Do you have any solution/idea for this in kafka-connect-hdfs?

daigorowhite avatar May 29 '18 06:05 daigorowhite

I've seen this workaround: log into the Hive metastore DB, then run alter table SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);

Probably want to pick a realistic size, though.
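Spelled out, and also covering the TABLE_PARAMS column shown above, the workaround looks roughly like this (a sketch for a MySQL-backed metastore; pick a size that fits your schemas, and be aware a later metastore schema upgrade may change these columns again):

-- Run against the Hive metastore database (MySQL), not against Hive itself
ALTER TABLE SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);
ALTER TABLE TABLE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);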

rhauch avatar May 29 '18 23:05 rhauch

It is one way to avoid the problem, but it depends on the Hive setup, and the column could be shortened again by a Hive upgrade. How about just using avro.schema.url instead, pointing at a schema file with no size limit?
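For illustration, a table defined with avro.schema.url instead of avro.schema.literal would look something like the sketch below (hypothetical paths and names; the connector does not generate this today, it always sets the literal):

-- Hypothetical HiveQL: schema kept in an external file rather than in the metastore row
CREATE EXTERNAL TABLE db.table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///topics/table'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/table.avsc');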

daigorowhite avatar May 30 '18 00:05 daigorowhite

@daigorowhite yes, using avro.schema.url is fine as long as the schema can be stored somewhere. HDFS is an obvious place, but that requires non-trivial code changes to handle all of the cases (e.g., the schema changes in a consumed topic and the new version needs to be stored somewhere).

I only mentioned the workaround because it works today and requires no code changes. It may not be ideal, and it may not work for everyone.

rhauch avatar May 30 '18 15:05 rhauch

See HIVE-12274.

You could manually apply the upgrade script for MySQL.
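The HIVE-12274 change widens those parameter columns. For a MySQL metastore, the relevant statements are roughly the following (a sketch only; apply the actual upgrade script shipped with your Hive version rather than hand-rolling it, since it touches other columns too):

-- Roughly the effect of HIVE-12274 on the columns discussed above (MySQL metastore)
ALTER TABLE TABLE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;
ALTER TABLE SERDE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;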

OneCricketeer avatar Jun 08 '18 19:06 OneCricketeer

Thanks for sharing it! 👍

daigorowhite avatar Jun 12 '18 01:06 daigorowhite

Duplicates #145

OneCricketeer avatar Sep 11 '18 01:09 OneCricketeer

Hi, team. With hive.integration=true, how can the kafka-connect sink use avro.schema.url instead of avro.schema.literal? Or do I need to alter the table manually in Hive?
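(For anyone trying the manual route: switching an existing table would look roughly like the sketch below, with a hypothetical schema path. The literal takes precedence over the URL, so it has to be unset, and the connector may write avro.schema.literal again when the schema evolves while Hive integration is enabled.)

-- Hypothetical manual switch for an existing table (schema path is an example)
ALTER TABLE db.table SET TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/table.avsc');
ALTER TABLE db.table UNSET TBLPROPERTIES ('avro.schema.literal');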

Vincent-Zeng avatar Jul 11 '20 06:07 Vincent-Zeng