kafka-connect-hdfs
Hive `avro.schema.literal` exceeds its max size due to a very long Avro schema
Hi team, I have an issue with a long Avro schema when using the kafka-connect-hdfs Hive integration.
When I sink a table with a long schema into HDFS with kafka-connect-hdfs, the data is written to HDFS successfully, but I get this error when I run a query:
0: jdbc:hive2://localhost:10000/default> select * from db.table limit 1 ;
Error: Error while compiling statement: FAILED: RuntimeException MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered exception determining schema. Returning signal schema to indicate problem: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: was expecting closing quote for a string value
at [Source: java.io.StringReader@2c87f90b; line: 1, column: 6001]) (state=42000,code=40000)
I investigated the root cause: it is the varchar size limit on the Hive metastore parameter columns.
https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/avro/AvroHiveUtil.java#L69
https://github.com/confluentinc/kafka-connect-hdfs/blob/master/src/main/java/io/confluent/connect/hdfs/avro/AvroHiveUtil.java#L95
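For context, the Hive integration registers the table with the full Avro schema JSON inlined in the avro.schema.literal property, roughly like the sketch below (the exact DDL is built by AvroHiveUtil; the database, table, and location names here are placeholders). That whole JSON string has to fit into a single PARAM_VALUE row in the metastore:

```sql
-- Rough sketch of the table the Hive integration registers;
-- names and location are placeholders, the real DDL is generated by AvroHiveUtil.
-- The entire Avro schema JSON is inlined in avro.schema.literal; if it is longer
-- than the metastore's PARAM_VALUE column (varchar(4000) by default), it is
-- truncated, which produces the "Unexpected end-of-input" parse error above.
CREATE EXTERNAL TABLE db.table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://namenode/topics/table'
TBLPROPERTIES (
  'avro.schema.literal' = '{"type":"record","name":"table","fields":[...]}'
);
```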
mysql> describe TABLE_PARAMS;
+-------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| TBL_ID | bigint(20) | NO | PRI | NULL | |
| PARAM_KEY | varchar(256) | NO | PRI | NULL | |
| PARAM_VALUE | varchar(4000) | YES | | NULL | |
+-------------+---------------+------+-----+---------+-------+
Do you have any solution/idea for this in kafka-connect-hdfs?
I've seen this workaround: log into the Hive Metastore DB, then run alter table SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);
You probably want to pick a realistic size, though.
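For reference, against a MySQL-backed metastore the workaround would look roughly like this (run on the metastore's backing database itself, not in Hive). Widening both param tables is an assumption on my part, since the schema literal can end up as either a table or a serde property:

```sql
-- Run against the Hive metastore's backing MySQL database, not in Hive.
-- Widening both tables is an assumption; pick a size that fits your largest schema.
ALTER TABLE TABLE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);
ALTER TABLE SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(40000);
```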
That is one way to avoid the problem, but it depends on the Hive metastore setup, and the column could be shrunk back to its default size by a Hive upgrade.
How about just using avro.schema.url to reference an .avsc file, so the schema size is effectively unlimited?
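To illustrate the idea (a hypothetical sketch, not something the connector does today; the schema path is made up): the table property would carry only a short URL instead of the full schema JSON, so the metastore column size no longer matters:

```sql
-- Hypothetical: the schema lives in an .avsc file on HDFS and is referenced
-- by URL, so PARAM_VALUE only needs to hold a short path, never the full schema.
CREATE EXTERNAL TABLE db.table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://namenode/topics/table'
TBLPROPERTIES (
  'avro.schema.url' = 'hdfs://namenode/schemas/table.avsc'
);
```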
@daigorowhite yes, using avro.schema.url is fine as long as the schema can be stored somewhere. HDFS is an obvious place, but that requires non-trivial code changes to handle all of the cases (e.g., the schema in a consumed topic changes and needs to be stored somewhere).
I only mentioned that workaround because it works today and requires no code changes. It may not be ideal, and it may not work for everyone.
Thanks for sharing it! 👍
Duplicates #145
Hi, team.
With hive.integration=true, how can the kafka-connect HDFS sink use avro.schema.url instead of avro.schema.literal? Or do I need to alter the table manually in Hive?
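Until there is connector support for this, one manual approach (a sketch; it assumes the current schema has already been written to an .avsc file on HDFS, and the path below is a placeholder) is to switch the existing table's properties by hand in Hive. Note that the connector will likely re-apply avro.schema.literal when the topic's schema evolves, so this may need to be repeated:

```sql
-- Manual switch for an existing connector-created table (run in Hive).
-- Assumes the current Avro schema was already uploaded to the .avsc file below.
ALTER TABLE db.table UNSET TBLPROPERTIES ('avro.schema.literal');
ALTER TABLE db.table SET TBLPROPERTIES (
  'avro.schema.url' = 'hdfs://namenode/schemas/table.avsc'
);
```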