kafka
KAFKA-9436-1: (Simplified version of the 9436 PR) New Kafka Connect SMT for plain text => Struct (or Map)
Re-branched and re-submitted PR (follow-up to #7965) incorporating the review feedback from Chris Egerton (https://lists.apache.org/thread/xb57l7j953k8dfgqvktb09y31vzpm1xx https://lists.apache.org/thread/20954n2g5wjdrts740ft3rnlx1ogh7gb https://lists.apache.org/thread/7t1k0ko8l973v4oj3l983j7qpwolhyzf)
- I wonder if it's necessary to include support for type casting with this SMT. We already have a Cast SMT (https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/Cast.java) that can parse multiple fields of a structured record value with differing types. Would it be enough for your new SMT to only produce string values for its structured data, and then allow users to perform casting logic using the Cast SMT afterward?
- It seems like the "struct.field" property is similar; based on the examples, it looks like when the SMT is configured with a value for that property, it will first pull out a field from a structured record value (for example, it would pull out the value "https://kafka.apache.org/documentation/#connect" from a map of {"url": "https://kafka.apache.org/documentation/#connect"}), then parse that field's value, and replace the entire record value (or key) with the result of the parsing stage. It seems like this could be accomplished using the ExtractField SMT (https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ExtractField.java) as a preliminary step before passing it to your new SMT. Is this correct? And if so, could we simplify the interface for your SMT by removing the "struct.field" property in favor of the existing ExtractField SMT?
Changes made in response:
- Type casting removed (use this SMT in combination with the Cast SMT instead)
- struct.field option removed (use this SMT in combination with the ExtractField SMT instead)
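To illustrate how the removed features map onto existing transforms, here is a rough sketch of a chained configuration, assembled as a Java map for convenience. The `field` and `spec` properties are the standard ExtractField and Cast settings. The transform aliases (`ExtractMessage`, `CastTypes`), the `message` field name, and the short regex and mapping are assumptions made up for this example. They are not part of the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that only assembles connector properties; it does not talk to Kafka.
class ChainedSmtConfig {

    static Map<String, String> build() {
        Map<String, String> cfg = new LinkedHashMap<>();
        // Transforms run in the order they are listed here.
        cfg.put("transforms", "ExtractMessage,RegexTransform,CastTypes");

        // Step 1 (replaces the removed struct.field option): pull the text out of
        // a structured value. The field name "message" is an assumption.
        cfg.put("transforms.ExtractMessage.type",
                "org.apache.kafka.connect.transforms.ExtractField$Value");
        cfg.put("transforms.ExtractMessage.field", "message");

        // Step 2: the SMT from this PR turns the text into string-valued fields.
        // The regex and mapping are illustrative only.
        cfg.put("transforms.RegexTransform.type",
                "org.apache.kafka.connect.transforms.ParseStructByRegex$Value");
        cfg.put("transforms.RegexTransform.regex", "^(\\S+) (\\d{3}) (\\d+)$");
        cfg.put("transforms.RegexTransform.mapping", "Request,Response,Ms");

        // Step 3 (replaces the removed casting support): convert selected fields.
        cfg.put("transforms.CastTypes.type",
                "org.apache.kafka.connect.transforms.Cast$Value");
        cfg.put("transforms.CastTypes.spec", "Response:int32,Ms:int64");
        return cfg;
    }

    public static void main(String[] args) {
        // Print the properties in insertion order.
        build().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Whether the Cast step accepts the output of the new SMT depends on the value type the SMT emits, so treat the last step as something to verify against the actual implementation.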
New SMT
Plain text => Struct (or Map): each regex capture group is assigned, in order, to the corresponding key name in the mapping. Compatible with a record that is a single plain-text string and with plain text taken from a struct field.
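The group-to-key mapping described above can be sketched in plain Java with `java.util.regex`. This is a minimal illustration, not the PR's `ParseStructByRegex` code: the class name `RegexGroupMapper`, its methods, and the choice to return `null` for non-matching input are all assumptions. The regex, mapping, and input are the ones used in sample2 of this description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the SMT's core step; not the PR's actual class.
class RegexGroupMapper {
    private final Pattern pattern;
    private final List<String> keys = new ArrayList<>();

    RegexGroupMapper(String regex, String mapping) {
        this.pattern = Pattern.compile(regex);
        for (String key : mapping.split(",")) {
            keys.add(key.trim());
        }
        // Fail early if the number of capture groups and key names differ.
        int groups = pattern.matcher("").groupCount();
        if (groups != keys.size()) {
            throw new IllegalArgumentException(
                    "regex has " + groups + " groups but mapping has " + keys.size() + " keys");
        }
    }

    // Returns key -> captured text in mapping order, or null when the input does not match
    // (the null return is an assumption of this sketch).
    Map<String, String> parse(String text) {
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return null;
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            out.put(keys.get(i), m.group(i + 1)); // group 0 is the whole match
        }
        return out;
    }

    public static void main(String[] args) {
        RegexGroupMapper mapper = new RegexGroupMapper(
                "^(.{3,4})_(.*)_(pc|mw|ios|and)([0-9]{3})_([0-9]{13})",
                "env,serviceId,device,sequence,datetime");
        // prints {env=dev, serviceId=kafka, device=pc, sequence=001, datetime=1580372261372}
        System.out.println(mapper.parse("dev_kafka_pc001_1580372261372"));
    }
}
```

All values stay strings here, which matches the decision to leave type conversion to the Cast SMT.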
sample1
"111.61.73.113 - - [08/Aug/2019:18:15:29 +0900] \"OPTIONS /api/v1/service_config HTTP/1.1\" 200 - 101989 \"http://local.test.com/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36\""
The Connect SMT configuration below uses a regular expression to transform this plain text into Struct (or Map) data.
"transforms": "TimestampTopic, RegexTransform",
"transforms.RegexTransform.type": "org.apache.kafka.connect.transforms.ParseStructByRegex$Value",
"transforms.RegexTransform.regex": "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(GET|POST|OPTIONS|HEAD|PUT|DELETE|PATCH) (.+?) (.+?)\" (\\d{3}) ([0-9|-]+) ([0-9|-]+) \"([^\"]+)\" \"([^\"]+)\"",
"transforms.RegexTransform.mapping": "IP,RemoteUser,AuthedRemoteUser,DateTime,Method,Request,Protocol,Response,BytesSent,Ms,Referrer,UserAgent"
sample2
"dev_kafka_pc001_1580372261372"
"transforms": "RegexTransform",
"transforms.RegexTransform.type": "org.apache.kafka.connect.transforms.ParseStructByRegex$Value",
"transforms.RegexTransform.regex": "^(.{3,4})_(.*)_(pc|mw|ios|and)([0-9]{3})_([0-9]{13})",
"transforms.RegexTransform.mapping": "env,serviceId,device,sequence,datetime"