
KAFKA-9436-1: (Simplified version of the 9436 PR) New Kafka Connect SMT for plainText => Struct (or Map)

Open whsoul opened this issue 3 years ago • 0 comments

KIP link

Re-branched and resubmitted PR (regarding #7965), with fixes based on review feedback from Chris Egerton ( https://lists.apache.org/thread/xb57l7j953k8dfgqvktb09y31vzpm1xx https://lists.apache.org/thread/20954n2g5wjdrts740ft3rnlx1ogh7gb https://lists.apache.org/thread/7t1k0ko8l973v4oj3l983j7qpwolhyzf )

  1. I wonder if it's necessary to include support for type casting with this SMT. We already have a Cast SMT (https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/Cast.java) that can parse multiple fields of a structured record value with differing types. Would it be enough for your new SMT to only produce string values for its structured data, and then allow users to perform casting logic using the Cast SMT afterward?
  2. It seems like the "struct.field" property is similar; based on the examples, it looks like when the SMT is configured with a value for that property, it will first pull out a field from a structured record value (for example, it would pull out the value "https://kafka.apache.org/documentation/#connect" from a map of {"url": "https://kafka.apache.org/documentation/#connect"}), then parse that field's value, and replace the entire record value (or key) with the result of the parsing stage. It seems like this could be accomplished using the ExtractField SMT (https://github.com/apache/kafka/blob/trunk/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ExtractField.java) as a preliminary step before passing it to your new SMT. Is this correct? And if so, could we simplify the interface for your SMT by removing the "struct.field" property in favor of the existing ExtractField SMT?

Changes made in response to this review:

  1. Cast function removed (use in combination with the Cast SMT instead).
  2. struct.field option removed (use in combination with the ExtractField SMT instead).
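
As a rough sketch of that combination, the three transforms might be chained as below. The aliases `Extract`, `Parse` and `CastTypes` and the source field name `message` are hypothetical; the regex and mapping are taken from sample2 further down. `field` (ExtractField) and `spec` (Cast) are existing options of those SMTs, while the `regex` and `mapping` properties are assumed to behave as in the samples in this PR.

```json
{
  "transforms": "Extract,Parse,CastTypes",
  "transforms.Extract.type": "org.apache.kafka.connect.transforms.ExtractField$Value",
  "transforms.Extract.field": "message",
  "transforms.Parse.type": "org.apache.kafka.connect.transforms.ParseStructByRegex$Value",
  "transforms.Parse.regex": "^(.{3,4})_(.*)_(pc|mw|ios|and)([0-9]{3})_([0-9]{13})",
  "transforms.Parse.mapping": "env,serviceId,device,sequence,datetime",
  "transforms.CastTypes.type": "org.apache.kafka.connect.transforms.Cast$Value",
  "transforms.CastTypes.spec": "sequence:int32,datetime:int64"
}
```

Transforms run in the order listed, so the field is extracted first, parsed into string-valued fields second, and cast last.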

New SMT

Converts plain text to a Struct (or Map): each regex capture group is assigned, in order, to a key name. Works with a single plain-text input as well as with plain text taken from a struct field.

sample1

"111.61.73.113 - - [08/Aug/2019:18:15:29 +0900] \"OPTIONS /api/v1/service_config HTTP/1.1\" 200 - 101989 \"http://local.test.com/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36\""

The Connect SMT config below uses a regular expression to transform this plain text into Struct (or Map) data.

"transforms": "TimestampTopic, RegexTransform",
"transforms.RegexTransform.type": "org.apache.kafka.connect.transforms.ParseStructByRegex$Value",
"transforms.RegexTransform.regex": "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(GET|POST|OPTIONS|HEAD|PUT|DELETE|PATCH) (.+?) (.+?)\" (\\d{3}) ([0-9|-]+) ([0-9|-]+) \"([^\"]+)\" \"([^\"]+)\"",
"transforms.RegexTransform.mapping": "IP,RemoteUser,AuthedRemoteUser,DateTime,Method,Request,Protocol,Response,BytesSent,Ms,Referrer,UserAgent"
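
To illustrate how the regex groups line up with the mapping names, here is a minimal plain-Java sketch. It is not the SMT's implementation: the class `RegexMappingSketch` and its `parse` helper are hypothetical, it returns a plain `Map` rather than a Connect `Struct`, and returning an empty map on a non-match is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexMappingSketch {
    // Regex, mapping and input copied from sample1 (the JSON escaping happens to equal Java's).
    static final String REGEX = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(GET|POST|OPTIONS|HEAD|PUT|DELETE|PATCH) (.+?) (.+?)\" (\\d{3}) ([0-9|-]+) ([0-9|-]+) \"([^\"]+)\" \"([^\"]+)\"";
    static final String MAPPING = "IP,RemoteUser,AuthedRemoteUser,DateTime,Method,Request,Protocol,Response,BytesSent,Ms,Referrer,UserAgent";
    static final String SAMPLE = "111.61.73.113 - - [08/Aug/2019:18:15:29 +0900] \"OPTIONS /api/v1/service_config HTTP/1.1\" 200 - 101989 \"http://local.test.com/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36\"";

    // Hypothetical helper: pairs capture group i+1 with the i-th name in the mapping.
    static Map<String, String> parse(String regex, String mapping, String text) {
        String[] keys = mapping.split(",");
        Matcher m = Pattern.compile(regex).matcher(text);
        Map<String, String> result = new LinkedHashMap<>();
        if (!m.find()) {
            return result; // assumption: a non-matching input yields an empty map
        }
        for (int i = 0; i < keys.length && i < m.groupCount(); i++) {
            result.put(keys[i].trim(), m.group(i + 1)); // every value stays a String
        }
        return result;
    }

    public static void main(String[] args) {
        // Print each key/value pair in mapping order.
        parse(REGEX, MAPPING, SAMPLE).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

All values come out as strings (for example `Response` is `"200"` and `BytesSent` is `"-"`), which is why any type conversion is left to the Cast SMT.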

sample2

"dev_kafka_pc001_1580372261372"
"transforms": "RegexTransform",
"transforms.RegexTransform.type": "org.apache.kafka.connect.transforms.ParseStructByRegex$Value",
"transforms.RegexTransform.regex": "^(.{3,4})_(.*)_(pc|mw|ios|and)([0-9]{3})_([0-9]{13})",
"transforms.RegexTransform.mapping": "env,serviceId,device,sequence,datetime"
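
One detail of this pattern worth seeing in action is the greedy `(.*)` group: because it backtracks only as far as the last `_<device><3 digits>_<13 digits>` suffix, a service id may itself contain underscores. The sketch below checks this with plain `java.util.regex`; the class `GreedyServiceIdSketch`, the `describe` helper and the second input string are hypothetical illustrations, not part of the SMT.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyServiceIdSketch {
    // Regex and mapping copied from sample2.
    static final Pattern PATTERN =
            Pattern.compile("^(.{3,4})_(.*)_(pc|mw|ios|and)([0-9]{3})_([0-9]{13})");
    static final String[] KEYS = "env,serviceId,device,sequence,datetime".split(",");

    // Hypothetical helper: renders "key=value" pairs in mapping order, or "no match".
    static String describe(String input) {
        Matcher m = PATTERN.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < KEYS.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(KEYS[i]).append('=').append(m.group(i + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("dev_kafka_pc001_1580372261372"));
        // Made-up input: the service id contains an underscore.
        System.out.println(describe("prod_my_service_ios002_1580372261372"));
        // Made-up input: "web" is not one of the allowed device codes.
        System.out.println(describe("stg_x_web001_1580372261372"));
    }
}
```

The sample input yields `env=dev, serviceId=kafka, device=pc, sequence=001, datetime=1580372261372`; the second input keeps `my_service` together as the service id, and the third does not match at all.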

whsoul avatar May 26 '22 02:05 whsoul