Add example for nested columns with streaming

Open techdocsmith opened this issue 3 years ago • 5 comments

Request from community member for a streaming example with nested JSON, assuming support.

Per @gianm: usage with streaming would be similar to native batch. The dimensionsSpec and transformSpec work the same way.

techdocsmith avatar Oct 03 '22 17:10 techdocsmith

Relevant Slack thread (link good for 90 days): https://apachedruidworkspace.slack.com/archives/C0303FDCZEZ/p1664492820183909

Original question:

Looking forward to trying out nested columns. Is this feature currently supported for streaming ingestion? It appears not, since the documentation calls out SQL-based and batch ingestion methods but not streaming.

The response:

I believe it is supported for streaming ingestion; we just don't have an example. Usage with streaming would be similar to native batch: the dimensionsSpec and transformSpec work the same way.
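As a rough sketch of what "work the same way" could mean for a streaming spec, the fragment below writes a transformSpec that pulls one nested field out into its own column with the native `json_value` expression. The file name `transform-fragment.json`, the output column `agent_os`, and the path `$.os` are illustrative assumptions, not taken from this thread; check the path against your own data.

```shell
# Hypothetical fragment: file name, column name, and JSON path are assumptions.
# The quoted 'EOF' delimiter keeps the shell from touching the $ in the path.
cat > transform-fragment.json <<'EOF'
{
  "transformSpec": {
    "transforms": [
      {
        "type": "expression",
        "name": "agent_os",
        "expression": "json_value(agent, '$.os')"
      }
    ]
  }
}
EOF
```

To use it, the transformSpec would be merged into the dataSchema of a supervisor spec, with "agent_os" listed in the dimensionsSpec next to the "json" typed dimensions.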

gianm avatar Oct 03 '22 21:10 gianm

The doc in question is this page: https://druid.apache.org/docs/latest/querying/nested-columns.html

gianm avatar Oct 03 '22 21:10 gianm

Thanks for the clarification, @gianm

techdocsmith avatar Oct 03 '22 22:10 techdocsmith

Just tested it using the Kafka tutorial, replacing the Wikipedia data with the KTTM nested data. Steps:

Create the topic:

./bin/kafka-topics.sh --create --topic kttm_nested --bootstrap-server localhost:9092

Get the nested data from the KTTM nested example:

curl https://static.imply.io/example-data/kttm-nested-v2/kttm-nested-v2-2019-08-25.json.gz -o kttm-nested-data.json.gz
gunzip -c kttm-nested-data.json.gz > kttm-nested-data.json

Publish to the topic:

export KAFKA_OPTS="-Dfile.encoding=UTF-8"
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kttm_nested < kttm-nested-data.json
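Before moving on to the Druid console, it may help to confirm that the events actually reached the topic. A minimal sketch, assuming the same Kafka directory and broker address as the commands above; the helper name `peek_topic` is made up for illustration:

```shell
# Hypothetical helper: print the first N messages (default 1) of a topic, then exit.
# Assumes it runs from the Kafka install directory with a broker on localhost:9092.
peek_topic() {
  ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic "${1:-kttm_nested}" --from-beginning --max-messages "${2:-1}"
}
```

Running `peek_topic kttm_nested 1` should print a single JSON event if the publish step worked.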

The UI for "Load Data" does not automatically recognize the nested JSON columns in the parsing step. In the "Configure Schema" step, you can use "Add dimension", type the name and choose type "json".

The resulting Ingestion Spec:

{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "localhost:9092"
      },
      "topic": "kttm_nested",
      "inputFormat": {
        "type": "json"
      },
      "useEarliestOffset": true
    },
    "tuningConfig": {
      "type": "kafka"
    },
    "dataSchema": {
      "dataSource": "kttm_nested",
      "timestampSpec": {
        "column": "timestamp",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "session",
          "number",
          "client_ip",
          "language",
          "adblock_list",
          "app_version",
          "path",
          "loaded_image",
          "referrer",
          "referrer_host",
          "server_ip",
          "screen",
          "window",
          {
            "type": "long",
            "name": "session_length"
          },
          "timezone",
          "timezone_offset",
          {
            "type": "json",
            "name": "event"
          },
          {
            "type": "json",
            "name": "agent"
          },
          {
            "type": "json",
            "name": "geo_ip"
          }
        ]
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": false,
        "segmentGranularity": "hour"
      }
    }
  }
}
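For anyone who prefers the API over the console, here is a sketch of submitting the spec and then querying a nested field. It assumes the spec is saved as `kttm-nested-supervisor.json` (a made-up file name), that the Overlord is reachable on localhost:8081 and the Router on localhost:8888 as in a default quickstart, and that `agent` contains a `browser` field; adjust any of these to match your setup.

```shell
# Assumptions: quickstart ports, hypothetical spec file name, and the $.browser path.
OVERLORD_URL="${OVERLORD_URL:-http://localhost:8081}"
ROUTER_URL="${ROUTER_URL:-http://localhost:8888}"
SPEC_FILE="${SPEC_FILE:-kttm-nested-supervisor.json}"

submit_supervisor() {
  # POST the supervisor spec file given as $1 to the supervisor endpoint.
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d @"$1" "$OVERLORD_URL/druid/indexer/v1/supervisor"
}

query_nested() {
  # Run a SQL query that reads one field out of the nested "agent" column.
  curl -sS -X POST -H 'Content-Type: application/json' \
    "$ROUTER_URL/druid/v2/sql" -d @- <<'EOF'
{"query": "SELECT JSON_VALUE(agent, '$.browser') AS browser, COUNT(*) AS cnt FROM kttm_nested GROUP BY 1 ORDER BY cnt DESC LIMIT 5"}
EOF
}
```

Call `submit_supervisor "$SPEC_FILE"` first. Once the supervisor has read some events, `query_nested` should return a handful of rows.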

@techdocsmith, this example works, but it requires the Kafka setup steps to run, so I'm not sure it fits in the nested columns docs page as is. Perhaps adjust the Kafka tutorial so it uses this source instead? Let me know how else I can help.

sergioferragut avatar Oct 06 '22 22:10 sergioferragut

@sergioferragut, it could potentially go in both places: on the Nested columns page, to show it's possible in a supervisor spec, and in the tutorial too. Thanks for sharing!

techdocsmith avatar Oct 07 '22 00:10 techdocsmith