
Sample Data Ingestion: Can not ingest tables with complex data types

Open nqvuong1998 opened this issue 1 year ago • 13 comments

Is your feature request related to a problem? Please describe.

When ingesting sample data from Hive tables through Trino, ingestion fails with the error "Error trying to ingest sample data for table" for tables that have complex data types.

Describe the solution you'd like

There are two possible solutions:

  1. Display sample data for complex data types such as struct, map, and array in a form that matches the schema structure.
  2. Convert complex sample data to a JSON string and display it in a single column, using a JSON representation of each row's complex values.
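
Option 2 above could look roughly like the sketch below. It assumes the driver hands back struct, map, and array values as Python dicts, lists, or tuples; `to_jsonable` and `stringify_complex` are hypothetical helper names used for illustration, not OpenMetadata APIs.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def to_jsonable(value):
    """Recursively convert a sampled value into JSON-compatible types."""
    if isinstance(value, dict):  # map, or struct returned as a dict
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):  # array, or struct returned as a tuple
        return [to_jsonable(v) for v in value]
    if isinstance(value, Decimal):
        return str(value)  # string keeps the exact decimal precision
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, bytes):
        return value.hex()
    return value


def stringify_complex(row):
    """Return a copy of the row where complex cells become JSON strings."""
    return [
        json.dumps(to_jsonable(cell)) if isinstance(cell, (dict, list, tuple)) else cell
        for cell in row
    ]
```

Scalar cells pass through unchanged, so only the struct, map, and array columns would end up as a single JSON string per row.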

nqvuong1998 avatar Jul 10 '24 09:07 nqvuong1998

Same issue here. It seems complex data types are not processed in OpenMetadata. The Hue/Impala connector doesn't process complex data types at all, while Trino processes them for the schema but not for sample data or the data profiler.

chuqbach avatar Jul 10 '24 09:07 chuqbach

Probably related to https://github.com/open-metadata/OpenMetadata/issues/15627

sushi30 avatar Aug 20 '24 17:08 sushi30

Hello @nqvuong1998, we will discuss this internally and see which release it can be part of. Until then, it would be great if you could provide us with the DDL of the table. Also, as the project is open source, we encourage people to contribute. Let us know if you want to contribute and we will help wherever needed. Thanks 🙏

ayush-shah avatar Sep 23 '24 04:09 ayush-shah

@nqvuong1998 can you share the OpenMetadata version you are on, any logs you have, and the table DDL? We could not reproduce it on our end; JSON/STRUCT fields are ingested as sample data as expected.

TeddyCr avatar Oct 14 '24 13:10 TeddyCr

Hi @TeddyCr ,

  • We updated OpenMetadata to 1.5.6 (the latest version).
  • DDL: SHOW CREATE TABLE pmc.curated_pmc_promotion_transaction_prod_event_v1_sid72;
CREATE TABLE pmc.curated_pmc_promotion_transaction_prod_event_v1_sid72 (
    key STRING,
    payload STRUCT<
        promotiontransactionid: STRING,
        validto: STRING,
        vouchercode: STRING,
        vouchername: STRING,
        flowapplied: STRING,
        status: STRING,
        reftransactionid: STRING,
        initialoriginalamount: DECIMAL(19,2),
        discountamount: DECIMAL(19,2),
        initialfinalamount: DECIMAL(19,2),
        initialactualamount: DECIMAL(19,2),
        cuid: STRING,
        contractnumber: STRING,
        supplementid: STRING,
        creationdate: STRING,
        adjustmenthistory: ARRAY<STRUCT<
            id: STRING,
            refundrequestid: STRING,
            refundamount: DECIMAL(19,2),
            status: STRING,
            refundresulttime: STRING
        >>,
        prcode: STRING,
        campaigncode: STRING,
        paymentmethod: STRING
    >,
    kafka_topic STRING,
    kafka_partition INT,
    kafka_offset BIGINT,
    kafka_timestamp TIMESTAMP,
    kafka_timestamp_type INT,
    ingested_by STRING,
    ingestion_time TIMESTAMP,
    hour INT,
    hash STRING
)
PARTITIONED BY (
    date BIGINT
)
WITH SERDEPROPERTIES (
    'partitionOverwriteMode' = 'dynamic',
    'path' = 'hdfs://nameservice1/user/hive/warehouse/pmc.db/curated_pmc_promotion_transaction_prod_event_v1_sid72',
    'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/user/hive/warehouse/pmc.db/curated_pmc_promotion_transaction_prod_event_v1_sid72'
  • Logs:
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +--------------+-------------------------------------------------------------------------+----------------------------------------------+---------------+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] | From         | Entity Name                                                             | Message                                      | Stack Trace   |
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +==============+=========================================================================+==============================================+===============+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] | OpenMetadata | trino_bdp.bdp.pmc.curated_pmc_promotion_transaction_prod_event_v1_sid43 | Error trying to ingest sample data for table |               |
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +--------------+-------------------------------------------------------------------------+----------------------------------------------+---------------+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] | OpenMetadata | trino_bdp.bdp.pmc.curated_pmc_promotion_transaction_prod_event_v1_sid47 | Error trying to ingest sample data for table |               |
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +--------------+-------------------------------------------------------------------------+----------------------------------------------+---------------+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] | OpenMetadata | trino_bdp.bdp.pmc.curated_pmc_promotion_transaction_prod_event_v1_sid72 | Error trying to ingest sample data for table |               |
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] +--------------+-------------------------------------------------------------------------+----------------------------------------------+---------------+
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] [2024-10-13 02:52:30] INFO     {metadata.Utils:logger:178} - Success %: 75.0
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] [2024-10-13 02:52:30] INFO     {metadata.Utils:logger:178} - Workflow finished in time: 4.0m 19.15s
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base] Traceback (most recent call last):
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]   File "/usr/local/bin/metadata", line 8, in <module>
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]     sys.exit(metadata())
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]              ^^^^^^^^^^
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/metadata/cmd.py", line 156, in metadata
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]     RUN_PATH_METHODS[metadata_workflow](path)
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/metadata/cli/profile.py", line 51, in run_profiler
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]     workflow.raise_from_status()
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/metadata/workflow/workflow_status_mixin.py", line 134, in raise_from_status
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]     raise err
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/metadata/workflow/workflow_status_mixin.py", line 131, in raise_from_status
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]     self.raise_from_status_internal(raise_warnings)
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]   File "/usr/local/lib/python3.11/site-packages/metadata/workflow/ingestion.py", line 163, in raise_from_status_internal
[2024-10-13, 09:52:34 +07] {pod_manager.py:472} INFO - [base]     raise WorkflowExecutionError(
[2024-10-13, 09:52:34 +07] {pod_manager.py:490} INFO - [base] metadata.config.common.WorkflowExecutionError: OpenMetadata reported errors: OpenMetadata Summary: [3 Records, [0 Updated Records, 0 Warnings, 3 Errors, 0 Filtered]
  • Expectation: OM should support showing sample data in raw JSON format. A dummy example:
{
  "key": "abc123",
  "payload": {
    "promotiontransactionid": "promo_001",
    "validto": "2024-12-31",
    "vouchercode": "VOUCHER2024",
    "vouchername": "Holiday Discount",
    "flowapplied": "Purchase",
    "status": "Active",
    "reftransactionid": "ref_12345",
    "initialoriginalamount": 100.00,
    "discountamount": 20.00,
    "initialfinalamount": 80.00,
    "initialactualamount": 80.00,
    "cuid": "cuid_67890",
    "contractnumber": "CNTR2024",
    "supplementid": "SUPP123",
    "creationdate": "2024-10-15T10:00:00Z",
    "adjustmenthistory": [
      {
        "id": "adj_001",
        "refundrequestid": "rr_001",
        "refundamount": 10.00,
        "status": "Refunded",
        "refundresulttime": "2024-10-16T11:00:00Z"
      },
      {
        "id": "adj_002",
        "refundrequestid": "rr_002",
        "refundamount": 5.00,
        "status": "Pending",
        "refundresulttime": "2024-10-17T12:00:00Z"
      }
    ],
    "prcode": "PRCODE2024",
    "campaigncode": "CMP2024",
    "paymentmethod": "Credit Card"
  },
  "kafka_topic": "promotion_events",
  "kafka_partition": 1,
  "kafka_offset": 123456,
  "kafka_timestamp": "2024-10-15T10:05:00Z",
  "kafka_timestamp_type": 0,
  "ingested_by": "user_001",
  "ingestion_time": "2024-10-15T10:06:00Z",
  "hour": 10,
  "hash": "abcdef1234567890",
  "date": 20241015
}
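
A row like the dummy example above could be rendered as one JSON document with the standard `json` module. This is only a sketch: it assumes the sampled row arrives as a list of Python values alongside the column names, and `json_default` and `row_to_json` are hypothetical names, not part of OpenMetadata.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def json_default(value):
    """Fallback used by json.dumps for types it cannot serialize itself."""
    if isinstance(value, Decimal):
        # float is lossy for very large values; str(value) would keep precision
        return float(value)
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Unsupported type: {type(value).__name__}")


def row_to_json(columns, row):
    """Pair column names with values and render the row as one JSON document."""
    return json.dumps(dict(zip(columns, row)), default=json_default, indent=2)
```

`json.dumps` calls the `default` hook for nested values too, so DECIMAL fields inside `payload` and `adjustmenthistory` are handled without extra recursion.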

nqvuong1998 avatar Oct 15 '24 04:10 nqvuong1998

Can you share the full log files? Running the workflow with the DEBUG log level would be helpful. Feel free to DM them to me in our Slack channel. I see 3 errors in there and would be interested to see what they are.

TeddyCr avatar Oct 16 '24 12:10 TeddyCr

Hi @TeddyCr @ayush-shah ,

  • Trino Profiler config:
source:
  type: trino
  serviceName: trino_bdp
  serviceConnection:
    config:
      type: Trino
      hostPort: $TRINO_HOST_PORT
      username: $TRINO_USERNAME
      authType:
        # For basic auth
        password: $TRINO_PASSWORD
      catalog: bdp
      connectionArguments:
        verify: /data/ca.pem
  sourceConfig:
    config:
      type: Profiler
      generateSampleData: true
      sampleDataCount: 70
      computeMetrics: false
      profileSampleType: PERCENTAGE
      profileSample: 100
      processPiiSensitive: false
      confidence: 80
      threadCount: 5
      timeoutSeconds: 43200
      includeViews: false
      schemaFilterPattern:
        includes:
          - ^pmc$
processor:
  type: orm-profiler
  config: {}
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  loggerLevel: DEBUG
  openMetadataServerConfig:
    hostPort: $OM_HOST_PORT
    authProvider: openmetadata
    securityConfig:
      jwtToken: $OM_JWT_TOKEN
    ## Store the service Connection information
    storeServiceConnection: false
    ## If SSL, fill the following
    verifySSL: validate
    sslConfig:
      caCertificate: /data/ca.pem

nqvuong1998 avatar Oct 17 '24 03:10 nqvuong1998

Hi @TeddyCr @ayush-shah, any updates on this issue?

nqvuong1998 avatar Dec 20 '24 03:12 nqvuong1998

We will take care of this as part of 1.7 or a minor release after 1.7.

ayush-shah avatar Jan 15 '25 16:01 ayush-shah

Moving this to 1.7.1

ayush-shah avatar Mar 15 '25 04:03 ayush-shah

@ayush-shah are you looking into this?

harshach avatar Apr 23 '25 14:04 harshach

Hi @ayush-shah @TeddyCr, will this feature be released in 1.7.x?

nqvuong1998 avatar May 05 '25 14:05 nqvuong1998

Hey @nqvuong1998 yes, we are planning it for 1.7.1 right now.

TeddyCr avatar May 06 '25 13:05 TeddyCr

Hey @Prajwal214, as discussed, here is the sample data and DDL for a Trino table containing complex data types.

DDL+sample.txt

aimendenche-nw avatar Jul 02 '25 07:07 aimendenche-nw