S3 Storage structureFormat parquet issue

Open mykola-yesypchuk-inflection opened this issue 9 months ago • 2 comments

Affected module Ingestion Framework

Describe the bug S3 storage metadata ingestion fails because of a _SUCCESS file in the folder of a dataPath entry.

To Reproduce openmetadata.json

{
    "entries": [
        {
            "dataPath": "data/sp_entity",
            "structureFormat": "parquet",
            "isPartitioned": false
        }
    ]
}

Airflow logs

[2024-04-30, 13:21:32 UTC] {metadata.py:429} INFO - Looking for metadata template file at - s3://test-bucket/openmetadata.json
[2024-04-30, 13:21:33 UTC] {metadata.py:246} INFO - Extracting metadata from path data/sp_entity and generating structured container
[2024-04-30, 13:21:33 UTC] {metadata.py:365} INFO - File data/sp_entity/part-00000-764565f7-45f7-416c-a7aa-8932bc1ebf83-c000.snappy.parquet was picked to infer data structure from.
[2024-04-30, 13:21:40 UTC] {metadata.py:143} INFO - Extracting metadata from path data/sp_entity and generating structured container
[2024-04-30, 13:21:41 UTC] {metadata.py:365} INFO - File data/sp_entity/_SUCCESS was picked to infer data structure from.
[2024-04-30, 13:21:42 UTC] {datalake_utils.py:69} ERROR - Error fetching file [test-bucket/data/sp_entity/_SUCCESS] using [S3Config] due to: [Error reading dataframe due to [Could not open Parquet input source 's3://test-bucket/data/sp_entity/_SUCCESS': Parquet file size is 0 bytes]]
[2024-04-30, 13:21:42 UTC] {status.py:76} WARNING - Wild error while creating Container from bucket details - 'NoneType' object has no attribute 'columns'
[2024-04-30, 13:21:42 UTC] {taskinstance.py:1937} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 192, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 209, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/openmetadata_managed_apis/workflows/ingestion/common.py", line 209, in metadata_ingestion_workflow
    workflow.raise_from_status()
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/workflow_status_mixin.py", line 125, in raise_from_status
    raise err
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/workflow_status_mixin.py", line 122, in raise_from_status
    self.raise_from_status_internal(raise_warnings)
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/ingestion.py", line 149, in raise_from_status_internal
    raise WorkflowExecutionError(
metadata.config.common.WorkflowExecutionError: S3 reported errors: S3 Summary: [1 Records, [0 Updated Records, 0 Warnings, 1 Errors, 91 Filtered]

Expected behavior Presumably the _SUCCESS file should be ignored, so that the job runs without an exception.
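A minimal sketch of the kind of filtering that would avoid picking `_SUCCESS` as the sample file. It assumes the candidates are dicts with `Key` and `Size` fields, as in the `Contents` entries of an S3 object listing. `pick_sample_key` and `MARKER_PREFIXES` are hypothetical names for illustration, not OpenMetadata functions, and the sizes in the usage note are made up.

```python
from typing import Iterable, Optional

# Hypothetical convention: names starting with "_" or "." are job markers
# or hidden files (e.g. _SUCCESS), not data files.
MARKER_PREFIXES = ("_", ".")


def pick_sample_key(objects: Iterable[dict], extension: str = ".parquet") -> Optional[str]:
    """Return the first key that looks like a readable data file, or None."""
    for obj in objects:
        key = obj["Key"]
        name = key.rsplit("/", 1)[-1]  # file name without the prefix
        if not name or name.startswith(MARKER_PREFIXES):
            continue  # skip "folder" keys and marker files such as _SUCCESS
        if obj.get("Size", 0) == 0:
            continue  # a 0-byte object cannot be a valid Parquet file
        if not name.endswith(extension):
            continue  # skip files of a different format
        return key
    return None
```

With the two files from the logs above, `pick_sample_key([{"Key": "data/sp_entity/_SUCCESS", "Size": 0}, {"Key": "data/sp_entity/part-00000-764565f7-45f7-416c-a7aa-8932bc1ebf83-c000.snappy.parquet", "Size": 1024}])` skips the marker and returns the `part-00000-...` key.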

Version:

  • OS: Windows
  • Python version: 3.10
  • OpenMetadata version: 1.3.3
  • OpenMetadata deploy: https://docs.open-metadata.org/v1.3.x/quick-start/local-docker-deployment

Also, I see that the S3 container shows the wrong stats. They seem to be the stats of the whole bucket, not of the table container itself. OpenMetadata code: https://github.com/open-metadata/OpenMetadata/blob/89b083b6f260bbea3a8f2a2246c4fd9487baa433/ingestion/src/metadata/ingestion/source/storage/s3/metadata.py#L227-L232
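If the stats are indeed taken from the whole bucket, a per-container version could be computed from the objects under the entry's dataPath only. This is a rough sketch under the same assumption as before (listing entries are dicts with `Key` and `Size`); `prefix_stats` is a hypothetical helper, not the code at the link above.

```python
from typing import Iterable, Tuple


def prefix_stats(objects: Iterable[dict], data_path: str) -> Tuple[int, int]:
    """Return (number_of_objects, total_size_in_bytes) under data_path only."""
    # Normalize to "data/sp_entity/" so that "data/sp_entity_old/..." does not match.
    prefix = data_path.rstrip("/") + "/"
    matching = [obj for obj in objects if obj["Key"].startswith(prefix)]
    return len(matching), sum(obj.get("Size", 0) for obj in matching)
```

Objects that live elsewhere in the bucket are then left out of the container's object count and size.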

Do we have any updates on that?

@pmbrull can you review this?

harshach, Jun 22 '24 21:06