OpenMetadata
S3 Storage structureFormat parquet issue
Affected module Ingestion Framework
Describe the bug S3 storage metadata ingestion fails because a `_SUCCESS` marker file inside the `dataPath` entries folder is picked to infer the data structure.
To Reproduce `openmetadata.json`:
```json
{
  "entries": [
    {
      "dataPath": "data/sp_entity",
      "structureFormat": "parquet",
      "isPartitioned": false
    }
  ]
}
```
Airflow logs:
```
[2024-04-30, 13:21:32 UTC] {metadata.py:429} INFO - Looking for metadata template file at - s3://test-bucket/openmetadata.json
[2024-04-30, 13:21:33 UTC] {metadata.py:246} INFO - Extracting metadata from path data/sp_entity and generating structured container
[2024-04-30, 13:21:33 UTC] {metadata.py:365} INFO - File data/sp_entity/part-00000-764565f7-45f7-416c-a7aa-8932bc1ebf83-c000.snappy.parquet was picked to infer data structure from.
[2024-04-30, 13:21:40 UTC] {metadata.py:143} INFO - Extracting metadata from path data/sp_entity and generating structured container
[2024-04-30, 13:21:41 UTC] {metadata.py:365} INFO - File data/sp_entity/_SUCCESS was picked to infer data structure from.
[2024-04-30, 13:21:42 UTC] {datalake_utils.py:69} ERROR - Error fetching file [test-bucket/data/sp_entity/_SUCCESS] using [S3Config] due to: [Error reading dataframe due to [Could not open Parquet input source 's3://test-bucket/data/sp_entity/_SUCCESS': Parquet file size is 0 bytes]]
[2024-04-30, 13:21:42 UTC] {status.py:76} WARNING - Wild error while creating Container from bucket details - 'NoneType' object has no attribute 'columns'
[2024-04-30, 13:21:42 UTC] {taskinstance.py:1937} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 192, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 209, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/openmetadata_managed_apis/workflows/ingestion/common.py", line 209, in metadata_ingestion_workflow
    workflow.raise_from_status()
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/workflow_status_mixin.py", line 125, in raise_from_status
    raise err
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/workflow_status_mixin.py", line 122, in raise_from_status
    self.raise_from_status_internal(raise_warnings)
  File "/home/airflow/.local/lib/python3.10/site-packages/metadata/workflow/ingestion.py", line 149, in raise_from_status_internal
    raise WorkflowExecutionError(
metadata.config.common.WorkflowExecutionError: S3 reported errors: S3 Summary: [1 Records, [0 Updated Records, 0 Warnings, 1 Errors, 91 Filtered]
```
Expected behavior Ignore the `_SUCCESS` marker file and run the job without an exception.
Version:
- OS: Windows
- Python version: 3.10
- OpenMetadata version: 1.3.3
- OpenMetadata deploy: https://docs.open-metadata.org/v1.3.x/quick-start/local-docker-deployment
Also, I see that the S3 container shows wrong stats: they look like stats for the whole bucket, not for the table container itself.
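To illustrate what I would expect for the stats, here is a minimal sketch (hypothetical helper, not OpenMetadata code) that aggregates size and file count only for objects under the container's `dataPath` prefix, instead of the whole bucket:

```python
def prefix_stats(objects, prefix):
    """Aggregate (file_count, total_size_bytes) for keys under a prefix.

    `objects` mimics the shape of entries returned by an S3 listing
    (each with "Key" and "Size"); in real code these would come from
    a paginated list_objects_v2 call scoped to Prefix=prefix.
    """
    matched = [o for o in objects if o["Key"].startswith(prefix)]
    return len(matched), sum(o["Size"] for o in matched)

# Example listing: only the sp_entity objects should be counted.
objects = [
    {"Key": "data/sp_entity/part-00000.snappy.parquet", "Size": 1024},
    {"Key": "data/sp_entity/_SUCCESS", "Size": 0},
    {"Key": "data/other/part-00000.snappy.parquet", "Size": 2048},
]
print(prefix_stats(objects, "data/sp_entity/"))  # (2, 1024)
```

Scoping the listing by prefix would keep the container stats independent of unrelated data in the same bucket.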
OpenMetadata code:
https://github.com/open-metadata/OpenMetadata/blob/89b083b6f260bbea3a8f2a2246c4fd9487baa433/ingestion/src/metadata/ingestion/source/storage/s3/metadata.py#L227-L232
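As a sketch of the behavior I would expect at the linked sample-file selection step, something like the following could skip Spark job marker files before inferring the schema (hypothetical helper names; the marker-file set is my assumption, not the connector's actual logic):

```python
# Assumption: names commonly written by Spark/Hadoop committers, not data files.
MARKER_FILES = {"_SUCCESS", "_committed", "_started"}

def pick_sample_file(keys):
    """Return the first key that looks like a real data file, or None.

    Skips known marker files and hidden files (leading "_" or "."),
    which are zero-byte or non-data and break Parquet schema inference.
    """
    for key in keys:
        name = key.rsplit("/", 1)[-1]
        if name in MARKER_FILES or name.startswith(("_", ".")):
            continue
        return key
    return None

keys = [
    "data/sp_entity/_SUCCESS",
    "data/sp_entity/part-00000-764565f7.snappy.parquet",
]
print(pick_sample_file(keys))  # data/sp_entity/part-00000-764565f7.snappy.parquet
```

Filtering by size (skip zero-byte objects) would be an alternative or complementary guard.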
Screenshots:
Do we have any updates on that?
@pmbrull can you review this?