OpenMetadata
OpenMetadata copied to clipboard
2 issues with the metadata integration job for an MSSQL database : [utf-16-le' codec can't decode bytes] + [Multiple columns tagged with primary key constraints]
Hi,
I just installed OpenMetadata v1.1.7 with a Docker container and I'm having an issue with the metadata integration job for my MSSQL database.
1 ) utf-16-le' codec can't decode bytes :
e2dee958a89c *** Found local files: *** * /opt/airflow/logs/dag_id=bdGes_foo_TEST_metadata_sR1SBJPV/run_id=manual__2023-10-07T07:17:35+00:00/task_id=ingestion_task/attempt=1.log *** Could not read served logs: Parent instance <TaskInstance at 0x7f89b4a617f0> is not bound to a Session; lazy load operation of attribute 'trigger' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3) [2023-10-07T07:17:36.402+0000] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: bdGes_foo_TEST_metadata_sR1SBJPV.ingestion_task manual__2023-10-07T07:17:35+00:00 [queued]> [2023-10-07T07:17:36.579+0000] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: bdGes_foo_TEST_metadata_sR1SBJPV.ingestion_task manual__2023-10-07T07:17:35+00:00 [queued]> [2023-10-07T07:17:36.579+0000] {taskinstance.py:1308} INFO - Starting attempt 1 of 1 [2023-10-07T07:17:36.593+0000] {taskinstance.py:1327} INFO - Executing <Task(PythonOperator): ingestion_task> on 2023-10-07 07:17:35+00:00 [2023-10-07T07:17:36.601+0000] {standard_task_runner.py:57} INFO - Started process 162537 to run task [2023-10-07T07:17:36.607+0000] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'bdGes_foo_TEST_metadata_sR1SBJPV', 'ingestion_task', 'manual__2023-10-07T07:17:35+00:00', '--job-id', '99', '--raw', '--subdir', 'DAGS_FOLDER/bdGes_foo_TEST_metadata_sR1SBJPV.py', '--cfg-path', '/tmp/tmpan1meypy'] [2023-10-07T07:17:36.608+0000] {standard_task_runner.py:85} INFO - Job 99: Subtask ingestion_task [2023-10-07T07:17:36.681+0000] {task_command.py:410} INFO - Running <TaskInstance: bdGes_foo_TEST_metadata_sR1SBJPV.ingestion_task manual__2023-10-07T07:17:35+00:00 [running]> on host e2dee958a89c [2023-10-07T07:17:36.797+0000] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='admin' AIRFLOW_CTX_DAG_ID='bdGes_foo_TEST_metadata_sR1SBJPV' AIRFLOW_CTX_TASK_ID='ingestion_task' AIRFLOW_CTX_EXECUTION_DATE='2023-10-07T07:17:35+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-10-07T07:17:35+00:00' [2023-10-07T07:17:36.827+0000] {workflow.py:115} INFO - Service type:ServiceType.Database,mssql configured [2023-10-07T07:17:36.940+0000] {test_connections.py:215} INFO - Test connection results: [2023-10-07T07:17:36.940+0000] {test_connections.py:216} INFO - failed=[] success=["'CheckAccess': Pass", "'GetDatabases': Pass", "'GetSchemas': Pass", "'GetTables': Pass", "'GetViews': Pass", "'GetQueries': Pass"] warning=[] [2023-10-07T07:17:37.236+0000] {common_db_source.py:117} INFO - Ingesting from database: bdGes_foo_TEST [2023-10-07T07:17:38.231+0000] {common_db_source.py:424} WARNING - Unexpected exception to yield table [agessa_mem]: (builtins.UnicodeDecodeError) 'utf-16-le' codec can't decode bytes in position 14-15: illegal encoding (Background on this error at: https://sqlalche.me/e/14/dbapi)
--> The table is empty.
2 ) Multiple columns tagged with primary key constraints :
[2023-10-07T07:17:38.777+0000] {metadata_rest.py:147} WARNING - Failed to ingest CreateTableRequest [Art_notes] due to api request failure: Multiple columns tagged with primary key constraints
---> https://stackoverflow.com/questions/18357675/unicodedecodeerror-unexpected-end-of-data
hi @AnZelie, thanks for the details. A couple of questions if I may:
- Can you please share the schema of the table
agessa_mem? Let's see if we can reproduce the error - For the table
Art_notes, do you have more than one PK? (maybe a multicol PK?)
It's strange, because if I recreate the table with a different name, I don't get any errors on the new table.
CREATE TABLE [dbo].[agessa_mem]( [Numero] [int] IDENTITY(1,1) NOT NULL, [zzz] [nvarchar](50) NULL, [code_edit] [nvarchar](50) NULL, [agessa1] [int] NULL, [csg1] [int] NULL, [unpourc1] [int] NULL, [ret_total1] [int] NULL, [rds1] [int] NULL, [agessa2] [int] NULL, [agessa3] [int] NULL, [agessa4] [int] NULL, [csg2] [int] NULL, [csg3] [int] NULL, [csg4] [int] NULL, [rds2] [int] NULL, [rds3] [int] NULL, [rds4] [int] NULL, [unpourc2] [int] NULL, [unpourc3] [int] NULL, [unpourc4] [int] NULL, [ret_total2] [int] NULL, [ret_total3] [int] NULL, [ret_total4] [int] NULL, [groupe] [nvarchar](50) NULL, [annee] [nvarchar](50) NULL, [ages_recalc] [int] NULL, [csg_recalc] [int] NULL, [rds_recalc] [int] NULL, [unp_recalc] [int] NULL, [tot_recalc] [int] NULL, CONSTRAINT [aa_abcaaaaaagessa_mem_PK] PRIMARY KEY NONCLUSTERED ( [Numero] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY] ) ON [PRIMARY]
It seems that there isn't anything specific I've noticed regarding the primary key (PK). However, on my SQL server, several databases have the same table name with the same primary key name.
ALTER TABLE [dbo].[Art_notes] ADD CONSTRAINT [aaaaaArt_notes_PK] PRIMARY KEY NONCLUSTERED ( [Numero] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
NB : At the end of the create table script, i have this code :
EXEC sys.sp_addextendedproperty @name=N'AllowZeroLength', @value=N'False' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO
EXEC sys.sp_addextendedproperty @name=N'Attributes', @value=N'17' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO
EXEC sys.sp_addextendedproperty @name=N'CollatingOrder', @value=N'1036' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO
EXEC sys.sp_addextendedproperty @name=N'ColumnHidden', @value=N'False' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO
EXEC sys.sp_addextendedproperty @name=N'ColumnOrder', @value=N'0' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO
EXEC sys.sp_addextendedproperty @name=N'ColumnWidth', @value=N'-1' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO
EXEC sys.sp_addextendedproperty @name=N'DataUpdatable', @value=N'False' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO
EXEC sys.sp_addextendedproperty @name=N'GUID', @value=N'ᗷ녮䶀䠔ắ消॓㩥' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero'
Do you have an option to ignore extendedproperty in mssql ?
I had tried to ingest the well known database AdventureWorks2019 and I got 70 errors. Most of them have similar messages (for different tables): Failed to ingest CreateTableRequest [ErrorLog] due to api request failure: Multiple columns tagged with primary key constraints Then I had started the SQL Server Profiler to find the appropriate ingestion SQL and found this:
exec sp_executesql N'/* {"app": "OpenMetadata", "version": "1.3.1.0"} */
SELECT [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA], [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME], [INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME], [INFORMATION_SCHEMA].[COLUMNS].[IS_NULLABLE], [INFORMATION_SCHEMA].[COLUMNS].[DATA_TYPE], [INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION], [INFORMATION_SCHEMA].[COLUMNS].[CHARACTER_MAXIMUM_LENGTH], [INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_PRECISION], [INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_SCALE], [INFORMATION_SCHEMA].[COLUMNS].[COLUMN_DEFAULT], [INFORMATION_SCHEMA].[COLUMNS].[COLLATION_NAME], sys.computed_columns.definition, sys.computed_columns.is_persisted, sys.identity_columns.is_identity, CAST(sys.identity_columns.seed_value AS NUMERIC(38, 0)) AS seed_value, CAST(sys.identity_columns.increment_value AS NUMERIC(38, 0)) AS increment_value, CAST(sys.extended_properties.value AS NVARCHAR(4000)) AS comment
FROM [INFORMATION_SCHEMA].[COLUMNS] LEFT OUTER JOIN sys.computed_columns ON sys.computed_columns.object_id = object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] + CAST(@P1 AS NVARCHAR(max)) + [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND sys.computed_columns.name = ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT) LEFT OUTER JOIN sys.identity_columns ON sys.identity_columns.object_id = object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] + CAST(@P1 AS NVARCHAR(max)) + [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND sys.identity_columns.name = ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT) LEFT OUTER JOIN sys.columns AS columns_1 ON columns_1.object_id = object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] + CAST(@P1 AS NVARCHAR(max)) + [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND columns_1.name = ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT) LEFT OUTER JOIN sys.extended_properties ON sys.extended_properties.major_id = columns_1.object_id AND sys.extended_properties.minor_id = columns_1.column_id
WHERE [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME] = CAST(@P2 AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] = CAST(@P3 AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION]',N'@P1 NVARCHAR(MAX),@P2 NVARCHAR(MAX),@P3 NVARCHAR(MAX)',@P1=N'.',@P2=N'ErrorLog',@P3=N'dbo'
In formated style it look like this:
SELECT [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA],
[INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME],
[INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME],
[INFORMATION_SCHEMA].[COLUMNS].[IS_NULLABLE],
[INFORMATION_SCHEMA].[COLUMNS].[DATA_TYPE],
[INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION],
[INFORMATION_SCHEMA].[COLUMNS].[CHARACTER_MAXIMUM_LENGTH],
[INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_PRECISION],
[INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_SCALE],
[INFORMATION_SCHEMA].[COLUMNS].[COLUMN_DEFAULT],
[INFORMATION_SCHEMA].[COLUMNS].[COLLATION_NAME],
sys.computed_columns.definition,
sys.computed_columns.is_persisted,
sys.identity_columns.is_identity,
CAST(sys.identity_columns.seed_value AS NUMERIC(38, 0)) AS seed_value,
CAST(sys.identity_columns.increment_value AS NUMERIC(38, 0)) AS increment_value,
CAST(sys.extended_properties.value AS NVARCHAR(4000)) AS comment
FROM [INFORMATION_SCHEMA].[COLUMNS]
LEFT OUTER JOIN sys.computed_columns ON sys.computed_columns.object_id =
object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] +
CAST('.' AS NVARCHAR(max)) +
[INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND
sys.computed_columns.name =
([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT)
LEFT OUTER JOIN sys.identity_columns ON sys.identity_columns.object_id =
object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] +
CAST('.' AS NVARCHAR(max)) +
[INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND
sys.identity_columns.name =
([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT)
LEFT OUTER JOIN sys.columns AS columns_1 ON columns_1.object_id =
object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] +
CAST('.' AS NVARCHAR(max)) +
[INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND
columns_1.name =
([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT)
LEFT OUTER JOIN sys.extended_properties ON sys.extended_properties.major_id = columns_1.object_id AND
sys.extended_properties.minor_id = columns_1.column_id
WHERE [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME] = CAST('ErrorLog' AS NVARCHAR(max))
AND [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] = CAST('dbo' AS NVARCHAR(max))
ORDER BY [INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION]
I had executed it and got the following result (I hided some not interesting columns):
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | comment |
|---|---|---|---|
| dbo | ErrorLog | ErrorLogID | Primary key for ErrorLog records. |
| dbo | ErrorLog | ErrorLogID | Clustered index created by a primary key constraint. |
| dbo | ErrorLog | ErrorTime | The date and time at which the error occurred. |
| dbo | ErrorLog | UserName | The user who executed the batch in which the error occurred. |
| dbo | ErrorLog | ErrorNumber | The error number of the error that occurred. |
| dbo | ErrorLog | ErrorSeverity | The severity of the error that occurred. |
| dbo | ErrorLog | ErrorState | The state number of the error that occurred. |
| dbo | ErrorLog | ErrorProcedure | The name of the stored procedure or trigger where the error occurred. |
| dbo | ErrorLog | ErrorLine | The line number at which the error occurred. |
| dbo | ErrorLog | ErrorMessage | The message text of the error that occurred. |
As you can mention the column ErrorLogID presented twice. Then I has tried to explain the following query:
SELECT *
FROM sys.extended_properties
WHERE major_id = object_id('dbo.ErrorLog');
and I got:
| class | class_desc | major_id | minor_id | name | value |
|---|---|---|---|---|---|
| 1 | OBJECT_OR_COLUMN | 613577224 | 0 | MS_Description | Audit table tracking errors in the the AdventureWorks database that are caught by the CATCH block of a TRY...CATCH construct. Data is inserted by stored procedure dbo.uspLogError when it is executed from inside the CATCH block of a TRY...CATCH construct. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 1 | MS_Description | Primary key for ErrorLog records. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 2 | MS_Description | The date and time at which the error occurred. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 3 | MS_Description | The user who executed the batch in which the error occurred. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 4 | MS_Description | The error number of the error that occurred. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 5 | MS_Description | The severity of the error that occurred. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 6 | MS_Description | The state number of the error that occurred. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 7 | MS_Description | The name of the stored procedure or trigger where the error occurred. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 8 | MS_Description | The line number at which the error occurred. |
| 1 | OBJECT_OR_COLUMN | 613577224 | 9 | MS_Description | The message text of the error that occurred. |
| 7 | INDEX | 613577224 | 1 | MS_Description | Clustered index created by a primary key constraint. |
As you can mention we have two rows where minor_id = 1: the first one has class_desc = OBJECT_OR_COLUMN and another one has Index. It means that this column has two description: the first one for the column and another one for the Index.
I suppose that this query should be appended with filter filter class_desc = 'OBJECT_OR_COLUMN; in that part of query:
...
LEFT OUTER JOIN sys.extended_properties ON sys.extended_properties.major_id = columns_1.object_id AND
sys.extended_properties.minor_id = columns_1.column_id AND
sys.extended_properties.class_desc = 'OBJECT_OR_COLUMN' AND -- <-- Filter by description of column.
sys.extended_properties.name = 'MS_Description'
...
Moreover, I think that it is also good to filter by sys.extended_properties.name = 'MS_Description', because that view can contain not only MS_Description but also user defined values.
I had started to diagnose this with a DB I'm trialing the product against and had come to more or less the same conclusion as @scrappyCoco. I've got a draft PR for the sys.extended_properties code fix here: https://github.com/open-metadata/OpenMetadata/pull/15433
Its currently in draft as I am trying to work out how to put some tests around it.
@kchenery @scrappyCoco feel free to join the contributors channel in the slack workspace, we can help guide you if needed somewhere, also, Thanks for the contribution, really appreciate that🙏
Update on my fix. Was asked to resubmit the fix without some linter changes and that has been merged. See PR 15530.
This addresses the problem of multiple columns being tagged (e.g. through Primary Keys, Indexes etc). You can still add extended properties for those. I also added a small QOL fix I spotted so that MS_Description values for views are also imported (as views show under "tables" in OpenMetadata).
thanks @kchenery for the PR! Can we close this issue then?
@pmbrull not really my issue to make that call I'm afraid 😃 I'm not sure about the error around "utf-16-le' codec can't decode bytes"
If @AnZelie can give us a table definition that fails to be imported, I'll use that to work on a fix though.
hi @AnZelie since we'd need more information to replicate this, I'm closing the issue for now. If you can please give us further details, we can reopen and work on it. Thanks