OpenMetadata icon indicating copy to clipboard operation
OpenMetadata copied to clipboard

2 issues with the metadata integration job for an MSSQL database : [utf-16-le' codec can't decode bytes] + [Multiple columns tagged with primary key constraints]

Open AnZelie opened this issue 2 years ago • 4 comments
trafficstars

Hi,

I just installed OpenMetadata v1.1.7 with a Docker container and I'm having an issue with the metadata integration job for my MSSQL database.


1 ) utf-16-le' codec can't decode bytes :

e2dee958a89c *** Found local files: *** * /opt/airflow/logs/dag_id=bdGes_foo_TEST_metadata_sR1SBJPV/run_id=manual__2023-10-07T07:17:35+00:00/task_id=ingestion_task/attempt=1.log *** Could not read served logs: Parent instance <TaskInstance at 0x7f89b4a617f0> is not bound to a Session; lazy load operation of attribute 'trigger' cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3) [2023-10-07T07:17:36.402+0000] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: bdGes_foo_TEST_metadata_sR1SBJPV.ingestion_task manual__2023-10-07T07:17:35+00:00 [queued]> [2023-10-07T07:17:36.579+0000] {taskinstance.py:1103} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: bdGes_foo_TEST_metadata_sR1SBJPV.ingestion_task manual__2023-10-07T07:17:35+00:00 [queued]> [2023-10-07T07:17:36.579+0000] {taskinstance.py:1308} INFO - Starting attempt 1 of 1 [2023-10-07T07:17:36.593+0000] {taskinstance.py:1327} INFO - Executing <Task(PythonOperator): ingestion_task> on 2023-10-07 07:17:35+00:00 [2023-10-07T07:17:36.601+0000] {standard_task_runner.py:57} INFO - Started process 162537 to run task [2023-10-07T07:17:36.607+0000] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', 'bdGes_foo_TEST_metadata_sR1SBJPV', 'ingestion_task', 'manual__2023-10-07T07:17:35+00:00', '--job-id', '99', '--raw', '--subdir', 'DAGS_FOLDER/bdGes_foo_TEST_metadata_sR1SBJPV.py', '--cfg-path', '/tmp/tmpan1meypy'] [2023-10-07T07:17:36.608+0000] {standard_task_runner.py:85} INFO - Job 99: Subtask ingestion_task [2023-10-07T07:17:36.681+0000] {task_command.py:410} INFO - Running <TaskInstance: bdGes_foo_TEST_metadata_sR1SBJPV.ingestion_task manual__2023-10-07T07:17:35+00:00 [running]> on host e2dee958a89c [2023-10-07T07:17:36.797+0000] {taskinstance.py:1545} INFO - Exporting env vars: AIRFLOW_CTX_DAG_OWNER='admin' AIRFLOW_CTX_DAG_ID='bdGes_foo_TEST_metadata_sR1SBJPV' AIRFLOW_CTX_TASK_ID='ingestion_task' AIRFLOW_CTX_EXECUTION_DATE='2023-10-07T07:17:35+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='manual__2023-10-07T07:17:35+00:00' [2023-10-07T07:17:36.827+0000] {workflow.py:115} INFO - Service type:ServiceType.Database,mssql configured [2023-10-07T07:17:36.940+0000] {test_connections.py:215} INFO - Test connection results: [2023-10-07T07:17:36.940+0000] {test_connections.py:216} INFO - failed=[] success=["'CheckAccess': Pass", "'GetDatabases': Pass", "'GetSchemas': Pass", "'GetTables': Pass", "'GetViews': Pass", "'GetQueries': Pass"] warning=[] [2023-10-07T07:17:37.236+0000] {common_db_source.py:117} INFO - Ingesting from database: bdGes_foo_TEST [2023-10-07T07:17:38.231+0000] {common_db_source.py:424} WARNING - Unexpected exception to yield table [agessa_mem]: (builtins.UnicodeDecodeError) 'utf-16-le' codec can't decode bytes in position 14-15: illegal encoding (Background on this error at: https://sqlalche.me/e/14/dbapi)

image

--> The table is empty.


2 ) Multiple columns tagged with primary key constraints :

[2023-10-07T07:17:38.777+0000] {metadata_rest.py:147} WARNING - Failed to ingest CreateTableRequest [Art_notes] due to api request failure: Multiple columns tagged with primary key constraints

image


image

---> https://stackoverflow.com/questions/18357675/unicodedecodeerror-unexpected-end-of-data

AnZelie avatar Oct 07 '23 07:10 AnZelie

hi @AnZelie, thanks for the details. A couple of questions if I may:

  • Can you please share the schema of the table agessa_mem? Let's see if we can reproduce the error
  • For the table Art_notes, do you have more than one PK? (maybe a multicol PK?)

pmbrull avatar Oct 09 '23 05:10 pmbrull

It's strange, because if I recreate the table with a different name, I don't get any errors on the new table. CREATE TABLE [dbo].[agessa_mem]( [Numero] [int] IDENTITY(1,1) NOT NULL, [zzz] [nvarchar](50) NULL, [code_edit] [nvarchar](50) NULL, [agessa1] [int] NULL, [csg1] [int] NULL, [unpourc1] [int] NULL, [ret_total1] [int] NULL, [rds1] [int] NULL, [agessa2] [int] NULL, [agessa3] [int] NULL, [agessa4] [int] NULL, [csg2] [int] NULL, [csg3] [int] NULL, [csg4] [int] NULL, [rds2] [int] NULL, [rds3] [int] NULL, [rds4] [int] NULL, [unpourc2] [int] NULL, [unpourc3] [int] NULL, [unpourc4] [int] NULL, [ret_total2] [int] NULL, [ret_total3] [int] NULL, [ret_total4] [int] NULL, [groupe] [nvarchar](50) NULL, [annee] [nvarchar](50) NULL, [ages_recalc] [int] NULL, [csg_recalc] [int] NULL, [rds_recalc] [int] NULL, [unp_recalc] [int] NULL, [tot_recalc] [int] NULL, CONSTRAINT [aa_abcaaaaaagessa_mem_PK] PRIMARY KEY NONCLUSTERED ( [Numero] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY] ) ON [PRIMARY]

It seems that there isn't anything specific I've noticed regarding the primary key (PK). However, on my SQL server, several databases have the same table name with the same primary key name.

ALTER TABLE [dbo].[Art_notes] ADD CONSTRAINT [aaaaaArt_notes_PK] PRIMARY KEY NONCLUSTERED ( [Numero] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]

AnZelie avatar Oct 09 '23 08:10 AnZelie

NB : At the end of the create table script, i have this code :

EXEC sys.sp_addextendedproperty @name=N'AllowZeroLength', @value=N'False' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO

EXEC sys.sp_addextendedproperty @name=N'Attributes', @value=N'17' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO

EXEC sys.sp_addextendedproperty @name=N'CollatingOrder', @value=N'1036' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO

EXEC sys.sp_addextendedproperty @name=N'ColumnHidden', @value=N'False' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO

EXEC sys.sp_addextendedproperty @name=N'ColumnOrder', @value=N'0' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO

EXEC sys.sp_addextendedproperty @name=N'ColumnWidth', @value=N'-1' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO

EXEC sys.sp_addextendedproperty @name=N'DataUpdatable', @value=N'False' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero' GO

EXEC sys.sp_addextendedproperty @name=N'GUID', @value=N'ᗷ녮䶀䠔ắ消॓㩥' , @level0type=N'SCHEMA',@level0name=N'dbo', @level1type=N'TABLE',@level1name=N'agessa_mem', @level2type=N'COLUMN',@level2name=N'Numero'

AnZelie avatar Oct 09 '23 08:10 AnZelie

Do you have an option to ignore extendedproperty in mssql ?

AnZelie avatar Oct 09 '23 09:10 AnZelie

I had tried to ingest the well known database AdventureWorks2019 and I got 70 errors. Most of them have similar messages (for different tables): Failed to ingest CreateTableRequest [ErrorLog] due to api request failure: Multiple columns tagged with primary key constraints Then I had started the SQL Server Profiler to find the appropriate ingestion SQL and found this:

exec sp_executesql N'/* {"app": "OpenMetadata", "version": "1.3.1.0"} */
SELECT [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA], [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME], [INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME], [INFORMATION_SCHEMA].[COLUMNS].[IS_NULLABLE], [INFORMATION_SCHEMA].[COLUMNS].[DATA_TYPE], [INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION], [INFORMATION_SCHEMA].[COLUMNS].[CHARACTER_MAXIMUM_LENGTH], [INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_PRECISION], [INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_SCALE], [INFORMATION_SCHEMA].[COLUMNS].[COLUMN_DEFAULT], [INFORMATION_SCHEMA].[COLUMNS].[COLLATION_NAME], sys.computed_columns.definition, sys.computed_columns.is_persisted, sys.identity_columns.is_identity, CAST(sys.identity_columns.seed_value AS NUMERIC(38, 0)) AS seed_value, CAST(sys.identity_columns.increment_value AS NUMERIC(38, 0)) AS increment_value, CAST(sys.extended_properties.value AS NVARCHAR(4000)) AS comment 
FROM [INFORMATION_SCHEMA].[COLUMNS] LEFT OUTER JOIN sys.computed_columns ON sys.computed_columns.object_id = object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] + CAST(@P1 AS NVARCHAR(max)) + [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND sys.computed_columns.name = ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT) LEFT OUTER JOIN sys.identity_columns ON sys.identity_columns.object_id = object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] + CAST(@P1 AS NVARCHAR(max)) + [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND sys.identity_columns.name = ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT) LEFT OUTER JOIN sys.columns AS columns_1 ON columns_1.object_id = object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] + CAST(@P1 AS NVARCHAR(max)) + [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND columns_1.name = ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT) LEFT OUTER JOIN sys.extended_properties ON sys.extended_properties.major_id = columns_1.object_id AND sys.extended_properties.minor_id = columns_1.column_id 
WHERE [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME] = CAST(@P2 AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] = CAST(@P3 AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION]',N'@P1 NVARCHAR(MAX),@P2 NVARCHAR(MAX),@P3 NVARCHAR(MAX)',@P1=N'.',@P2=N'ErrorLog',@P3=N'dbo'

In formated style it look like this:

SELECT [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA],
       [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME],
       [INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME],
       [INFORMATION_SCHEMA].[COLUMNS].[IS_NULLABLE],
       [INFORMATION_SCHEMA].[COLUMNS].[DATA_TYPE],
       [INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION],
       [INFORMATION_SCHEMA].[COLUMNS].[CHARACTER_MAXIMUM_LENGTH],
       [INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_PRECISION],
       [INFORMATION_SCHEMA].[COLUMNS].[NUMERIC_SCALE],
       [INFORMATION_SCHEMA].[COLUMNS].[COLUMN_DEFAULT],
       [INFORMATION_SCHEMA].[COLUMNS].[COLLATION_NAME],
       sys.computed_columns.definition,
       sys.computed_columns.is_persisted,
       sys.identity_columns.is_identity,
       CAST(sys.identity_columns.seed_value AS NUMERIC(38, 0))      AS seed_value,
       CAST(sys.identity_columns.increment_value AS NUMERIC(38, 0)) AS increment_value,
       CAST(sys.extended_properties.value AS NVARCHAR(4000))        AS comment
FROM [INFORMATION_SCHEMA].[COLUMNS]
         LEFT OUTER JOIN sys.computed_columns ON sys.computed_columns.object_id =
                                                 object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] +
                                                           CAST('.' AS NVARCHAR(max)) +
                                                           [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND
                                                 sys.computed_columns.name =
                                                 ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT)
         LEFT OUTER JOIN sys.identity_columns ON sys.identity_columns.object_id =
                                                 object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] +
                                                           CAST('.' AS NVARCHAR(max)) +
                                                           [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND
                                                 sys.identity_columns.name =
                                                 ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT)
         LEFT OUTER JOIN sys.columns AS columns_1 ON columns_1.object_id =
                                                     object_id([INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] +
                                                               CAST('.' AS NVARCHAR(max)) +
                                                               [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME]) AND
                                                     columns_1.name =
                                                     ([INFORMATION_SCHEMA].[COLUMNS].[COLUMN_NAME] COLLATE DATABASE_DEFAULT)
         LEFT OUTER JOIN sys.extended_properties ON sys.extended_properties.major_id = columns_1.object_id AND
                                                    sys.extended_properties.minor_id = columns_1.column_id
WHERE [INFORMATION_SCHEMA].[COLUMNS].[TABLE_NAME] = CAST('ErrorLog' AS NVARCHAR(max))
  AND [INFORMATION_SCHEMA].[COLUMNS].[TABLE_SCHEMA] = CAST('dbo' AS NVARCHAR(max))
ORDER BY [INFORMATION_SCHEMA].[COLUMNS].[ORDINAL_POSITION]

I had executed it and got the following result (I hided some not interesting columns):

TABLE_SCHEMA TABLE_NAME COLUMN_NAME comment
dbo ErrorLog ErrorLogID Primary key for ErrorLog records.
dbo ErrorLog ErrorLogID Clustered index created by a primary key constraint.
dbo ErrorLog ErrorTime The date and time at which the error occurred.
dbo ErrorLog UserName The user who executed the batch in which the error occurred.
dbo ErrorLog ErrorNumber The error number of the error that occurred.
dbo ErrorLog ErrorSeverity The severity of the error that occurred.
dbo ErrorLog ErrorState The state number of the error that occurred.
dbo ErrorLog ErrorProcedure The name of the stored procedure or trigger where the error occurred.
dbo ErrorLog ErrorLine The line number at which the error occurred.
dbo ErrorLog ErrorMessage The message text of the error that occurred.

As you can mention the column ErrorLogID presented twice. Then I has tried to explain the following query:

SELECT *
FROM sys.extended_properties
WHERE major_id = object_id('dbo.ErrorLog');

and I got:

class class_desc major_id minor_id name value
1 OBJECT_OR_COLUMN 613577224 0 MS_Description Audit table tracking errors in the the AdventureWorks database that are caught by the CATCH block of a TRY...CATCH construct. Data is inserted by stored procedure dbo.uspLogError when it is executed from inside the CATCH block of a TRY...CATCH construct.
1 OBJECT_OR_COLUMN 613577224 1 MS_Description Primary key for ErrorLog records.
1 OBJECT_OR_COLUMN 613577224 2 MS_Description The date and time at which the error occurred.
1 OBJECT_OR_COLUMN 613577224 3 MS_Description The user who executed the batch in which the error occurred.
1 OBJECT_OR_COLUMN 613577224 4 MS_Description The error number of the error that occurred.
1 OBJECT_OR_COLUMN 613577224 5 MS_Description The severity of the error that occurred.
1 OBJECT_OR_COLUMN 613577224 6 MS_Description The state number of the error that occurred.
1 OBJECT_OR_COLUMN 613577224 7 MS_Description The name of the stored procedure or trigger where the error occurred.
1 OBJECT_OR_COLUMN 613577224 8 MS_Description The line number at which the error occurred.
1 OBJECT_OR_COLUMN 613577224 9 MS_Description The message text of the error that occurred.
7 INDEX 613577224 1 MS_Description Clustered index created by a primary key constraint.

As you can mention we have two rows where minor_id = 1: the first one has class_desc = OBJECT_OR_COLUMN and another one has Index. It means that this column has two description: the first one for the column and another one for the Index. image

I suppose that this query should be appended with filter filter class_desc = 'OBJECT_OR_COLUMN; in that part of query:

...
         LEFT OUTER JOIN sys.extended_properties ON sys.extended_properties.major_id = columns_1.object_id AND
                                                    sys.extended_properties.minor_id = columns_1.column_id AND
                                                    sys.extended_properties.class_desc = 'OBJECT_OR_COLUMN' AND -- <-- Filter by description of column.
                                                    sys.extended_properties.name = 'MS_Description'
...

Moreover, I think that it is also good to filter by sys.extended_properties.name = 'MS_Description', because that view can contain not only MS_Description but also user defined values.

scrappyCoco avatar Mar 03 '24 13:03 scrappyCoco

I had started to diagnose this with a DB I'm trialing the product against and had come to more or less the same conclusion as @scrappyCoco. I've got a draft PR for the sys.extended_properties code fix here: https://github.com/open-metadata/OpenMetadata/pull/15433

Its currently in draft as I am trying to work out how to put some tests around it.

kchenery avatar Mar 04 '24 06:03 kchenery

@kchenery @scrappyCoco feel free to join the contributors channel in the slack workspace, we can help guide you if needed somewhere, also, Thanks for the contribution, really appreciate that🙏

ayush-shah avatar Mar 04 '24 17:03 ayush-shah

Update on my fix. Was asked to resubmit the fix without some linter changes and that has been merged. See PR 15530.

This addresses the problem of multiple columns being tagged (e.g. through Primary Keys, Indexes etc). You can still add extended properties for those. I also added a small QOL fix I spotted so that MS_Description values for views are also imported (as views show under "tables" in OpenMetadata).

kchenery avatar Mar 12 '24 19:03 kchenery

thanks @kchenery for the PR! Can we close this issue then?

pmbrull avatar Mar 13 '24 05:03 pmbrull

@pmbrull not really my issue to make that call I'm afraid 😃 I'm not sure about the error around "utf-16-le' codec can't decode bytes"

If @AnZelie can give us a table definition that fails to be imported, I'll use that to work on a fix though.

kchenery avatar Mar 13 '24 21:03 kchenery

hi @AnZelie since we'd need more information to replicate this, I'm closing the issue for now. If you can please give us further details, we can reopen and work on it. Thanks

pmbrull avatar Apr 18 '24 10:04 pmbrull