datahub
datahub copied to clipboard
Database profiling allow_deny_patterns : Column deny does not work.
Profiling allow_deny_patterns
do not seem to work for deny
in PostgreSQL database, data is profiled for the excluded/denied columns and shown on the stats page.
You may follow the steps below to reproduce the issue where column level deny is not working
Goal is to ingest the table and profile only some of the columns not all, specifically the columns named secret_value
and secret_value2
should not be profiled.
Table definition
CREATE TABLE IF NOT EXISTS public.test_to_exclude_table_columns_from_datahub
(
id integer NOT NULL,
val character varying(255) COLLATE pg_catalog."default" NOT NULL,
time_stamp timestamp without time zone NOT NULL DEFAULT clock_timestamp(),
secret_value character varying(255) COLLATE pg_catalog."default" NOT NULL,
secret_value2 character varying(255) COLLATE pg_catalog."default" NOT NULL,
CONSTRAINT test_to_exclude_table_columns_from_datahub_pkey PRIMARY KEY (id)
)
Populate Data
insert into public.test_to_exclude_table_columns_from_datahub
select
a.n
,cast(concat(cast('val-' as varchar), to_char(a.n, '099999')) as varchar(255))
,now()
,(concat('secret-001:', substr(md5(random()::text), 0, 25)))::varchar(255)
,(concat('secret-002:', substr(md5(random()::text), 0, 25)))::varchar(255)
from generate_series(20001,30000) as a(n)
Specification for allow_deny
Specification 1
profiling:
allow_deny_patterns:
allow:
- .*
deny:
- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value*'
ignoreCase: True
alphabet: '[A-Za-z0-9 .-]'
OR
Specification 2
profiling:
allow_deny_patterns:
allow:
- .*
deny:
- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'
ignoreCase: True
alphabet: '[A-Za-z0-9 .-]'
Expected behaviour
Expected to see table profiled for all but the two columns secret_value
and secret_value2
.
Observed behaviour Data Hub shows data for the two excluded/denied columns as shown in the screenshots.
Screenshot 1 corresponds to Specification 1
Screenshot 2 corresponds to Specification 2
FYI https://github.com/datahub-project/datahub/blob/33339e2c8933bb3b989b4052ed1b3d308624f2a0/metadata-ingestion/src/datahub/ingestion/source/ge_profiling_config.py#L80 Thanks!
Looks like this is the same issue as #5589
Thanks @hsheth2 , I created two bugs in case table and column level denies had different methods for processing.
https://github.com/datahub-project/datahub/issues/5589 is for table level denies. https://github.com/datahub-project/datahub/issues/5590 is for column level denies.
Having said that I will try with the patterns you mentioned in https://github.com/datahub-project/datahub/issues/5589.
Cheers, Raghu
Hello @hsheth2,
Some progress, the following seems to be working...
profile_pattern:
allow:
- .*
deny:
- 'dvdrental.public.test_to_exclude_table_from_datahub_[3-5]'
#- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'
However, specifying either or both of ignoreCase
and alphabet
as below causes a parse error...
profile_pattern:
allow:
- .*
deny:
- 'dvdrental.public.test_to_exclude_table_from_datahub_[3-5]'
#- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'
ignoreCase: True
alphabet: '[A-Za-z0-9 .-]'
The error message is as below (regardless of either or both are used)
File "/usr/local/lib/python3.7/dist-packages/yaml/parser.py", line 393, in parse_block_sequence_entry
"expected <block end>, but found %r" % token.id, token.start_mark)
yaml.parser.ParserError: while parsing a block collection
in "<file>", line 32, column 9
expected <block end>, but found '?'
in "<file>", line 36, column 9
It would make sense to profile data without case-sensitivity.
Looking forward to your feedback. TIA!
@rsontam-tc that looks like an indentation error with yaml - I think the ingoreCase and alphabet lines are indented one level too deep (they should be at the same level as allow/deny). As a side note, you shouldn't need to specify alphabet manually - we ought to hide it from the config
OK, let me try that option, TY
Hi Harshal, specifying ignoreCase
at the same level as profile_pattern
as below...
profile_pattern:
allow:
- .*
deny: []
ignoreCase: true
... causes the following error:
[2022-08-18 14:45:14,084] ERROR {***.ingestion.run.pipeline:126} - 1 validation error for PostgresConfig
ignoreCase
extra fields not permitted (type=value_error.extra)
[2022-08-18 14:45:14,084] INFO {***.cli.ingest_cli:115} - Starting metadata ingestion
[2022-08-18 14:45:14,084] INFO {***.cli.ingest_cli:133} - Finished metadata pipeline
Failed to configure source (postgres) due to 1 validation error for PostgresConfig
ignoreCase
extra fields not permitted (type=value_error.extra)
Error: Process completed with exit code 1.
Cheers, Raghu
I think the indentation is still slightly off - it should look like this:
profile_pattern:
allow:
- .*
deny: []
ignoreCase: true
Hi @rsontam-tc Is this still an issue? if not, will close it after a few days of inactivity.
This can be closed @mohdsiddique. It took me some time and attention to compare the changes and @hsheth2's recommendation was bang on. I will go ahead and mark it closed. Thanks.