datahub icon indicating copy to clipboard operation
datahub copied to clipboard

Database profiling allow_deny_patterns : Column deny does not work.

Open rsontam-tc opened this issue 2 years ago • 4 comments

Profiling allow_deny_patterns do not seem to work for deny in PostgreSQL database, data is profiled for the excluded/denied columns and shown on the stats page.

You may follow the steps below to reproduce the issue where column level deny is not working

Goal is to ingest the table and profile only some of the columns not all, specifically the columns named secret_value and secret_value2 should not be profiled.

Table definition

CREATE TABLE IF NOT EXISTS public.test_to_exclude_table_columns_from_datahub
(
    id integer NOT NULL,
    val character varying(255) COLLATE pg_catalog."default" NOT NULL,
    time_stamp timestamp without time zone NOT NULL DEFAULT clock_timestamp(),
    secret_value character varying(255) COLLATE pg_catalog."default" NOT NULL,
    secret_value2 character varying(255) COLLATE pg_catalog."default" NOT NULL,
    CONSTRAINT test_to_exclude_table_columns_from_datahub_pkey PRIMARY KEY (id)
)

Populate Data

insert into public.test_to_exclude_table_columns_from_datahub
select
	a.n
	,cast(concat(cast('val-' as varchar), to_char(a.n, '099999')) as varchar(255))
	,now()
	,(concat('secret-001:', substr(md5(random()::text), 0, 25)))::varchar(255)
	,(concat('secret-002:', substr(md5(random()::text), 0, 25)))::varchar(255)
from generate_series(20001,30000) as a(n)

Specification for allow_deny

Specification 1

 profiling:
      allow_deny_patterns:
        allow: 
         - .*
        deny:
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value*'
        ignoreCase: True
        alphabet: '[A-Za-z0-9 .-]'

OR

Specification 2

 profiling:
      allow_deny_patterns:
        allow: 
         - .*
        deny:
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
          - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'
        ignoreCase: True
        alphabet: '[A-Za-z0-9 .-]'

Expected behaviour Expected to see table profiled for all but the two columns secret_value and secret_value2.

Observed behaviour Data Hub shows data for the two excluded/denied columns as shown in the screenshots.

Screenshot 1 corresponds to Specification 1 image

Screenshot 2 corresponds to Specification 2 image

rsontam-tc avatar Aug 08 '22 21:08 rsontam-tc

FYI https://github.com/datahub-project/datahub/blob/33339e2c8933bb3b989b4052ed1b3d308624f2a0/metadata-ingestion/src/datahub/ingestion/source/ge_profiling_config.py#L80 Thanks!

rsontam-tc avatar Aug 08 '22 21:08 rsontam-tc

Looks like this is the same issue as #5589

hsheth2 avatar Aug 08 '22 23:08 hsheth2

Thanks @hsheth2 , I created two bugs in case table and column level denies had different methods for processing.

https://github.com/datahub-project/datahub/issues/5589 is for table level denies. https://github.com/datahub-project/datahub/issues/5590 is for column level denies.

Having said that I will try with the patterns you mentioned in https://github.com/datahub-project/datahub/issues/5589.

Cheers, Raghu

rsontam-tc avatar Aug 09 '22 00:08 rsontam-tc

Hello @hsheth2,

Some progress, the following seems to be working...

    profile_pattern:
      allow:
        - .*
      deny:
        - 'dvdrental.public.test_to_exclude_table_from_datahub_[3-5]'
        #- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
        - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'    

However, specifying either or both of ignoreCase and alphabet as below causes a parse error...

    profile_pattern:
      allow:
        - .*
      deny:
        - 'dvdrental.public.test_to_exclude_table_from_datahub_[3-5]'
        #- 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value'
        - 'dvdrental.public.test_to_exclude_table_columns_from_datahub.secret_value2'     
        ignoreCase: True
        alphabet: '[A-Za-z0-9 .-]'

The error message is as below (regardless of either or both are used)

  File "/usr/local/lib/python3.7/dist-packages/yaml/parser.py", line 393, in parse_block_sequence_entry
    "expected <block end>, but found %r" % token.id, token.start_mark)
yaml.parser.ParserError: while parsing a block collection
  in "<file>", line 32, column 9
expected <block end>, but found '?'
  in "<file>", line 36, column 9

It would make sense to profile data without case-sensitivity.

Looking forward to your feedback. TIA!

rsontam-tc avatar Aug 09 '22 14:08 rsontam-tc

@rsontam-tc that looks like an indentation error with yaml - I think the ingoreCase and alphabet lines are indented one level too deep (they should be at the same level as allow/deny). As a side note, you shouldn't need to specify alphabet manually - we ought to hide it from the config

hsheth2 avatar Aug 11 '22 05:08 hsheth2

OK, let me try that option, TY

rsontam-tc avatar Aug 11 '22 15:08 rsontam-tc

Hi Harshal, specifying ignoreCase at the same level as profile_pattern as below...

    profile_pattern:
      allow:
        - .*
      deny: []
    ignoreCase: true

... causes the following error:

[2022-08-18 14:45:14,084] ERROR    {***.ingestion.run.pipeline:126} - 1 validation error for PostgresConfig
ignoreCase
  extra fields not permitted (type=value_error.extra)
[2022-08-18 14:45:14,084] INFO     {***.cli.ingest_cli:115} - Starting metadata ingestion
[2022-08-18 14:45:14,084] INFO     {***.cli.ingest_cli:133} - Finished metadata pipeline
Failed to configure source (postgres) due to 1 validation error for PostgresConfig
ignoreCase
  extra fields not permitted (type=value_error.extra)
Error: Process completed with exit code 1.

Cheers, Raghu

rsontam-tc avatar Aug 18 '22 15:08 rsontam-tc

I think the indentation is still slightly off - it should look like this:

    profile_pattern:
      allow:
        - .*
      deny: []
      ignoreCase: true

hsheth2 avatar Aug 18 '22 15:08 hsheth2

Hi @rsontam-tc Is this still an issue? if not, will close it after a few days of inactivity.

siddiquebagwan avatar Sep 02 '22 12:09 siddiquebagwan

This can be closed @mohdsiddique. It took me some time and attention to compare the changes and @hsheth2's recommendation was bang on. I will go ahead and mark it closed. Thanks.

rsontam-tc avatar Sep 02 '22 14:09 rsontam-tc